Run Command#

The run command executes the full machine learning pipeline with nested cross-validation for antimicrobial resistance prediction.

Usage#

respredai run --config <path_to_config.ini> [options]

Options#

Required#

  • --config, -c - Path to the configuration file (INI format)

Optional#

  • --quiet, -q - Suppress banner and progress output

    • Does not suppress error messages or logs

CLI Overrides#

Override configuration file parameters without editing the file:

  • --models, -m - Override models (comma-separated)

    • Example: --models LR,RF,XGB

  • --targets, -t - Override targets (comma-separated)

    • Example: --targets Target1,Target2

  • --output, -o - Override output folder

    • Example: --output ./new_results/

  • --seed, -s - Override random seed

    • Example: --seed 123

Examples with overrides:

# Run with different models
respredai run --config my_config.ini --models LR,RF

# Run only specific targets with a different output folder
respredai run --config my_config.ini --targets Target1 --output ./experiment1/

# Quick experiment with different seed
respredai run --config my_config.ini --seed 42 --quiet

Configuration File#

The configuration file uses INI format with the following sections:

[Data] Section#

Defines the input data and features.

[Data]
data_path = ./data/my_data.csv
targets = Target1,Target2,Target3
continuous_features = Age,Weight,Temperature

Parameters:

  • data_path - Path to CSV file containing the dataset

    • Must include all features and target columns

    • First column is assumed to be the sample ID

  • targets - Comma-separated list of target column names

    • A separate model is trained for each target

    • Must exist in the CSV file

  • continuous_features - Comma-separated list of continuous feature names

    • These features will be scaled using StandardScaler

    • All other features are treated as categorical and one-hot encoded

  • group_column (optional) - Column name for grouping related samples

    • Use when you have multiple samples from the same patient/subject

    • Prevents data leakage by keeping all samples from the same group in the same fold

    • Enables StratifiedGroupKFold for both outer and inner cross-validation (if not specified, standard StratifiedKFold is used)

    • See details in Create Config Command
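
To illustrate what group_column changes, here is a minimal scikit-learn sketch; the synthetic data and group labels stand in for the CSV and group column, and it is not the pipeline's actual code:

# Minimal sketch: with StratifiedGroupKFold, all rows sharing a group label
# land in the same fold, which is how group_column prevents leakage.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedGroupKFold, StratifiedKFold

X, y = make_classification(n_samples=200, random_state=42)
groups = np.repeat(np.arange(50), 4)   # e.g. 4 samples per hypothetical patient

cv_plain = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)         # no group_column
cv_grouped = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)  # group_column set

for train_idx, test_idx in cv_grouped.split(X, y, groups=groups):
    # No group appears on both the training and the test side of a split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

With plain StratifiedKFold, rows from the same group can end up on both sides of a split, which is exactly the leakage the group column is meant to prevent.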

[Pipeline] Section#

Controls the machine learning pipeline configuration.

[Pipeline]
models = LR,RF,XGB,CatBoost
outer_folds = 5
inner_folds = 3
outer_cv_repeats = 1
calibrate_threshold = false
threshold_method = auto
calibrate_probabilities = false
probability_calibration_method = sigmoid
probability_calibration_cv = 5

Parameters:

  • models - Comma-separated list of models to train

    • Available models: LR, MLP, XGB, RF, CatBoost, TabPFN, RBF_SVC, Linear_SVC, KNN

    • Use respredai list-models to see all available models with descriptions

  • outer_folds - Number of folds for outer cross-validation

    • Used for model evaluation

  • inner_folds - Number of folds for inner cross-validation

    • Used for hyperparameter tuning with GridSearchCV

  • calibrate_threshold - Enable decision threshold optimization (optional, default: false)

    • true: Optimize threshold using Youden’s J statistic (Sensitivity + Specificity - 1)

    • false: Use default threshold of 0.5

    • Threshold optimization uses inner_folds for cross-validation

    • Hyperparameters are tuned first (optimizing ROC-AUC), then the threshold is optimized

  • threshold_method - Method for threshold optimization (optional, default: auto)

    • auto: Automatically choose based on sample size (OOF if n < 1000, CV otherwise)

    • oof: Out-of-fold predictions method - aggregates predictions from all CV folds into a single set, then finds one global threshold maximizing Youden’s J across all concatenated samples

    • cv: TunedThresholdClassifierCV method - calculates optimal threshold separately for each CV fold, then aggregates (averages) the fold-specific thresholds

    • Key difference: oof finds one threshold on all concatenated OOF predictions (global optimization), while cv finds per-fold thresholds then averages them (fold-wise optimization then aggregation); see the first sketch after this parameter list

    • Only used when calibrate_threshold = true

  • outer_cv_repeats - Number of repetitions for outer cross-validation (optional, default: 1)

    • 1: Standard (non-repeated) cross-validation

    • >1: Repeated stratified cross-validation with different random shuffles

    • Provides more robust performance estimates by averaging over multiple CV runs

  • calibrate_probabilities - Enable post-hoc probability calibration (optional, default: false)

    • true: Apply CalibratedClassifierCV to the best estimator from GridSearchCV

    • false: Use uncalibrated probability predictions

    • Applied after hyperparameter tuning and before threshold tuning (see the second sketch after this parameter list)

  • probability_calibration_method - Method for probability calibration (optional, default: sigmoid)

    • sigmoid: Platt scaling - fits a logistic regression on the classifier outputs

    • isotonic: Isotonic regression - non-parametric, monotonic transformation

    • Only used when calibrate_probabilities = true

  • probability_calibration_cv - Number of folds for probability calibration (optional, default: 5)

    • CV folds used internally by CalibratedClassifierCV

    • Must be at least 2

    • Only used when calibrate_probabilities = true
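
The difference between the two threshold_method strategies can be sketched with plain scikit-learn. LogisticRegression and the synthetic data below are placeholders for the tuned estimator and real data, not the pipeline's internals:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, random_state=42)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)          # inner_folds

# "oof": pool out-of-fold probabilities, then pick the single threshold that
# maximizes Youden's J = sensitivity + specificity - 1 (= TPR - FPR).
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=cv, method="predict_proba")[:, 1]
fpr, tpr, thresholds = roc_curve(y, proba)
best_threshold = thresholds[np.argmax(tpr - fpr)]                        # one global threshold

# "cv": scikit-learn's TunedThresholdClassifierCV (>= 1.5) tunes a threshold per
# fold and aggregates the fold-specific values. Balanced accuracy is a monotone
# transform of Youden's J ((J + 1) / 2), so maximizing it selects the same threshold.
# from sklearn.model_selection import TunedThresholdClassifierCV
# clf = TunedThresholdClassifierCV(LogisticRegression(max_iter=1000),
#                                  scoring="balanced_accuracy", cv=cv).fit(X, y)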
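
Likewise, calibrate_probabilities corresponds conceptually to wrapping the inner search's best estimator in scikit-learn's CalibratedClassifierCV, as in this simplified sketch (the model, parameter grid, and data are placeholders, not the pipeline's actual search space):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 300]},
                    scoring="roc_auc", cv=3)                      # inner_folds, ROC-AUC
grid.fit(X_train, y_train)

calibrated = CalibratedClassifierCV(grid.best_estimator_,
                                    method="sigmoid",             # probability_calibration_method
                                    cv=5)                         # probability_calibration_cv
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]                    # calibrated probabilities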

[Reproducibility] Section#

Ensures reproducible results.

[Reproducibility]
seed = 42

Parameters:

  • seed - Random seed for reproducibility

    • Same seed ensures identical results across runs

    • Affects data splitting and model initialization

[Log] Section#

Controls logging behavior.

[Log]
verbosity = 1
log_basename = respredai.log

Parameters:

  • verbosity - Logging level

    • 0: No logging to file

    • 1: Log major events (model start/end, target completion)

    • 2: Verbose logging (includes fold-level details)

  • log_basename - Name of the log file

    • Created in the output folder

    • Contains detailed execution information

[Resources] Section#

Controls computational resources.

[Resources]
n_jobs = -1

Parameters:

  • n_jobs - Number of parallel jobs

    • -1: Use all available CPU cores

    • 1: No parallelization

    • N: Use N cores

[ModelSaving] Section#

Enables saving trained models for resumption.

[ModelSaving]
enable = true
compression = 3

Parameters:

  • enable - Enable saving trained models

    • true: Save models after each fold (enables resumption)

    • false: No model saving (faster but no resumption)

  • compression - Compression level for saved model files

    • Range: 0-9

    • 0: No compression (fastest, largest files)

    • 3: Balanced compression (recommended)

    • 9: Maximum compression (slowest, smallest files)

[Imputation] Section#

Controls missing data imputation (optional).

[Imputation]
method = none
strategy = mean
n_neighbors = 5
estimator = bayesian_ridge

Parameters:

  • method - Imputation method

    • none: No imputation (default, requires complete data)

    • simple: SimpleImputer from scikit-learn

    • knn: KNNImputer for k-nearest neighbors imputation

    • iterative: IterativeImputer for multivariate, model-based imputation (MissForest-style when combined with the random_forest estimator)

  • strategy - Strategy for SimpleImputer (only used when method = simple)

    • mean: Replace missing values with column mean (default)

    • median: Replace with column median

    • most_frequent: Replace with most frequent value

  • n_neighbors - Number of neighbors for KNNImputer (only used when method = knn)

    • Default: 5

  • estimator - Estimator for IterativeImputer (only used when method = iterative)

    • bayesian_ridge: BayesianRidge estimator (default)

    • random_forest: RandomForestRegressor (MissForest-style)
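
As a rough illustration of how these options map onto scikit-learn imputers, consider the sketch below; build_imputer is an illustrative helper, not the pipeline's actual factory code:

# Sketch: constructing an imputer from the [Imputation] options.
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required for IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

def build_imputer(method="none", strategy="mean", n_neighbors=5, estimator="bayesian_ridge"):
    if method == "none":
        return None                                  # data must already be complete
    if method == "simple":
        return SimpleImputer(strategy=strategy)      # mean / median / most_frequent
    if method == "knn":
        return KNNImputer(n_neighbors=n_neighbors)
    if method == "iterative":
        base = BayesianRidge() if estimator == "bayesian_ridge" else RandomForestRegressor()
        return IterativeImputer(estimator=base)      # MissForest-style with random_forest
    raise ValueError(f"unknown imputation method: {method}")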

[Output] Section#

Specifies output location.

[Output]
out_folder = ./output/

Parameters:

  • out_folder - Path to output directory

    • Will be created if it doesn’t exist

    • Contains all results, metrics, and saved models

Pipeline Workflow#

The run command executes the following steps:

  1. Configuration Loading - Parse and validate the configuration file

  2. Data Loading - Read CSV and validate features/targets

  3. Preprocessing - One-hot encode categorical features and scale continuous features

  4. Nested Cross-Validation - For each model and target (see the sketch after this list):

    • Outer CV Loop: Split data for evaluation

    • Inner CV Loop: Hyperparameter tuning with GridSearchCV

    • Training: Train best model on outer training fold

    • Evaluation: Test on outer test fold

    • Save Models: Save trained models and metrics (if enabled)

  5. Results Aggregation - Calculate mean and std across folds

  6. Output Generation - Save confusion matrices, metrics, and plots
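
Step 4 follows the standard nested cross-validation pattern. The simplified sketch below (placeholder model, parameter grid, and synthetic data) shows only the outer/inner structure; real runs additionally handle scaling, encoding, calibration, and model saving:

# Simplified sketch of nested CV: outer loop for evaluation, inner GridSearchCV
# for hyperparameter tuning on each outer training fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # outer_folds
scores = []

for train_idx, test_idx in outer_cv.split(X, y):
    inner = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={"C": [0.1, 1.0, 10.0]},
                         scoring="roc_auc", cv=3)                       # inner_folds
    inner.fit(X[train_idx], y[train_idx])                               # tune, then refit best model
    proba = inner.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], proba))                    # evaluate on outer test fold

print(f"AUROC: mean={sum(scores) / len(scores):.3f}")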

Output Files#

The pipeline generates the following output structure:

output_folder/
├── models/                                       # Trained models (if model saving enabled)
│   ├── {Model}_{Target}_models.joblib            # Saved models for resumption
│   └── ...
├── metrics/                                      # Detailed metrics
│   ├── {target}/
│   │   ├── {model}_metrics_detailed.csv          # Comprehensive metrics with CI
│   │   └── summary.csv                           # Summary across all models for this target
│   └── summary_all.csv                           # Global summary across all models and targets
├── confusion_matrices/                           # Confusion matrix heatmaps
│   └── Confusion_matrix_{model}_{target}.png     # One PNG per model-target combination
├── calibration/                                  # Calibration diagnostics
│   └── reliability_curve_{model}_{target}.png    # Reliability curves per fold + aggregate
├── report.html                                   # Comprehensive HTML report
├── reproducibility.json                          # Reproducibility manifest
└── respredai.log                                 # Execution log (if verbosity > 0)

Metrics Files#

Each {model}_metrics_detailed.csv contains:

  • Metric: Name of the metric (Precision, Recall, F1, MCC, Balanced Acc, AUROC, VME, ME, Brier Score, ECE, MCE)

  • Mean: Mean value across folds

  • Std: Standard deviation across folds

  • CI95_lower: Lower bound of 95% confidence interval (bootstrap, 1,000 resamples)

  • CI95_upper: Upper bound of 95% confidence interval (bootstrap, 1,000 resamples)
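
As a generic illustration only (the exact resampling unit used by the pipeline is not specified on this page), a percentile bootstrap with 1,000 resamples over per-fold scores can be sketched as follows; bootstrap_ci95 and the fold values are hypothetical:

import numpy as np

def bootstrap_ci95(values, n_resamples=1000, seed=42):
    # Percentile bootstrap: resample with replacement, take the mean each time,
    # then report the 2.5th and 97.5th percentiles of the resampled means.
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    boot_means = [rng.choice(values, size=len(values), replace=True).mean()
                  for _ in range(n_resamples)]
    return np.percentile(boot_means, [2.5, 97.5])

fold_f1 = [0.81, 0.78, 0.84, 0.80, 0.79]      # hypothetical per-fold F1 scores
lower, upper = bootstrap_ci95(fold_f1)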

Calibration Metrics (always computed, independent of probability calibration setting):

  • Brier Score: Mean squared error of probability predictions (lower is better, range 0-1)

  • ECE (Expected Calibration Error): Weighted average of calibration error across probability bins

  • MCE (Maximum Calibration Error): Maximum calibration error across any probability bin
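
For reference, these three metrics can be computed as in the sketch below; ece_mce is an illustrative helper, and the 10 equal-width probability bins are an assumption (the pipeline's binning is not documented on this page):

# Sketch: Brier score plus ECE/MCE from true labels and predicted probabilities.
# ECE = bin-size-weighted mean |mean predicted prob - observed frequency|;
# MCE = the maximum of that gap over the bins.
import numpy as np
from sklearn.metrics import brier_score_loss

def ece_mce(y_true, y_prob, n_bins=10):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    gaps, weights = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gaps.append(abs(y_prob[mask].mean() - y_true[mask].mean()))
            weights.append(mask.mean())
    gaps, weights = np.array(gaps), np.array(weights)
    return float(np.sum(weights * gaps)), float(gaps.max())

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.8, 0.7, 0.9, 0.4, 0.6, 0.2]
brier = brier_score_loss(y_true, y_prob)
ece, mce = ece_mce(y_true, y_prob)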

Confusion Matrix Plots#

Each Confusion_matrix_{model}_{target}.png shows:

  • Normalized confusion matrix for a single model-target combination

  • Mean F1, MCC, and AUROC scores with standard deviations

  • Color-coded heatmap (0.0 = poor, 1.0 = perfect)

HTML Report#

The report.html file provides a comprehensive, self-contained summary:

  • Metadata: Configuration settings, data path, timestamp

  • Framework Summary: Pipeline parameters, models, targets, and calibration settings

  • Results Tables: Per-target metrics with 95% confidence intervals for each model

  • Confusion Matrices: Embedded visualizations in a responsive grid layout

  • Calibration Diagnostics: Brier Score, ECE, MCE metrics with 95% CIs, plus reliability curve plots

The report can be opened in any web browser and shared without additional dependencies.

Model Saving System#

Each {Model}_{Target}_models.joblib file contains all data from the outer cross-validation in a single file:

  • fold_models: A list containing one trained model per outer CV fold

  • fold_transformers: A list containing one fitted transformer (scaler) per fold

  • metrics: All metrics (precision, recall, F1, MCC, AUROC, confusion matrices) for every fold

  • completed_folds: Number of completed folds

  • timestamp: When the file was saved

For example, with outer_folds=5, each joblib file will contain 5 trained models and their corresponding transformers and metrics.
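
A saved bundle can be inspected directly with joblib. The sketch below assumes the payload is a dict-like object with the keys listed above; the file name simply follows the documented {Model}_{Target}_models.joblib pattern and should be adjusted to your run:

# Sketch: loading a saved model bundle and reusing the per-fold estimators.
import joblib

bundle = joblib.load("./output/models/LR_Target1_models.joblib")
print(bundle["completed_folds"], bundle["timestamp"])

fold_models = bundle["fold_models"]                # one fitted estimator per outer fold
fold_transformers = bundle["fold_transformers"]    # matching fitted transformers (scalers)

# e.g. score new samples with the model/transformer from the first fold:
# X_new_scaled = fold_transformers[0].transform(X_new)
# proba = fold_models[0].predict_proba(X_new_scaled)[:, 1]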

Examples#

Basic Usage#

respredai run --config my_config.ini

Quiet Mode (for scripts)#

respredai run --config my_config.ini --quiet

See Also#