Run Command =========== The ``run`` command executes the full machine learning pipeline with nested cross-validation for antimicrobial resistance prediction. Usage ----- .. code-block:: bash respredai run --config [options] Options ------- Required ~~~~~~~~ - ``--config, -c`` - Path to the configuration file (INI format) - Must be a valid, readable file - See `Configuration File`_ section for details Optional ~~~~~~~~ - ``--quiet, -q`` - Suppress banner and progress output - Does not suppress error messages or logs CLI Overrides ~~~~~~~~~~~~~ Override configuration file parameters without editing the file: - ``--models, -m`` - Override models (comma-separated) - Example: ``--models LR,RF,XGB`` - ``--targets, -t`` - Override targets (comma-separated) - Example: ``--targets Target1,Target2`` - ``--output, -o`` - Override output folder - Example: ``--output ./new_results/`` - ``--seed, -s`` - Override random seed - Example: ``--seed 123`` **Examples with overrides:** .. code-block:: bash # Run with different models respredai run --config my_config.ini --models LR,RF # Run only specific targets with a different output folder respredai run --config my_config.ini --targets Target1 --output ./experiment1/ # Quick experiment with different seed respredai run --config my_config.ini --seed 42 --quiet Configuration File ------------------ The configuration file uses INI format with the following sections: [Data] Section ~~~~~~~~~~~~~~ Defines the input data and features. .. code-block:: ini [Data] data_path = ./data/my_data.csv targets = Target1,Target2,Target3 continuous_features = Age,Weight,Temperature **Parameters:** - ``data_path`` - Path to CSV file containing the dataset - Must include all features and target columns - First column is assumed to be the sample ID - ``targets`` - Comma-separated list of target column names - Each target will be trained separately - Must exist in the CSV file - ``continuous_features`` - Comma-separated list of continuous feature names - These features will be scaled using StandardScaler - All other features are treated as categorical and one-hot encoded - ``group_column`` (optional) - Column name for grouping related samples - Use when you have multiple samples from the same patient/subject - Prevents data leakage by keeping all samples from the same group in the same fold - Enables ``StratifiedGroupKFold`` for both outer and inner cross-validation (if not specified, standard ``StratifiedKFold`` is used) - See details in :doc:`create-config-command` [Pipeline] Section ~~~~~~~~~~~~~~~~~~ Controls the machine learning pipeline configuration. .. code-block:: ini [Pipeline] models = LR,RF,XGB,CatBoost outer_folds = 5 inner_folds = 3 outer_cv_repeats = 1 calibrate_threshold = false threshold_method = auto calibrate_probabilities = false probability_calibration_method = sigmoid probability_calibration_cv = 5 **Parameters:** - ``models`` - Comma-separated list of models to train - Available models: ``LR``, ``MLP``, ``XGB``, ``RF``, ``CatBoost``, ``TabPFN``, ``RBF_SVC``, ``Linear_SVC``, ``KNN`` - Use ``respredai list-models`` to see all available models with descriptions - ``outer_folds`` - Number of folds for outer cross-validation - Used for model evaluation - ``inner_folds`` - Number of folds for inner cross-validation - Used for hyperparameter tuning with GridSearchCV - ``calibrate_threshold`` - Enable decision threshold optimization (optional, default: ``false``) - ``true``: Optimize threshold using Youden's J statistic (Sensitivity + Specificity - 1) - ``false``: Use default threshold of 0.5 - Threshold optimization uses ``inner_folds`` for cross-validation - Hyperparameters are tuned first (optimizing ROC-AUC), then threshold is optimized - ``threshold_method`` - Method for threshold optimization (optional, default: ``auto``) - ``auto``: Automatically choose based on sample size (OOF if n < 1000, CV otherwise) - ``oof``: Out-of-fold predictions method - aggregates predictions from all CV folds into a single set, then finds one global threshold maximizing Youden's J across all concatenated samples - ``cv``: TunedThresholdClassifierCV method - calculates optimal threshold separately for each CV fold, then aggregates (averages) the fold-specific thresholds - **Key difference**: ``oof`` finds one threshold on all concatenated OOF predictions (global optimization), while ``cv`` finds per-fold thresholds then averages them (fold-wise optimization then aggregation) - Only used when ``calibrate_threshold = true`` - ``outer_cv_repeats`` - Number of repetitions for outer cross-validation (optional, default: ``1``) - ``1``: Standard (non-repeated) cross-validation - ``>1``: Repeated stratified cross-validation with different random shuffles - Provides more robust performance estimates by averaging over multiple CV runs - ``calibrate_probabilities`` - Enable post-hoc probability calibration (optional, default: ``false``) - ``true``: Apply CalibratedClassifierCV to the best estimator from GridSearchCV - ``false``: Use uncalibrated probability predictions - Applied after hyper-parameters tuning and before threshold tuning - ``probability_calibration_method`` - Method for probability calibration (optional, default: ``sigmoid``) - ``sigmoid``: Platt scaling - fits a logistic regression on the classifier outputs - ``isotonic``: Isotonic regression - non-parametric, monotonic transformation - Only used when ``calibrate_probabilities = true`` - ``probability_calibration_cv`` - Number of folds for probability calibration (optional, default: ``5``) - CV folds used internally by CalibratedClassifierCV - Must be at least 2 - Only used when ``calibrate_probabilities = true`` [Reproducibility] Section ~~~~~~~~~~~~~~~~~~~~~~~~~ Ensures reproducible results. .. code-block:: ini [Reproducibility] seed = 42 **Parameters:** - ``seed`` - Random seed for reproducibility - Same seed ensures identical results across runs - Affects data splitting and model initialization [Log] Section ~~~~~~~~~~~~~ Controls logging behavior. .. code-block:: ini [Log] verbosity = 1 log_basename = respredai.log **Parameters:** - ``verbosity`` - Logging level - ``0``: No logging to file - ``1``: Log major events (model start/end, target completion) - ``2``: Verbose logging (includes fold-level details) - ``log_basename`` - Name of the log file - Created in the output folder - Contains detailed execution information [Resources] Section ~~~~~~~~~~~~~~~~~~~ Controls computational resources. .. code-block:: ini [Resources] n_jobs = -1 **Parameters:** - ``n_jobs`` - Number of parallel jobs - ``-1``: Use all available CPU cores - ``1``: No parallelization - ``N``: Use N cores [ModelSaving] Section ~~~~~~~~~~~~~~~~~~~~~ Enables saving trained models for resumption. .. code-block:: ini [ModelSaving] enable = true compression = 3 **Parameters:** - ``enable`` - Enable saving trained models - ``true``: Save models after each fold (enables resumption) - ``false``: No model saving (faster but no resumption) - ``compression`` - Compression level for saved model files - Range: 0-9 - ``0``: No compression (fastest, largest files) - ``3``: Balanced compression (recommended) - ``9``: Maximum compression (slowest, smallest files) [Imputation] Section ~~~~~~~~~~~~~~~~~~~~ Controls missing data imputation (optional). .. code-block:: ini [Imputation] method = none strategy = mean n_neighbors = 5 estimator = bayesian_ridge **Parameters:** - ``method`` - Imputation method - ``none``: No imputation (default, requires complete data) - ``simple``: SimpleImputer from scikit-learn - ``knn``: KNNImputer for k-nearest neighbors imputation - ``iterative``: IterativeImputer (MissForest-style) - ``strategy`` - Strategy for SimpleImputer (only used when ``method = simple``) - ``mean``: Replace missing values with column mean (default) - ``median``: Replace with column median - ``most_frequent``: Replace with most frequent value - ``n_neighbors`` - Number of neighbors for KNNImputer (only used when ``method = knn``) - Default: ``5`` - ``estimator`` - Estimator for IterativeImputer (only used when ``method = iterative``) - ``bayesian_ridge``: BayesianRidge estimator (default) - ``random_forest``: RandomForestRegressor (MissForest-style) [Output] Section ~~~~~~~~~~~~~~~~ Specifies output location. .. code-block:: ini [Output] out_folder = ./output/ **Parameters:** - ``out_folder`` - Path to output directory - Will be created if it doesn't exist - Contains all results, metrics, and saved models Pipeline Workflow ----------------- The ``run`` command executes the following steps: 1. **Configuration Loading** - Parse and validate the configuration file 2. **Data Loading** - Read CSV and validate features/targets 3. **Preprocessing** - One-hot encode categorical features, prepare data 4. **Nested Cross-Validation** - For each model and target: - **Outer CV Loop**: Split data for evaluation - **Inner CV Loop**: Hyperparameter tuning with GridSearchCV - **Training**: Train best model on outer training fold - **Evaluation**: Test on outer test fold - **Save Models**: Save trained models and metrics (if enabled) 5. **Results Aggregation** - Calculate mean and std across folds 6. **Output Generation** - Save confusion matrices, metrics, and plots Output Files ------------ The pipeline generates the following output structure: :: output_folder/ ├── models/ # Trained models (if model saving enabled) │ ├── {Model}_{Target}_models.joblib # Saved models for resumption │ └── ... ├── metrics/ # Detailed metrics │ ├── {target}/ │ │ ├── {model}_metrics_detailed.csv # Comprehensive metrics with CI │ │ └── summary.csv # Summary across all models for this target │ └── summary_all.csv # Global summary across all models and targets ├── confusion_matrices/ # Confusion matrix heatmaps │ └── Confusion_matrix_{model}_{target}.png # One PNG per model-target combination ├── calibration/ # Calibration diagnostics │ └── reliability_curve_{model}_{target}.png # Reliability curves per fold + aggregate ├── report.html # Comprehensive HTML report ├── reproducibility.json # Reproducibility manifest └── respredai.log # Execution log (if verbosity > 0) Metrics Files ~~~~~~~~~~~~~ Each ``{model}_metrics_detailed.csv`` contains: - **Metric**: Name of the metric (Precision, Recall, F1, MCC, Balanced Acc, AUROC, VME, ME, Brier Score, ECE, MCE) - **Mean**: Mean value across folds - **Std**: Standard deviation across folds - **CI95_lower**: Lower bound of 95% confidence interval (bootstrap, 1,000 resamples) - **CI95_upper**: Upper bound of 95% confidence interval (bootstrap, 1,000 resamples) **Calibration Metrics** (always computed, independent of probability calibration setting): - **Brier Score**: Mean squared error of probability predictions (lower is better, range 0-1) - **ECE** (Expected Calibration Error): Weighted average of calibration error across probability bins - **MCE** (Maximum Calibration Error): Maximum calibration error across any probability bin Confusion Matrix Plots ~~~~~~~~~~~~~~~~~~~~~~ Each ``Confusion_matrix_{model}_{target}.png`` shows: - Normalized confusion matrix for a single model-target combination - Mean F1, MCC, and AUROC scores with standard deviations - Color-coded heatmap (0.0 = poor, 1.0 = perfect) HTML Report ~~~~~~~~~~~ The ``report.html`` file provides a comprehensive, self-contained summary: - **Metadata**: Configuration settings, data path, timestamp - **Framework Summary**: Pipeline parameters, models, targets, and calibration settings - **Results Tables**: Per-target metrics with 95% confidence intervals for each model - **Confusion Matrices**: Embedded visualizations in a responsive grid layout - **Calibration Diagnostics**: Brier Score, ECE, MCE metrics with 95% CIs, plus reliability curve plots The report can be opened in any web browser and shared without additional dependencies. Model Saving System ------------------- Each ``{Model}_{Target}_models.joblib`` file contains all data from the outer cross-validation in a single file: - **fold_models**: A list containing one trained model per outer CV fold - **fold_transformers**: A list containing one fitted transformer (scaler) per fold - **metrics**: All metrics (precision, recall, F1, MCC, AUROC, confusion matrices) for every fold - **completed_folds**: Number of completed folds - **timestamp**: When the file was saved For example, with ``outer_folds=5``, each joblib file will contain 5 trained models and their corresponding transformers and metrics. Examples -------- Basic Usage ~~~~~~~~~~~ .. code-block:: bash respredai run --config my_config.ini Quiet Mode (for scripts) ~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash respredai run --config my_config.ini --quiet See Also -------- - :doc:`train-command` - Train models on entire dataset for cross-dataset validation - :doc:`evaluate-command` - Apply trained models to new data - :doc:`feature-importance-command` - Extract and visualize feature importance - :doc:`validate-config-command` - Validate configuration files - :doc:`create-config-command` - How to create the configuration file