Create Config Command
=====================

The ``create-config`` command generates a template configuration file that you can customize for your data.

Usage
-----

.. code-block:: bash

   respredai create-config <output_path>

Options
-------

Required
~~~~~~~~

- ``output_path``

  - Path where the template configuration file will be created
  - Must end with ``.ini`` extension
  - Parent directory must exist or be creatable
  - File will be overwritten if it already exists

Description
-----------

This command creates a ready-to-use configuration template with all required sections pre-populated and inline comments explaining each parameter. The generated template follows the INI format required by the ``run`` command.

Generated Template
------------------

The command creates a file with the following structure:

.. code-block:: ini

   [Data]
   data_path = ./data/my_data.csv
   targets = Target1,Target2
   continuous_features = Feature1,Feature2

   [Pipeline]
   # Available models: LR, MLP, XGB, RF, CatBoost, TabPFN, RBF_SVC, Linear_SVC, KNN
   models = LR,XGB,RF
   outer_folds = 5
   inner_folds = 3
   outer_cv_repeats = 1
   calibrate_threshold = false
   threshold_method = auto
   calibrate_probabilities = false
   probability_calibration_method = sigmoid
   probability_calibration_cv = 5

   [Reproducibility]
   seed = 42

   [Log]
   # Verbosity levels: 0 = no log, 1 = basic logging, 2 = detailed logging
   verbosity = 1
   log_basename = respredai.log

   [Resources]
   # Number of parallel jobs (-1 uses all available cores)
   n_jobs = -1

   [ModelSaving]
   # Enable model saving for resuming interrupted runs
   enable = true
   # Compression level for saved models (1-9, higher = more compression but slower)
   compression = 3

   [Output]
   out_folder = ./output/

Customization Steps
-------------------

After generating the template, customize it for your data:

1. Update Data Section
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [Data]
   data_path = ./path/to/your/data.csv
   targets = AntibioticA,AntibioticB
   continuous_features = Feature1,Feature3,Feature4
   # group_column = PatientID  # Optional

- **data_path**: Path to your CSV file
- **targets**: Comma-separated list of target columns (binary classification)
- **continuous_features**: Features to scale with StandardScaler (all others are one-hot encoded)
- **group_column** (optional): Column name for grouping multiple samples from the same patient/subject to prevent data leakage

2. Select Models
~~~~~~~~~~~~~~~~

.. code-block:: ini

   [Pipeline]
   models = LR,RF,XGB,CatBoost

Use ``respredai list-models`` to see all available models.

3. Configure Cross-Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   outer_folds = 5   # For model evaluation
   inner_folds = 3   # For hyperparameter tuning

- **outer_folds**: Number of folds for performance evaluation
- **inner_folds**: Number of folds for GridSearchCV hyperparameter tuning
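Together, these two settings describe a standard nested cross-validation scheme. The following is a minimal scikit-learn sketch of that scheme for orientation only: the estimator, parameter grid, and synthetic data are placeholders, not respredai's internal implementation.

.. code-block:: python

   # Illustrative nested CV: GridSearchCV handles the inner loop (hyperparameter
   # tuning), cross_val_score handles the outer loop (performance estimation).
   from sklearn.datasets import make_classification
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

   X, y = make_classification(n_samples=300, n_features=10, random_state=42)

   inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)  # inner_folds = 3
   outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # outer_folds = 5

   # Inner loop: pick hyperparameters on the training portion of each outer fold.
   tuned_model = GridSearchCV(
       LogisticRegression(max_iter=1000),
       param_grid={"C": [0.1, 1.0, 10.0]},
       cv=inner_cv,
   )

   # Outer loop: each fold retunes the model and evaluates it on held-out data.
   outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring="roc_auc")
   print(outer_scores.mean())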
4. Configure Threshold Optimization (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   calibrate_threshold = true
   threshold_method = auto
   threshold_objective = youden
   vme_cost = 1.0
   me_cost = 1.0

- **calibrate_threshold**: Enable decision threshold optimization

  - ``true``: Calibrate the threshold using the specified objective
  - ``false``: Use the default threshold of 0.5

- **threshold_method**: Method for threshold optimization (only used when ``calibrate_threshold = true``)

  - ``auto``: Automatically choose based on sample size (OOF if n < 1000, CV otherwise)
  - ``oof``: Out-of-fold predictions method - aggregates predictions from all CV folds into a single set, then finds one global threshold across all concatenated samples
  - ``cv``: TunedThresholdClassifierCV method - calculates the optimal threshold separately for each CV fold, then aggregates (averages) the fold-specific thresholds
  - **Key difference**: ``oof`` finds one threshold on all concatenated OOF predictions (global optimization), while ``cv`` finds per-fold thresholds and then averages them (fold-wise optimization followed by aggregation)

- **threshold_objective**: Objective function for threshold optimization

  - ``youden``: Maximize Youden's J statistic (Sensitivity + Specificity - 1) - balanced approach
  - ``f1``: Maximize F1 score - balances precision and recall
  - ``f2``: Maximize F2 score - prioritizes recall over precision (reduces VME at the potential cost of increased ME)
  - ``cost_sensitive``: Minimize weighted error cost using ``vme_cost`` and ``me_cost``

- **vme_cost** / **me_cost**: Cost weights for cost-sensitive threshold optimization

  - VME (Very Major Error): Predicted susceptible when actually resistant
  - ME (Major Error): Predicted resistant when actually susceptible
  - A higher ``vme_cost`` relative to ``me_cost`` will shift the threshold to reduce false susceptible predictions
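For intuition, the ``youden`` and ``cost_sensitive`` objectives can be written out as below. This is a hedged sketch of the idea, assuming the positive class (label 1) encodes resistance; it is not respredai's internal code, and the function names are illustrative.

.. code-block:: python

   # Sketch: selecting a single decision threshold from predicted probabilities.
   import numpy as np
   from sklearn.metrics import roc_curve


   def youden_threshold(y_true, y_prob):
       """Threshold maximizing Youden's J = sensitivity + specificity - 1 (= TPR - FPR)."""
       fpr, tpr, thresholds = roc_curve(y_true, y_prob)
       return thresholds[np.argmax(tpr - fpr)]


   def cost_sensitive_threshold(y_true, y_prob, vme_cost=1.0, me_cost=1.0):
       """Threshold minimizing vme_cost * (# VME) + me_cost * (# ME)."""
       best_t, best_cost = 0.5, np.inf
       for t in np.unique(y_prob):
           y_pred = (y_prob >= t).astype(int)
           vme = np.sum((y_true == 1) & (y_pred == 0))  # resistant predicted susceptible
           me = np.sum((y_true == 0) & (y_pred == 1))   # susceptible predicted resistant
           cost = vme_cost * vme + me_cost * me
           if cost < best_cost:
               best_t, best_cost = t, cost
       return best_t

In these terms, applying such a function once to the concatenated out-of-fold predictions corresponds to the ``oof`` method, while applying it within each fold and averaging the resulting thresholds corresponds to the ``cv`` method.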
5. Configure Repeated Cross-Validation (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   outer_cv_repeats = 3

- **outer_cv_repeats**: Number of times to repeat outer cross-validation (default: 1)

  - ``1``: Standard (non-repeated) cross-validation
  - ``>1``: Repeated CV with different random shuffles for more robust estimates
  - Total iterations = ``outer_folds`` × ``outer_cv_repeats``
  - Example: 5 folds × 3 repeats = 15 total train/test iterations

6. Configure Probability Calibration (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   calibrate_probabilities = true
   probability_calibration_method = sigmoid
   probability_calibration_cv = 5

- **calibrate_probabilities**: Enable post-hoc probability calibration

  - ``true``: Apply CalibratedClassifierCV to calibrate predicted probabilities
  - ``false``: Use uncalibrated probabilities (default)
  - Applied after hyperparameter tuning and before threshold tuning

- **probability_calibration_method**: Calibration method

  - ``sigmoid``: Platt scaling - fits a logistic regression (default, works well for most cases)
  - ``isotonic``: Isotonic regression - non-parametric (requires more data)

- **probability_calibration_cv**: CV folds for calibration (default: 5)

  - Internal cross-validation used by CalibratedClassifierCV
  - Must be at least 2

**Note**: Calibration diagnostics (Brier Score, ECE, MCE, reliability curves) are always computed regardless of this setting.

7. Configure Uncertainty Quantification (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [Uncertainty]
   margin = 0.1

- **margin**: Margin around the decision threshold for flagging uncertain predictions (0-0.5)

  - Predictions with a probability within ``margin`` of the threshold are flagged as uncertain
  - Default: 0.1
  - Uncertainty scores and flags are included in the evaluation output

- **Uncertainty score computation**:

  .. code-block:: text

     distance = |probability - threshold|
     max_distance = max(threshold, 1 - threshold)
     uncertainty = 1 - (distance / max_distance)
     is_uncertain = distance < margin

  - The score ranges from 0 (confident, at the probability extremes) to 1 (uncertain, at the threshold)
  - When the threshold is calibrated, uncertainty is computed relative to the calibrated threshold

8. Configure Preprocessing (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [Preprocessing]
   ohe_min_frequency = 0.05

- **ohe_min_frequency**: Minimum frequency for categorical values in OneHotEncoder

  - Categories appearing below this threshold are grouped into an "infrequent" category
  - Values in (0, 1): proportion of samples (e.g., 0.05 = at least 5% of samples)
  - Values >= 1: absolute count (e.g., 10 = at least 10 occurrences)
  - Omit or comment out to disable (keep all categories)
  - Useful for reducing noise from rare categorical values and preventing overfitting

9. Adjust Resources
~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [Resources]
   n_jobs = -1   # Use all cores

- ``-1``: Use all available CPU cores
- ``1``: No parallelization (useful for debugging)
- ``N``: Use N cores

10. Configure Model Saving
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [ModelSaving]
   enable = true
   compression = 3

- **enable**: Set to ``true`` to save the trained model from every fold
- **compression**: 0-9 (0 = no compression, 3 = balanced, 9 = maximum)

11. Set Output Location
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: ini

   [Output]
   out_folder = ./results/

The folder will be created if it doesn't exist.

See Also
--------

- :doc:`run-command` - Execute the nested CV pipeline
- :doc:`train-command` - Train models on the entire dataset for cross-dataset validation
- :doc:`validate-config-command` - Validate configuration before running
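The generated template is plain INI, so it can also be inspected or adjusted programmatically before running. A minimal sketch using Python's standard ``configparser`` (the ``config.ini`` path is only an example, not a fixed name):

.. code-block:: python

   # Read the generated template and inspect a few parsed values.
   import configparser

   cfg = configparser.ConfigParser()
   cfg.read("config.ini")

   models = [m.strip() for m in cfg["Pipeline"]["models"].split(",")]
   print(models)                                   # e.g. ['LR', 'XGB', 'RF']
   print(cfg["Pipeline"].getint("outer_folds"))    # e.g. 5
   print(cfg["ModelSaving"].getboolean("enable"))  # e.g. True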