Generalizability¶

Version 0.3.0 introduces a first-class generalizability assessment that runs after the main pipeline. Three complementary cohort sources are supported:

Temporal generalizability - validate phenotypes derived on early data against later data via a fixed cutoff, a chronological fraction, or sliding/expanding rolling windows.
Multi-site generalizability - validate phenotypes across hospitals or centers via leave-one-site-out (scheme: logo, i.e. leave-one-group-out with the site as the group), holdout, or pairwise schemes.
External cohort CSVs - validate against one or more separate CSV files that share the derivation schema. Configured via generalizability.external_cohorts.
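As a rough illustration of the in-CSV partitioning ideas above, the following sketch splits a list of row dictionaries by a fixed cutoff, by a chronological fraction, and by leave-one-site-out. The function names are hypothetical and not part of the package; whether the real cutoff is inclusive and how ties are handled are assumptions made here for the example.

```python
from datetime import date


def split_by_cutoff(rows, time_key, cutoff):
    """Assumed convention: rows at or before the cutoff are derivation rows."""
    derivation = [r for r in rows if r[time_key] <= cutoff]
    validation = [r for r in rows if r[time_key] > cutoff]
    return derivation, validation


def split_by_fraction(rows, time_key, test_fraction):
    """Hold out the chronologically latest fraction of rows for validation."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_val = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_val], ordered[-n_val:]


def leave_one_site_out(rows, site_key):
    """Yield (held_out_site, derivation_rows, validation_rows) once per site."""
    for site in sorted({r[site_key] for r in rows}):
        derivation = [r for r in rows if r[site_key] != site]
        validation = [r for r in rows if r[site_key] == site]
        yield site, derivation, validation


# Toy data: 12 admissions in 2020 at centers A/B, 3 admissions in 2021 at center C.
rows = [
    {"admission_date": date(2020, m, 1), "center": "A" if m % 2 else "B"}
    for m in range(1, 13)
]
rows += [{"admission_date": date(2021, m, 1), "center": "C"} for m in (1, 2, 3)]

early, late = split_by_cutoff(rows, "admission_date", date(2020, 12, 31))
```

Rows with a missing time or site value would need to be filtered out before calling these helpers; the sketch does not handle them.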

Training scope¶

The generalizability.training_scope setting controls how the model used for in-CSV split metrics is fit:

"per_split" (default): for every in-CSV split (temporal or multi-site), a fresh DataPreprocessor and StepMix model are fit on derivation rows only and then applied to the validation rows. Each cohort report carries fit_mode: per_split. The pipeline’s full-cohort model is left untouched for descriptive analyses elsewhere in the report.
"global": in-CSV splits are scored by the pipeline’s full-cohort model. Each cohort report carries fit_mode: global. Faster, but appropriate only when the full-cohort model is the intended evaluation reference (e.g., descriptive transport check rather than a held-out validation claim).

For external cohorts (separate CSV files) the global model is always used: the pipeline never saw those rows during training, so the global-model path is the correct evaluation reference regardless of training_scope.
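The rule described in this section can be summarized in a few lines. The helper below is a hypothetical sketch, not the package's API; the cohort_source values ("in_csv", "external") are labels invented for the example.

```python
def resolve_fit_mode(cohort_source, training_scope="per_split"):
    """Return the fit_mode a cohort report would carry under the rules above.

    cohort_source is a hypothetical label: "in_csv" for temporal or
    multi-site splits, "external" for separate CSV files.
    """
    if training_scope not in ("per_split", "global"):
        raise ValueError(f"unknown training_scope: {training_scope!r}")
    if cohort_source == "external":
        # External cohorts are always scored by the global model.
        return "global"
    return training_scope
```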

Feature-selector scope¶

The feature selector is a separate concern from the model. By default (feature_selector_scope: auto) the per-split path also refits the feature selector on each split’s derivation rows when that is safe:

For unsupervised selectors (variance, correlation, mutual_info over inter-feature MI), the selector is refit on the derivation rows.
For supervised LASSO, the selector is refit when the target column is present in the derivation rows AND is not also one of the cross-cohort concordance outcomes (i.e., not an entry in outcome.outcome_columns or a survival time_column / event_column). When the target collides with an outcome, the global selector is reused and a warning is recorded on the cohort report: refitting a selector to predict the very outcome that is then compared across cohorts would introduce a separate leakage concern.

Each cohort’s report carries a feature_selector_mode field with one of per_split_refit, global_reused, global_reused_with_warning, or none (when feature selection is disabled). Override the default with feature_selector_scope: per_split to make any unsafe situation a hard error, or feature_selector_scope: global to always reuse the pipeline’s single selector.
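The decision logic for feature_selector_mode might look roughly like the sketch below. It is an illustration under stated assumptions, not the package's implementation: the function name and arguments are hypothetical, and the documentation does not say which mode is reported when a LASSO target is simply missing from the derivation rows, so the sketch treats that case like an outcome collision.

```python
UNSUPERVISED_SELECTORS = {"variance", "correlation", "mutual_info"}


def resolve_feature_selector_mode(
    scope,                  # "auto" | "per_split" | "global"
    selector,               # selector name, or None when selection is disabled
    target=None,            # supervised target column (LASSO only)
    derivation_columns=(),  # columns available in the derivation rows
    outcome_columns=(),     # entries of outcome.outcome_columns
    survival_columns=(),    # survival time_column / event_column
):
    if selector is None:
        return "none"
    if scope == "global":
        return "global_reused"
    if selector in UNSUPERVISED_SELECTORS:
        return "per_split_refit"

    # Supervised selector: refit only when the target is present in the
    # derivation rows and is not itself a concordance outcome.
    protected = set(outcome_columns) | set(survival_columns)
    safe = target in set(derivation_columns) and target not in protected
    if safe:
        return "per_split_refit"
    if scope == "per_split":
        # Strict mode: any unsafe situation is a hard error.
        raise ValueError(f"cannot safely refit selector on target {target!r}")
    # scope == "auto": fall back to the global selector and flag it.
    # Assumption: a missing target is handled the same way as a collision.
    return "global_reused_with_warning"
```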

Modes¶

Apply-only¶

Set refit: false to apply the derivation model to each validation cohort without refitting. This is the fastest mode and reports cohort size, log-likelihood, phenotype distribution, drift, and outcome concordance. Calibration is not computed in this mode: the predicted class label is, by construction, the argmax of the posterior, so there is no independent reference label to calibrate against.

Refit-and-match¶

Set refit: true to refit a fresh StepMix on the validation cohort with the same hyperparameters as the derivation model. The resulting cluster labels are aligned with the derivation labels using a padded Hungarian assignment (scipy.optimize.linear_sum_assignment). Padding handles unequal cluster counts: extra clusters in the refit become “unmatched novel”, and derivation clusters with no counterpart become “unmatched absent”. ARI and NMI are computed on the raw (pre-alignment) labels because both metrics are permutation-invariant; alignment is used only for matched accuracy and reporting.
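One way to picture the padded assignment is the sketch below. It assumes the two label vectors cover the same validation rows (labels applied by the derivation model versus labels from the refit) and uses the contingency counts as the score to maximize; the function name, return shape, and choice of score are illustrative assumptions, not the package's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_labels(applied_labels, refit_labels):
    """Map refit cluster ids onto derivation cluster ids via padded Hungarian."""
    applied = np.asarray(applied_labels)
    refit = np.asarray(refit_labels)
    d_ids = np.unique(applied)
    r_ids = np.unique(refit)

    # Pad the contingency table to a square; padded rows/columns stay at zero.
    size = max(len(d_ids), len(r_ids))
    overlap = np.zeros((size, size), dtype=int)
    for i, d in enumerate(d_ids):
        for j, r in enumerate(r_ids):
            overlap[i, j] = np.sum((applied == d) & (refit == r))

    rows, cols = linear_sum_assignment(overlap, maximize=True)

    mapping, novel, absent = {}, [], []
    for i, j in zip(rows, cols):
        if i < len(d_ids) and j < len(r_ids):
            mapping[int(r_ids[j])] = int(d_ids[i])   # real-to-real match
        elif j < len(r_ids):
            novel.append(int(r_ids[j]))              # refit cluster matched to padding
        elif i < len(d_ids):
            absent.append(int(d_ids[i]))             # derivation cluster matched to padding
    return mapping, novel, absent


mapping, novel, absent = align_labels([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 2])
```

In this toy call the refit produced three clusters against two derivation clusters, so one refit cluster ends up in the novel list. Note that a real-to-real pair with zero overlap would still count as matched in this sketch; a production version may want a minimum-overlap threshold.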

Refit is automatically skipped (with a warning) when the validation cohort is smaller than min_validation_size_for_refit (default 100).

Metrics¶

Calibration¶

Brier score (one-vs-rest per class, plus mean).
Expected calibration error with 10 quantile bins by default.
Reliability curve data (per-bin confidence, accuracy, count).

Calibration is meaningful only in refit-and-match mode, where the refit-aligned labels serve as proxy ground truth.
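For orientation, the calibration quantities listed above can be sketched with NumPy as follows. This assumes integer class labels in 0..K-1 (the refit-aligned proxy labels) and a posterior matrix of shape (n, K); equal-count bins over the sorted confidences stand in for quantile binning, so tied confidences may be split across bins. Function names are hypothetical.

```python
import numpy as np


def brier_one_vs_rest(probs, labels):
    """Per-class one-vs-rest Brier scores and their mean."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    onehot = np.eye(probs.shape[1])[labels]
    per_class = np.mean((probs - onehot) ** 2, axis=0)
    return per_class, float(per_class.mean())


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over equal-count confidence bins, plus reliability-curve rows."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    order = np.argsort(confidence, kind="stable")
    ece, curve = 0.0, []
    for idx in np.array_split(order, n_bins):
        if len(idx) == 0:
            continue
        conf = float(confidence[idx].mean())
        acc = float(correct[idx].mean())
        curve.append((conf, acc, len(idx)))          # (confidence, accuracy, count)
        ece += len(idx) / len(labels) * abs(acc - conf)
    return ece, curve
```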

Drift¶

A tidy per-feature table is computed against the derivation cohort:

PSI (Population Stability Index) using equal-mass bin edges fitted on the derivation distribution, with Laplace smoothing (default floor 0.5 / min(n_deriv, n_val)) so empty bins do not blow up the log-ratio.
Kolmogorov-Smirnov test for continuous variables.
Chi-square test for categorical variables (with unseen categories folded into a "<unseen>" bucket).
Missing-rate difference between cohorts.
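A minimal PSI sketch for a continuous feature is shown below. It assumes the smoothing is applied as a lower bound on each bin proportion, using the default floor quoted above; the package may apply the smoothing differently (for example to counts, or with renormalization), so treat this as an approximation of the idea rather than the exact formula used.

```python
import numpy as np


def psi(deriv, val, n_bins=10, floor=None):
    """Population Stability Index with bin edges fitted on the derivation data."""
    deriv = np.asarray(deriv, dtype=float)
    val = np.asarray(val, dtype=float)
    deriv = deriv[~np.isnan(deriv)]
    val = val[~np.isnan(val)]

    # Interior equal-mass edges from derivation quantiles; duplicates collapse.
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(deriv, qs))

    n_cells = len(edges) + 1
    d_counts = np.bincount(np.searchsorted(edges, deriv, side="right"), minlength=n_cells)
    v_counts = np.bincount(np.searchsorted(edges, val, side="right"), minlength=n_cells)

    if floor is None:
        floor = 0.5 / min(len(deriv), len(val))
    # Assumed smoothing: clamp proportions from below so the log-ratio stays finite.
    p = np.maximum(d_counts / len(deriv), floor)
    q = np.maximum(v_counts / len(val), floor)
    return float(np.sum((q - p) * np.log(q / p)))
```

Because the outermost bins are open-ended, validation values outside the derivation range fall into the first or last bin instead of being dropped.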

Outcome concordance¶

For each outcome present in both cohorts, compare_outcomes extracts (log effect, SE) from the existing analyzers’ Wald CIs and reports:

Pearson r and Spearman rho across phenotypes.
Lin’s concordance correlation coefficient.
Sign agreement above a configurable absolute-effect floor.
Per-phenotype Wald delta test on log(OR)_d - log(OR)_v with pooled SE sqrt(SE_d^2 + SE_v^2), BH-FDR-corrected within the outcome family.

The same machinery applies to Cox HRs via compare_survival.
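The per-phenotype delta test and the BH correction can be sketched with the standard library alone. The function names below are hypothetical; the arithmetic follows the formula given above (difference of log effects over the pooled SE, two-sided normal p-value), together with the textbook definitions of the Benjamini-Hochberg adjustment and Lin's concordance correlation coefficient.

```python
import math


def wald_delta_pvalues(log_d, se_d, log_v, se_v):
    """Two-sided p-values for log-effect differences, one per phenotype."""
    pvals = []
    for ld, sd, lv, sv in zip(log_d, se_d, log_v, se_v):
        z = (ld - lv) / math.sqrt(sd ** 2 + sv ** 2)
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))  # equals 2 * (1 - Phi(|z|))
    return pvals


def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

A phenotype would then be flagged as discordant for an outcome when its adjusted p-value falls below the configured alpha.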

Configuration¶

generalizability:
  enabled: true
  training_scope: per_split        # per_split (default) | global
  refit: true                      # refit-and-match against the derivation-only fit
  min_validation_size_for_refit: 100
  temporal:
    time_column: admission_date
    scheme: cutoff                 # cutoff | fraction | sliding | expanding
    time_cutoff: "2020-12-31"
    # time_test_fraction: 0.2      # for scheme=fraction
    # n_windows: 3                 # for scheme=sliding|expanding
  multisite:
    site_column: center
    scheme: logo                   # logo | holdout | pairwise
    # holdout_sites: [SITE_A]      # for scheme=holdout
    min_site_size: 30
  external_cohorts:                # one or more separate CSVs
    - { path: ./cohort_B.csv, label: hospital_X, kind: site }
    - { path: ./cohort_2024.csv, label: era_2024, kind: temporal }
  calibration:        { enabled: true, n_bins: 10, strategy: quantile }
  drift:              { enabled: true, n_bins: 10, top_k: 20 }
  outcome_concordance: { enabled: true, fdr_method: bh, alpha: 0.05 }

At least one of temporal, multisite, or external_cohorts must be provided when enabled: true; otherwise the stage raises a configuration error.
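The requirement above amounts to a simple check on the parsed configuration. The sketch below works on a plain dictionary (for example the result of loading the YAML); the function name and the use of ValueError are assumptions for illustration, since the actual exception type raised by the stage is not stated here.

```python
def validate_generalizability(config):
    """Raise if generalizability is enabled without any cohort source."""
    section = config.get("generalizability") or {}
    if not section.get("enabled", False):
        return  # stage disabled: nothing to validate
    sources = ("temporal", "multisite", "external_cohorts")
    if not any(section.get(name) for name in sources):
        raise ValueError(
            "generalizability.enabled is true but none of "
            "temporal, multisite, or external_cohorts is configured"
        )
```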

Outputs¶

results/temporal_validation_results.json - one entry per temporal cohort.
results/multisite_validation_results.json - one entry per multi-site cohort.
results/external_cohorts_results.json - one entry per external CSV cohort listed under generalizability.external_cohorts.
results/generalizability_summary.json - aggregate ARI / PSI per cohort kind, plus the training_scope flag and mean_derivation_only_ari_to_global.
data/generalizability/cluster_distribution_<label>.csv and data/generalizability/drift_<label>.csv per cohort.
New plots under plots/ (cohort prevalence heatmaps, drift bar charts, OR concordance scatter, ARI forest).

Each cohort’s JSON entry carries:

fit_mode: per_split (default, for in-CSV splits) or global (external cohorts, and in-CSV splits when training_scope: global is set).
derivation_only_ari: ARI between the fresh per-split derivation-only fit and the global full-cohort model on the same rows. A high value means the per-split fit recovers the same structure as the descriptive global model.
derivation_only_outcomes: ORs computed against the derivation-only labels, fed into the cross-cohort outcome concordance comparison.

A new “Generalizability” section is appended to the static HTML report when generalizability results are present.

Pitfalls¶

Refit instability with very small validation cohorts: the min_validation_size_for_refit guardrail (default 100) skips refit and emits a warning rather than producing meaningless ARI/NMI.
Time-column leakage: do not include the time column itself among continuous_columns or categorical_columns.
Missing values in the partition column (time_column or site_column) are excluded from the partitioning step and tracked in the result’s stratification_fallback_reason.