Generalizability¶
Version 0.3.0 introduces a first-class generalizability assessment that runs after the main pipeline. Three complementary cohort sources are supported:
Temporal generalizability - validate phenotypes derived on early data against later data via a fixed cutoff, a chronological fraction, or sliding/expanding rolling windows.
Multi-site generalizability - validate phenotypes across hospitals or centers via leave-one-site-out (scheme name logo, short for leave-one-group-out with the site as the group), holdout, or pairwise schemes.
External cohort CSVs - validate against one or more separate CSV files that share the derivation schema. Configured via generalizability.external_cohorts.
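The in-CSV split schemes above can be illustrated with a small, standard-library-only sketch. It is not the pipeline's implementation: the function names, the row-as-dict representation, and the choice to treat the cutoff date as inclusive on the derivation side are all assumptions made for illustration. Rows with a missing partition value are dropped from the split, in line with the note under Pitfalls.

```python
from datetime import date

def cutoff_split(rows, time_key, cutoff):
    """Fixed cutoff: rows dated on or before `cutoff` form the derivation set.

    Treating the cutoff as inclusive is an assumption of this sketch.
    """
    dated = [r for r in rows if r.get(time_key) is not None]
    derivation = [r for r in dated if r[time_key] <= cutoff]
    validation = [r for r in dated if r[time_key] > cutoff]
    return derivation, validation

def fraction_split(rows, time_key, test_fraction):
    """Chronological fraction: the latest `test_fraction` of rows is validation."""
    dated = sorted(
        (r for r in rows if r.get(time_key) is not None),
        key=lambda r: r[time_key],
    )
    n_val = int(round(len(dated) * test_fraction))
    split = len(dated) - n_val
    return dated[:split], dated[split:]

def leave_one_site_out(rows, site_key):
    """Yield (held_out_site, derivation_rows, validation_rows) per site."""
    sites = sorted({r[site_key] for r in rows if r.get(site_key) is not None})
    for held_out in sites:
        derivation = [r for r in rows if r.get(site_key) not in (None, held_out)]
        validation = [r for r in rows if r.get(site_key) == held_out]
        yield held_out, derivation, validation

# Hypothetical toy cohort; column names mirror the configuration example.
rows = [
    {"id": 1, "admission_date": date(2019, 5, 1), "center": "A"},
    {"id": 2, "admission_date": date(2020, 3, 1), "center": "B"},
    {"id": 3, "admission_date": date(2020, 12, 31), "center": "A"},
    {"id": 4, "admission_date": date(2021, 2, 1), "center": "C"},
    {"id": 5, "admission_date": None, "center": "B"},  # missing time value
]

deriv_rows, val_rows = cutoff_split(rows, "admission_date", date(2020, 12, 31))
site_folds = list(leave_one_site_out(rows, "center"))
```

Sliding and expanding windows would repeat the same idea over several consecutive cutoffs; they are omitted here to keep the sketch short.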
Training scope¶
The generalizability.training_scope setting controls how the model
used for in-CSV split metrics is fit:
"per_split" (default): for every in-CSV split (temporal or multi-site), a fresh DataPreprocessor and StepMix model are fit on derivation rows only and then applied to the validation rows. Each cohort report carries fit_mode: per_split. The pipeline’s full-cohort model is left untouched for descriptive analyses elsewhere in the report.
"global": in-CSV splits are scored by the pipeline’s full-cohort model. Each cohort report carries fit_mode: global. Faster, but appropriate only when the full-cohort model is the intended evaluation reference (e.g., descriptive transport check rather than a held-out validation claim).
For external cohorts (separate CSV files) the global model is
always used: the pipeline never saw those rows during training, so
the global-model path is the correct evaluation reference regardless
of training_scope.
Feature-selector scope¶
The feature selector is a separate concern from the model. By default
(feature_selector_scope: auto) the per-split path also refits the
feature selector on each split’s derivation rows when that is safe:
For unsupervised selectors (variance, correlation, mutual_info over inter-feature MI), the selector is refit on the derivation rows.
For supervised LASSO, the selector is refit when the target column is present in the derivation rows AND is not also one of the cross-cohort concordance outcomes (i.e., not an entry in outcome.outcome_columns or a survival time_column / event_column). When the target collides with an outcome, the global selector is reused and a warning is recorded on the cohort report; selecting features that explicitly predict the outcome we are then comparing across cohorts would be a separate leakage concern.
Each cohort’s report carries a feature_selector_mode field with one
of per_split_refit, global_reused,
global_reused_with_warning, or none (when feature selection is
disabled). Override the default with
feature_selector_scope: per_split to make any unsafe situation a
hard error, or feature_selector_scope: global to always reuse the
pipeline’s single selector.
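The decision rules in this section can be summarized as a small function. This is a hypothetical sketch written from the prose above, not the pipeline's code: the function name, its arguments, and the exception type are invented for illustration. One case is not spelled out in the text, namely a LASSO target that is absent from the derivation rows under auto; the sketch assumes it is handled like the collision case.

```python
UNSUPERVISED_SELECTORS = {"variance", "correlation", "mutual_info"}

def resolve_feature_selector_mode(
    scope,                      # "auto" | "per_split" | "global"
    selector,                   # selector name, or None when selection is disabled
    target=None,                # supervised (LASSO) target column, if any
    derivation_columns=(),      # columns available in the derivation rows
    concordance_columns=(),     # outcome columns plus survival time/event columns
):
    """Return the feature_selector_mode value for one split (illustrative only)."""
    if selector is None:
        return "none"
    if scope == "global":
        # Always reuse the pipeline's single selector.
        return "global_reused"

    if selector in UNSUPERVISED_SELECTORS:
        safe = True
    else:
        # Supervised selector: the target must exist in the derivation rows and
        # must not be one of the outcomes compared across cohorts.
        safe = target in derivation_columns and target not in concordance_columns

    if safe:
        return "per_split_refit"
    if scope == "per_split":
        # Strict mode turns any unsafe situation into a hard error.
        raise ValueError(
            f"cannot refit selector {selector!r} per split: target {target!r} is unsafe"
        )
    # scope == "auto": fall back to the global selector and flag it.
    # (Assumption: a missing target is treated the same way as a collision.)
    return "global_reused_with_warning"

mode_unsupervised = resolve_feature_selector_mode("auto", "variance")
mode_collision = resolve_feature_selector_mode(
    "auto", "lasso",
    target="mortality",
    derivation_columns=["age", "lactate", "mortality"],
    concordance_columns=["mortality"],
)
```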
Modes¶
Apply-only¶
Set refit: false to apply the derivation model to each validation
cohort without refitting. This is the fastest mode and reports cohort
size, log-likelihood, phenotype distribution, drift, and outcome
concordance. Calibration is not computed in this mode: the only label
available is, by construction, the argmax of the model’s own posterior,
so scoring the posterior against it would be circular.
Refit-and-match¶
Set refit: true to refit a fresh StepMix on the validation cohort
with the same hyperparameters as the derivation model. The resulting
cluster labels are aligned with the derivation labels using a padded
Hungarian assignment (scipy.optimize.linear_sum_assignment), which
gracefully handles unequal cluster counts (extra clusters become
“unmatched novel”; missing ones become “unmatched absent”). ARI and NMI
are computed on the raw (pre-alignment) labels because both metrics
are permutation-invariant; alignment is used only for matched-accuracy
and reporting.
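A minimal sketch of the padded alignment step, assuming integer cluster labels and two label vectors over the same validation rows (one from the derivation model, one from the refit). The function name and return shape are illustrative and are not the pipeline's API; only scipy.optimize.linear_sum_assignment is taken from the text above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(deriv_labels, refit_labels):
    """Match refit clusters to derivation clusters by maximum row overlap.

    Returns (mapping refit->derivation, unmatched novel refit clusters,
    unmatched absent derivation clusters).
    """
    deriv_labels = np.asarray(deriv_labels)
    refit_labels = np.asarray(refit_labels)
    d_ids = np.unique(deriv_labels)
    r_ids = np.unique(refit_labels)

    # Pad the contingency table with zero rows/columns so it is square.
    size = max(len(d_ids), len(r_ids))
    overlap = np.zeros((size, size), dtype=int)
    for i, d in enumerate(d_ids):
        for j, r in enumerate(r_ids):
            overlap[i, j] = np.sum((deriv_labels == d) & (refit_labels == r))

    rows, cols = linear_sum_assignment(overlap, maximize=True)

    mapping, novel, absent = {}, [], []
    for i, j in zip(rows, cols):
        if i < len(d_ids) and j < len(r_ids):
            mapping[int(r_ids[j])] = int(d_ids[i])   # real-to-real match
        elif j < len(r_ids):
            novel.append(int(r_ids[j]))              # matched to a padded row
        else:
            absent.append(int(d_ids[i]))             # matched to a padded column
    return mapping, novel, absent

# Toy example: the derivation model sees 3 phenotypes, the refit finds only 2.
deriv = [0, 0, 0, 1, 1, 1, 2, 2]
refit = [1, 1, 1, 0, 0, 0, 0, 0]
mapping, novel, absent = align_labels(deriv, refit)

# Matched accuracy: share of rows whose aligned refit label equals the derivation label.
aligned = np.array([mapping.get(r, -1) for r in refit])
matched_accuracy = float(np.mean(aligned == np.asarray(deriv)))
```

Padding with zeros means an unmatched cluster costs nothing to leave out, which is why unequal cluster counts need no special casing. Note that a pair can still be matched with zero overlap when nothing better is available; a real implementation may want to report such matches separately.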
Refit is automatically skipped (with a warning) when the validation
cohort is smaller than min_validation_size_for_refit (default 100).
Metrics¶
Calibration¶
Brier score (one-vs-rest per class, plus mean).
Expected calibration error with 10 quantile bins by default.
Reliability curve data (per-bin confidence, accuracy, count).
Calibration is meaningful only in refit-and-match mode, where the refit-aligned labels serve as proxy ground truth.
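The calibration metrics listed above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the pipeline's code: posteriors are an (n, K) array, proxy labels are integers in 0..K-1, confidence is the maximum posterior, and the function names are invented here.

```python
import numpy as np

def brier_one_vs_rest(proba, labels):
    """Per-class one-vs-rest Brier scores and their mean."""
    proba = np.asarray(proba, dtype=float)
    labels = np.asarray(labels)
    onehot = np.eye(proba.shape[1])[labels]          # (n, K) indicator matrix
    per_class = np.mean((proba - onehot) ** 2, axis=0)
    return per_class, float(per_class.mean())

def expected_calibration_error(proba, labels, n_bins=10):
    """ECE over quantile (equal-mass) confidence bins, plus reliability data."""
    proba = np.asarray(proba, dtype=float)
    labels = np.asarray(labels)
    confidence = proba.max(axis=1)
    correct = (proba.argmax(axis=1) == labels).astype(float)

    # Quantile edges on the confidence values; only interior edges are
    # needed to assign each row to a bin.
    edges = np.quantile(confidence, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.searchsorted(edges[1:-1], confidence, side="right")

    ece, curve = 0.0, []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # ties in confidence can leave a bin empty
        conf_b = float(confidence[mask].mean())
        acc_b = float(correct[mask].mean())
        ece += mask.mean() * abs(acc_b - conf_b)     # weight = n_b / n
        curve.append((conf_b, acc_b, int(mask.sum())))
    return float(ece), curve

# Toy posteriors for two phenotypes, scored against proxy labels.
proba = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
proxy_labels = [0, 0, 1, 0]
per_class_brier, mean_brier = brier_one_vs_rest(proba, proxy_labels)
ece, reliability = expected_calibration_error(proba, proxy_labels, n_bins=2)
```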
Drift¶
A tidy per-feature table is computed against the derivation cohort:
PSI (Population Stability Index) using equal-mass bin edges fitted on the derivation distribution, with Laplace smoothing (default floor 0.5 / min(n_deriv, n_val)) so empty bins do not blow up the log-ratio.
Kolmogorov-Smirnov test for continuous variables.
Chi-square test for categorical variables (with unseen categories folded into a "<unseen>" bucket).
Missing-rate difference between cohorts.
Outcome concordance¶
For each outcome present in both cohorts, compare_outcomes
extracts (log effect, SE) from the existing analyzers’ Wald CIs
and reports:
Pearson r and Spearman rho across phenotypes.
Lin’s concordance correlation coefficient.
Sign agreement above a configurable absolute-effect floor.
Per-phenotype Wald delta test on log(OR)_d - log(OR)_v with pooled SE sqrt(SE_d^2 + SE_v^2), BH-FDR-corrected within the outcome family.
The same machinery applies to Cox HRs via compare_survival.
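The arithmetic behind the concordance statistics can be sketched with the standard library alone. The function names below are illustrative and do not reproduce compare_outcomes or compare_survival; the sketch assumes a two-sided normal p-value for the Wald delta test and population (1/n) moments for Lin's coefficient.

```python
import math

def wald_delta_test(log_or_d, se_d, log_or_v, se_v):
    """Test log(OR)_d - log(OR)_v against zero with the pooled standard error."""
    delta = log_or_d - log_or_v
    pooled_se = math.sqrt(se_d ** 2 + se_v ** 2)
    z = delta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail
    return delta, pooled_se, z, p_value

def benjamini_hochberg(p_values):
    """BH step-up adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                   # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient with 1/n moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Hypothetical per-phenotype effects: (log OR, SE) in derivation and validation.
derivation = [(0.5, 0.3), (-0.2, 0.25), (1.1, 0.4)]
validation = [(0.1, 0.4), (-0.3, 0.30), (0.2, 0.3)]

tests = [wald_delta_test(d, sd, v, sv) for (d, sd), (v, sv) in zip(derivation, validation)]
adjusted_p = benjamini_hochberg([t[3] for t in tests])
ccc = lins_ccc([d for d, _ in derivation], [v for v, _ in validation])
```

Unlike Pearson r, Lin's coefficient penalizes a systematic offset between cohorts: effects that are perfectly correlated but uniformly larger in one cohort score below 1.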
Configuration¶
generalizability:
  enabled: true
  training_scope: per_split  # per_split (default) | global
  refit: true  # refit-and-match against the derivation-only fit
  min_validation_size_for_refit: 100
  temporal:
    time_column: admission_date
    scheme: cutoff  # cutoff | fraction | sliding | expanding
    time_cutoff: "2020-12-31"
    # time_test_fraction: 0.2  # for scheme=fraction
    # n_windows: 3  # for scheme=sliding|expanding
  multisite:
    site_column: center
    scheme: logo  # logo | holdout | pairwise
    # holdout_sites: [SITE_A]  # for scheme=holdout
    min_site_size: 30
  external_cohorts:  # one or more separate CSVs
    - { path: ./cohort_B.csv, label: hospital_X, kind: site }
    - { path: ./cohort_2024.csv, label: era_2024, kind: temporal }
  calibration: { enabled: true, n_bins: 10, strategy: quantile }
  drift: { enabled: true, n_bins: 10, top_k: 20 }
  outcome_concordance: { enabled: true, fdr_method: bh, alpha: 0.05 }
At least one of temporal, multisite, or external_cohorts
must be provided when enabled: true; otherwise the stage raises a
configuration error.
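The rule above amounts to a simple presence check on the parsed configuration. The sketch below assumes the YAML has already been loaded into a plain dict; the function name and the use of ValueError are hypothetical stand-ins for whatever the stage actually raises.

```python
COHORT_SOURCES = ("temporal", "multisite", "external_cohorts")

def validate_generalizability_config(config):
    """Return the configured cohort sources, or raise if none is provided."""
    gen = config.get("generalizability") or {}
    if not gen.get("enabled", False):
        return []  # stage disabled: nothing to validate
    sources = [key for key in COHORT_SOURCES if gen.get(key)]
    if not sources:
        raise ValueError(
            "generalizability.enabled is true but none of "
            "temporal, multisite, or external_cohorts is configured"
        )
    return sources

valid_config = {
    "generalizability": {
        "enabled": True,
        "temporal": {"time_column": "admission_date", "scheme": "cutoff"},
    }
}
configured_sources = validate_generalizability_config(valid_config)
```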
Outputs¶
results/temporal_validation_results.json - one entry per temporal cohort.
results/multisite_validation_results.json - one entry per multi-site cohort.
results/external_cohorts_results.json - one entry per external CSV cohort listed under generalizability.external_cohorts.
results/generalizability_summary.json - aggregate ARI / PSI per kind, plus the training_scope flag and mean_derivation_only_ari_to_global.
data/generalizability/cluster_distribution_<label>.csv and data/generalizability/drift_<label>.csv per cohort.
New plots under plots/ (cohort prevalence heatmaps, drift bar charts, OR concordance scatter, ARI forest).
Each cohort’s JSON entry carries:
fit_mode: per_split (default, for in-CSV splits) or global (external cohorts and the legacy permissive path).
derivation_only_ari: ARI between the fresh per-split derivation-only fit and the global full-cohort model on the same rows. A high value means the per-split fit recovers the same structure as the descriptive global model.
derivation_only_outcomes: ORs computed against the derivation-only labels, fed into the cross-cohort outcome concordance comparison.
A new “Generalizability” section is appended to the static HTML report when generalizability results are present.
Pitfalls¶
Refit instability with very small validation cohorts: the min_validation_size_for_refit guardrail (default 100) skips refit and emits a warning rather than producing meaningless ARI/NMI.
Time-column leakage: do not include the time column itself among continuous_columns or categorical_columns.
Missing values in the partition column (time_column or site_column) are excluded from the partitioning step and tracked in the result’s stratification_fallback_reason.