Configuration Reference
PhenoCluster is configured via a YAML file with nested sections. All parameters
have sensible defaults; only data.continuous_columns and/or
data.categorical_columns are strictly required.
Note
Default values shown below are from the base profile template. Python
dataclass defaults (used when instantiating PhenoClusterConfig
programmatically without a profile) may differ for some parameters.
Generate a starter config from any profile (see Configuration Profiles):
phenocluster create-config -p <profile> -o config.yaml
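The requirement that at least one of the two column lists be present can be sketched as a small check on a config dictionary. This is an illustration only, not PhenoCluster's own validator: the function name and the column names are hypothetical, and the dictionary is assumed to mirror the nested YAML structure.

```python
def has_required_columns(config: dict) -> bool:
    """Return True if the config names at least one feature column."""
    data = config.get("data") or {}
    continuous = data.get("continuous_columns") or []
    categorical = data.get("categorical_columns") or []
    return bool(continuous) or bool(categorical)


# Hypothetical column names, for illustration only.
minimal_config = {"data": {"continuous_columns": ["age", "bmi"]}}
incomplete_config = {"data": {"continuous_columns": [], "categorical_columns": []}}

print(has_required_columns(minimal_config))
print(has_required_columns(incomplete_config))
```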
global
Project-level settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Project identifier shown in report titles and headers |
| | str | | Directory where all output files, plots, and cached artifacts are written |
| | int | | Global random seed, automatically propagated to model selection, data splitting, and feature selection for full reproducibility |
| | bool | | Whether to render the static HTML analysis report at the end of a run. JSON, CSV, and Plotly figure outputs are written either way. Can be overridden at the command line with |
data
Dataset schema and train/test splitting.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | list[str] | | Names of continuous (numeric) feature columns used for phenotype discovery |
| | list[str] | | Names of categorical (discrete) feature columns used for phenotype discovery |
| | str | | Splitting strategy. One of |
| | float | | Fraction of data held out for testing (0 to 1 exclusive). Used by |
| | str \| null | | Column name to stratify the train/test split by (ensures balanced representation); |
| | bool | | Whether to shuffle the data before splitting. Used by |
| | str \| null | | Name of the date/datetime column. Required when |
| | str | | Sub-strategy for |
| | str \| null | | Cutoff timestamp; rows with |
| | float \| null | | Fraction (0 to 1 exclusive) of the most recent rows used for validation. Required when |
| | int \| null | | Number of validation windows (must be >= 2). Required when |
| | str \| null | | Column whose values define the groups. Required for |
| | list \| null | | Group values placed in the validation set; all other groups go to derivation. Required for |
| | int | | Minimum number of rows a validation cohort must contain; smaller cohorts raise an error |
Note
The train/test split is performed before any preprocessing. During model selection, imputation, outlier handling, encoding, and scaling are fit on the training set only. Once the number of latent classes K is chosen, the full pipeline is refitted on the entire cohort for final analysis.
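The ordering described in this note (split first, then fit preprocessing on the training rows only) can be sketched with the standard library. This is a simplified stand-in, not PhenoCluster's pipeline: it handles a single numeric column, standardisation stands in for all preprocessing steps, and the function name is hypothetical.

```python
import random
import statistics


def split_then_standardize(values, test_size=0.2, shuffle=True, seed=0):
    """Split first, then fit the scaler on the training part only."""
    rows = list(values)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_size))
    test, train = rows[:n_test], rows[n_test:]

    # Scaling statistics come from the training rows alone, so the
    # held-out rows cannot leak into the fitted preprocessing.
    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0

    def scale(xs):
        return [(x - mean) / std for x in xs]

    return scale(train), scale(test), (mean, std)


values = [float(v) for v in range(10)]
train_z, test_z, (mean, std) = split_then_standardize(values, test_size=0.2, seed=7)
```

The held-out rows are transformed with the training mean and standard deviation, so their standardised values are generally not centred on zero.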
Note
Legacy configurations that omit split.strategy keep working unchanged:
random is the default and the only fields it consults are
test_size, stratify_by, and shuffle.
preprocessing.row_filter
Row-level missing data filtering, applied before any imputation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable row filtering |
| | float | | Maximum fraction of missing values allowed per row; rows exceeding this threshold are dropped |
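As a rough illustration of the threshold semantics, the sketch below drops rows whose share of missing cells exceeds the limit. It assumes missing values are represented as `None` or NaN; the function and argument names are hypothetical and do not correspond to PhenoCluster's internals.

```python
import math


def filter_rows(rows, max_missing_fraction):
    """Keep rows whose fraction of missing cells is at or below the limit."""
    kept = []
    for row in rows:
        if not row:
            continue  # an empty row carries no information
        missing = sum(
            1
            for value in row
            if value is None or (isinstance(value, float) and math.isnan(value))
        )
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


rows = [
    [1.0, 2.0, 3.0, 4.0],      # 0% missing
    [1.0, None, None, None],   # 75% missing
    [None, 2.0, 3.0, 4.0],     # 25% missing
]
kept = filter_rows(rows, max_missing_fraction=0.5)
```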
preprocessing.imputation
Missing data imputation for remaining missing values after row filtering.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable imputation. When disabled, StepMix handles missing values natively via FIML |
| | str | | Imputation strategy: |
| | str | | Regression estimator for iterative imputation: |
| | int | | Maximum number of imputation rounds (iterative method only) |
preprocessing.categorical_encoding
Categorical variable encoding applied before LCA/LPA.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Encoding strategy: |
| | str | | Behaviour when encountering unseen categories at test time: |
preprocessing.outlier
Outlier detection and handling for continuous features.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable outlier handling |
| | str | | Strategy: |
| | float \| “auto” | | Expected proportion of outliers in the data (isolation forest only); |
| | [float, float] | | Lower and upper percentile bounds for winsorization (e.g., |
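For readers unfamiliar with winsorization, the sketch below clips values to a lower and an upper percentile bound. It assumes the bounds are given on a 0 to 100 scale and uses linear interpolation between order statistics; PhenoCluster's actual scale and interpolation rule are not stated here, and both function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * weight


def winsorize(values, lower_pct, upper_pct):
    """Clip every value into the [lower_pct, upper_pct] percentile range."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(value, low), high) for value in values]


clipped = winsorize([1.0, 2.0, 3.0, 4.0, 1000.0], 0, 75)
# -> [1.0, 2.0, 3.0, 4.0, 4.0]
```

Unlike dropping outliers, this keeps every row and only pulls extreme values back to the bounds.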
preprocessing.feature_selection
Optional feature selection to reduce dimensionality before LCA/LPA.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable feature selection |
| | str | | Selection method: |
| | float | | Minimum variance required to keep a feature (variance method) |
| | float | | Drop features where a single value accounts for more than this fraction of observations |
| | float | | Maximum allowed pairwise Pearson correlation; one feature from each correlated pair is dropped |
| | int \| null | | Target number of features to select (mutual info and lasso methods); |
| | float | | Percentile threshold for feature ranking (mutual info method) |
| | float \| null | | L1 regularisation strength for lasso; |
| | str \| null | | Target column name required by supervised methods ( |
| | bool | | When |
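The correlation threshold in the table above can be pictured as a greedy filter: walk the features in order and keep one only if it is not too strongly correlated with any feature already kept. The sketch below makes two assumptions that the table does not confirm: the absolute value of the correlation is compared, and the earlier feature of a pair is the one retained. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def drop_correlated(columns, threshold):
    """Return the names of features kept after the pairwise filter."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns, threshold=0.9)
```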
model
Latent Class / Profile Analysis model parameters and automatic selection.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | int | | Fixed number of latent classes; only used when |
| | bool | | Enable automatic model selection by searching over a range of cluster counts |
| | int | | Minimum number of clusters to evaluate during model selection |
| | int | | Maximum number of clusters to evaluate during model selection |
| | str | | Information criterion used to rank models: |
| | int \| float | | Minimum acceptable cluster size; integer for absolute count, float in (0, 1) for proportion of total samples. Models with any cluster below this threshold are rejected |
| | list[int] | | Number of random EM initialisations per cluster count to avoid local optima |
| | int | | Number of parallel jobs; |
| | bool | | Refit best model on full training data after selection |
| | int | | Maximum number of EM algorithm iterations per fit |
| | float | | Absolute convergence tolerance for the EM log-likelihood |
| | float | | Relative convergence tolerance for the EM log-likelihood |
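The dual meaning of the minimum cluster size (an integer is an absolute count, a float in (0, 1) is a proportion of all samples) can be sketched as follows. Rounding the proportional threshold up to a whole row is an assumption, and the function names are hypothetical.

```python
import math


def min_size_in_rows(min_cluster_size, n_samples):
    """Translate an int (count) or float (proportion) into a row count."""
    if isinstance(min_cluster_size, bool):
        raise TypeError("min_cluster_size must be an int or a float")
    if isinstance(min_cluster_size, int):
        return min_cluster_size
    if 0 < min_cluster_size < 1:
        # Assumption: a fractional threshold is rounded up to a whole row.
        return math.ceil(min_cluster_size * n_samples)
    raise ValueError("float values must lie strictly between 0 and 1")


def is_acceptable(cluster_sizes, min_cluster_size):
    """Reject a model if any cluster falls below the threshold."""
    threshold = min_size_in_rows(min_cluster_size, sum(cluster_sizes))
    return all(size >= threshold for size in cluster_sizes)


print(is_acceptable([50, 30, 20], 10))    # absolute count
print(is_acceptable([60, 30, 10], 0.25))  # proportion of 100 samples
```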
outcome
Binary outcome association analysis. When enabled, a logistic regression is fitted for each outcome column, comparing each phenotype against the reference.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable outcome association analysis |
| | list[str] | | Names of binary (0/1) outcome columns in the dataset |
stability
Consensus clustering stability analysis via repeated subsampled LCA/LPA fits.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable stability analysis |
| | int | | Number of subsampled LCA/LPA fits; higher values give more reliable stability estimates |
| | float | | Fraction of training data randomly sampled for each run |
| | int | | Number of parallel jobs; |
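The subsampling step behind the stability analysis can be sketched as drawing a fixed fraction of row indices for each run from a seeded generator. Sampling without replacement is an assumption here, the model fitting itself is omitted, and the function name is hypothetical.

```python
import random


def subsample_indices(n_rows, fraction, n_runs, seed):
    """Draw one index subset per run, reproducibly for a given seed."""
    rng = random.Random(seed)
    size = max(1, int(n_rows * fraction))
    # Assumption: each run samples rows without replacement.
    return [sorted(rng.sample(range(n_rows), size)) for _ in range(n_runs)]


runs = subsample_indices(n_rows=100, fraction=0.8, n_runs=5, seed=42)
```

Each subset would then be fitted separately, and agreement between the resulting cluster assignments summarised across runs.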
survival
Survival analysis with Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable survival analysis |
| | bool | | Weight survival curves by class membership probabilities instead of hard assignments |
| | list | | List of survival endpoints, each with |
multistate
Multistate transition modelling with transition-specific Cox PH models, hazard ratios, and Monte Carlo trajectory simulation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable multistate analysis |
| | list | | State definitions, each with |
| | list | | Allowed transitions between states, each with |
| | list[str] | | Column names of baseline covariates to adjust for in the Cox models |
| | int | | Minimum observed events required to fit a model for a given transition |
| | float | | Maximum follow-up horizon (in the same time unit as your data) for Monte Carlo simulation |
| | int | | Number of Monte Carlo patient trajectories simulated per phenotype |
| | list[float] | | Time points at which state occupation probabilities are evaluated |
| | int | | Safety limit on the maximum number of transitions in a single simulated trajectory |
inference
Statistical inference settings for outcome, survival, and multistate analyses.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable statistical inference (logistic regression, Cox PH, log-rank tests) |
| | float | | Width of confidence intervals (e.g., 0.95 for 95% CI) |
| | bool | | Apply Benjamini-Hochberg FDR correction for multiple comparisons |
| | str | | Test for binary outcomes: |
| | float | | L2 penalizer for Cox PH models (survival and multistate analyses); helps with convergence when events are sparse |
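For reference, the Benjamini-Hochberg procedure named in the table can be written in a few lines. This is the textbook step-up adjustment, shown as a standalone sketch; it is not taken from PhenoCluster's source, and how the package groups p-values into families is not covered here.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A comparison is then declared significant when its adjusted p-value falls below the chosen FDR level.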
reference_phenotype
Strategy for selecting the reference phenotype against which all other phenotypes are compared in outcome and survival analyses.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Selection strategy: |
| | int \| null | | Phenotype ID to use as reference; required when |
| | str \| null | | Outcome column name used to determine the healthiest phenotype; required when |
external_validation
External validation on an independent cohort. When enabled, the fitted model is applied to an external dataset to assess phenotype reproducibility.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable external validation on an independent cohort |
| | str \| null | | Path to the external cohort CSV file |
generalizability
Temporal and multi-site generalizability assessment (v0.3.0). See
Generalizability for the full reference; the table below lists the
top-level toggles. Sub-blocks temporal, multisite,
external_cohorts, calibration, drift, and
outcome_concordance are documented in detail there.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Run the generalizability stage. Requires at least one of |
| | | | For each in-CSV (derivation, validation) split, |
| | | | Whether to refit the feature selector per split. |
| | bool | | When |
| | int | | Skip the validation-side refit when the validation cohort has fewer rows than this. |
| | object \| null | | Temporal split spec. Fields: |
| | object \| null | | Multi-site split spec. Fields: |
| | list of objects | | Each entry: |
| | object | | Brier / ECE / reliability-curve settings (refit-and-match mode only). |
| | object | | Per-feature drift table (PSI, KS, chi-square). |
| | object | | Cross-cohort OR/HR concordance with FDR-corrected per-phenotype delta tests. |
cache
Artifact caching for incremental re-runs. Cached artifacts allow skipping completed pipeline steps when re-running with the same data and config.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable caching of intermediate pipeline results to |
| | int | | Gzip compression level for cached files (0 = no compression, 9 = maximum) |
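The effect of the compression level can be seen with Python's standard `gzip` module. The sketch below serialises a toy artifact as JSON before compressing it; the serialisation format and helper names are assumptions made for illustration and say nothing about how PhenoCluster stores its cache.

```python
import gzip
import json


def dump_compressed(obj, compresslevel):
    """Serialise to JSON bytes, then gzip at the requested level (0-9)."""
    return gzip.compress(json.dumps(obj).encode("utf-8"), compresslevel=compresslevel)


def load_compressed(blob):
    """Inverse of dump_compressed."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


artifact = {"labels": [0, 1] * 5000}  # toy stand-in for a cached result
stored = dump_compressed(artifact, compresslevel=0)   # no compression
packed = dump_compressed(artifact, compresslevel=9)   # maximum compression
```

Higher levels trade CPU time for smaller files, which tends to matter most for large, repetitive artifacts.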
visualization
Plot output settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Save generated plots to the output directory |
| | int | | Resolution in dots per inch for raster plot formats |
logging
Logging configuration.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Minimum log level: |
| | str | | Log message format: |
| | bool | | Write log messages to a file in the output directory |
| | str | | Name of the log file |
| | bool | | Suppress all console output (log file is still written if enabled) |
data_quality
Automated data quality assessment run before preprocessing.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable data quality checks |
| | float | | Flag columns with more than this fraction of missing values in the report |
| | float | | Flag feature pairs with Pearson correlation exceeding this value |
| | float | | Flag features with variance below this value as near-constant |
| | bool | | Include a data quality summary section in the HTML report |
categorical_flow
Categorical variable flow visualisation settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Group variables by name prefix for cleaner visualisation |
| | str | | Character used to split variable names into prefix groups |
| | dict | | Manual variable groupings (e.g., |
| | bool | | Show Sankey diagrams (can be cluttered with many variables) |
| | bool | | Show proportion heatmap (recommended) |
| | float | | Group categories below this proportion as “Other” |
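The rare-category threshold can be illustrated by relabelling any category whose share of observations falls below the cutoff. This is a sketch of the idea only: the function name and sample values are hypothetical, and whether PhenoCluster computes the proportion per variable or per cluster is not specified here.

```python
from collections import Counter


def lump_rare(values, min_proportion, other_label="Other"):
    """Relabel categories whose overall share is below min_proportion."""
    counts = Counter(values)
    total = len(values)
    return [
        value if counts[value] / total >= min_proportion else other_label
        for value in values
    ]


values = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
lumped = lump_rare(values, min_proportion=0.2)
```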
feature_characterization
Descriptive feature characterisation settings for the report.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Group features by name prefix |
| | str | | Character used to split feature names into prefix groups |
| | dict | | Manual feature groupings (e.g., |
| | int | | Number of top features to show per group per cluster |
| | int | | Number of top features overall (when grouping is disabled) |
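Prefix grouping, as described for both this section and categorical_flow, amounts to splitting each name at the first separator and collecting names that share a prefix. In the sketch below the underscore separator and the feature names are assumptions chosen for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_prefix(names, separator="_"):
    """Group names by the text before the first separator."""
    groups = defaultdict(list)
    for name in names:
        prefix = name.split(separator, 1)[0]
        groups[prefix].append(name)
    return dict(groups)


# Hypothetical feature names.
groups = group_by_prefix(["lab_crp", "lab_wbc", "vital_hr", "age"])
```

Names without the separator end up in a group of their own, keyed by the full name.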