Configuration Reference
PhenoCluster is configured via a YAML file with nested sections. All parameters
have sensible defaults; only data.continuous_columns and/or
data.categorical_columns are strictly required.
Note
Default values shown below are from the base profile template. Python
dataclass defaults (used when instantiating PhenoClusterConfig
programmatically without a profile) may differ for some parameters.
Generate a starter config from any profile (see Configuration Profiles):
phenocluster create-config -p <profile> -o config.yaml
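The requirement stated above (at least one of data.continuous_columns or data.categorical_columns must be set) can be checked before launching a run. The following is a minimal sketch, not PhenoCluster's own validation: it assumes the YAML file has already been loaded into a plain dictionary, and the helper name `has_feature_columns` and the column names are hypothetical.

```python
def has_feature_columns(config: dict) -> bool:
    """Return True if the config names at least one feature column."""
    data = config.get("data") or {}
    continuous = data.get("continuous_columns") or []
    categorical = data.get("categorical_columns") or []
    return bool(continuous) or bool(categorical)


# Hypothetical configs, as they might look after loading the YAML file.
minimal = {"data": {"continuous_columns": ["age", "bmi"]}}
empty = {"global": {}}

print(has_feature_columns(minimal))
print(has_feature_columns(empty))
```

A config that sets only the categorical list passes the same check, which matches the "and/or" wording of the requirement.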
global
Project-level settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Project identifier shown in report titles and headers |
| | str | | Directory where all output files, plots, and cached artifacts are written |
| | int | | Global random seed, automatically propagated to model selection, data splitting, and feature selection for full reproducibility |
data
Dataset schema and train/test splitting.
| Parameter | Type | Default | Description |
|---|---|---|---|
| continuous_columns | list[str] | | Names of continuous (numeric) feature columns used for phenotype discovery |
| categorical_columns | list[str] | | Names of categorical (discrete) feature columns used for phenotype discovery |
| | float | | Fraction of data held out for testing (0 to 1 exclusive) |
| | str \| null | | Column name to stratify the train/test split by (ensures balanced representation); |
| | bool | | Whether to shuffle the data before splitting |
Note
The train/test split is performed before any preprocessing. Imputation, outlier handling, encoding, and scaling are fit on the training set only for model selection. Once K is chosen, the full pipeline is refitted on the entire cohort for final analysis.
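The ordering described in this note (split first, then fit preprocessing on the training portion only) can be sketched as follows. This illustrates the general leakage-avoidance pattern with a single standardisation step, not PhenoCluster's pipeline; the function name, the 0.2 test fraction, and the seed are assumptions chosen for the example.

```python
import random
import statistics


def split_then_scale(values, test_size=0.2, seed=42):
    """Split first, then fit scaling statistics on the training part only."""
    rng = random.Random(seed)
    indices = list(range(len(values)))
    rng.shuffle(indices)

    n_test = int(round(len(values) * test_size))
    test = [values[i] for i in indices[:n_test]]
    train = [values[i] for i in indices[n_test:]]

    # Statistics come from the training set alone, so nothing about the
    # held-out rows leaks into the fitted transformation.
    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0

    def scale(xs):
        return [(x - mean) / std for x in xs]

    return scale(train), scale(test), (mean, std)


train_scaled, test_scaled, fitted = split_then_scale([float(i) for i in range(10)])
print(len(train_scaled), len(test_scaled))
```

The held-out values are transformed with the training mean and standard deviation, so their scaled mean is generally not zero.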
preprocessing.row_filter
Row-level missing data filtering, applied before any imputation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable row filtering |
| | float | | Maximum fraction of missing values allowed per row; rows exceeding this threshold are dropped |
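The row filter rule can be illustrated with a short sketch. It assumes rows are non-empty dictionaries with `None` marking a missing value; the function name `filter_rows` is hypothetical and this is not the library's implementation.

```python
def filter_rows(rows, max_missing_fraction):
    """Keep rows whose fraction of missing values does not exceed the threshold."""
    kept = []
    for row in rows:  # each row is assumed to be a non-empty dict
        missing = sum(1 for value in row.values() if value is None)
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


rows = [
    {"a": 1, "b": None, "c": None, "d": None},  # 75% missing
    {"a": 1, "b": None, "c": None, "d": 4},     # 50% missing
    {"a": 1, "b": 2, "c": 3, "d": 4},           # complete
]
print(len(filter_rows(rows, 0.5)))
```

A row sitting exactly at the threshold is kept, because only rows exceeding it are dropped.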
preprocessing.imputation
Missing data imputation for remaining missing values after row filtering.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable imputation. When disabled, StepMix handles missing values natively via FIML |
| | str | | Imputation strategy: |
| | str | | Regression estimator for iterative imputation: |
| | int | | Maximum number of imputation rounds (iterative method only) |
preprocessing.categorical_encoding
Categorical variable encoding applied before LCA/LPA.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Encoding strategy: |
| | str | | Behaviour when encountering unseen categories at test time: |
preprocessing.outlier
Outlier detection and handling for continuous features.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable outlier handling |
| | str | | Strategy: |
| | float \| "auto" | | Expected proportion of outliers in the data (isolation forest only); |
| | [float, float] | | Lower and upper percentile bounds for winsorization (e.g., |
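Winsorization with a lower and an upper percentile bound can be sketched in plain Python. This is a minimal illustration, not the library's code: it assumes the bounds are given in percent (0 to 100) and uses linear interpolation between order statistics, and both helper names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 100] on sorted input."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def winsorize(values, lower_pct, upper_pct):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(value, low), high) for value in values]


values = [float(i) for i in range(101)]
clipped = winsorize(values, 5, 95)
print(min(clipped), max(clipped))
```

Values inside the bounds are left untouched; only the tails are pulled in to the boundary values, so no rows are removed.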
preprocessing.feature_selection
Optional feature selection to reduce dimensionality before LCA/LPA.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable feature selection |
| | str | | Selection method: |
| | float | | Minimum variance required to keep a feature (variance method) |
| | float | | Drop features where a single value accounts for more than this fraction of observations |
| | float | | Maximum allowed pairwise Pearson correlation; one feature from each correlated pair is dropped |
| | int \| null | | Target number of features to select (mutual info and lasso methods); |
| | float | | Percentile threshold for feature ranking (mutual info method) |
| | float \| null | | L1 regularisation strength for lasso; |
| | str \| null | | Target column name required by supervised methods ( |
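The dominant-value rule in the table above (drop a feature when a single value accounts for more than a given fraction of observations) can be sketched as follows. The function names and the example columns are hypothetical, and the columns are assumed to be non-empty lists.

```python
from collections import Counter


def dominant_fraction(values):
    """Fraction of observations taken by the most common value."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def drop_near_constant(columns, max_dominant_fraction):
    """Keep only columns whose dominant value stays at or below the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if dominant_fraction(values) <= max_dominant_fraction
    }


columns = {
    "rare_flag": [0] * 19 + [1],   # one value covers 95% of rows
    "sex": ["f", "m"] * 10,        # balanced
}
print(sorted(drop_near_constant(columns, 0.9)))
```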
model
Latent Class / Profile Analysis model parameters and automatic selection.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | int | | Fixed number of latent classes; only used when |
| | bool | | Enable automatic model selection by searching over a range of cluster counts |
| | int | | Minimum number of clusters to evaluate during model selection |
| | int | | Maximum number of clusters to evaluate during model selection |
| | str | | Information criterion used to rank models: |
| | int \| float | | Minimum acceptable cluster size; integer for absolute count, float in (0, 1) for proportion of total samples. Models with any cluster below this threshold are rejected |
| | list[int] | | Number of random EM initialisations per cluster count to avoid local optima |
| | int | | Number of parallel jobs; |
| | bool | | Refit best model on full training data after selection |
| | int | | Maximum number of EM algorithm iterations per fit |
| | float | | Absolute convergence tolerance for the EM log-likelihood |
| | float | | Relative convergence tolerance for the EM log-likelihood |
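The interaction between the information criterion and the minimum cluster size can be sketched as below. This is an illustration under stated assumptions, not PhenoCluster's selection code: candidates are assumed to be already fitted, a lower criterion value is assumed to be better (as with BIC or AIC), and both function names are hypothetical.

```python
def meets_min_cluster_size(cluster_sizes, min_cluster_size):
    """Integer threshold = absolute count; float threshold = proportion of all samples."""
    if isinstance(min_cluster_size, float):
        required = min_cluster_size * sum(cluster_sizes)
    else:
        required = min_cluster_size
    return all(size >= required for size in cluster_sizes)


def select_k(candidates, min_cluster_size):
    """Pick the cluster count with the lowest criterion among admissible models.

    candidates: list of (k, criterion_value, cluster_sizes) tuples.
    Returns None when every candidate has a cluster below the threshold.
    """
    valid = [c for c in candidates if meets_min_cluster_size(c[2], min_cluster_size)]
    if not valid:
        return None
    return min(valid, key=lambda c: c[1])[0]


candidates = [
    (2, 1050.0, [60, 40]),
    (3, 1000.0, [70, 27, 3]),      # best criterion, but one tiny cluster
    (4, 1020.0, [40, 30, 20, 10]),
]
print(select_k(candidates, 0.05))
print(select_k(candidates, 2))
```

With a 5% proportion threshold the three-cluster model is rejected despite its better criterion value, while an absolute threshold of 2 samples lets it through.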
outcome
Binary outcome association analysis. When enabled, a logistic regression is fitted for each outcome column, comparing each phenotype against the reference.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable outcome association analysis |
| | list[str] | | Names of binary (0/1) outcome columns in the dataset |
stability
Consensus clustering stability analysis via repeated subsampled LCA/LPA fits.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable stability analysis |
| | int | | Number of subsampled LCA/LPA fits; higher values give more reliable stability estimates |
| | float | | Fraction of training data randomly sampled for each run |
| | int | | Number of parallel jobs; |
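One common way to summarise repeated subsampled fits is a co-clustering (consensus) frequency: for each pair of samples, the share of runs containing both in which they received the same label. The sketch below shows that idea in plain Python. Whether PhenoCluster computes its stability score this way is an assumption, and the function name and data layout are hypothetical.

```python
from itertools import combinations


def co_clustering_frequency(runs):
    """runs: list of dicts mapping sample id -> cluster label for one subsample.

    Returns, per sample pair, how often the pair shared a cluster among the
    runs in which both samples were drawn.
    """
    together = {}
    sampled = {}
    for labels in runs:
        for a, b in combinations(sorted(labels), 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            if labels[a] == labels[b]:
                together[(a, b)] = together.get((a, b), 0) + 1
    return {pair: together.get(pair, 0) / count for pair, count in sampled.items()}


runs = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 3: 0},   # label ids differ between runs; only agreement matters
    {0: 0, 1: 1, 2: 0},
]
frequency = co_clustering_frequency(runs)
print(frequency[(0, 2)])
```

Because labels are compared only within a run, the arbitrary numbering of clusters across runs does not affect the result.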
survival
Survival analysis with Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable survival analysis |
| | bool | | Weight survival curves by class membership probabilities instead of hard assignments |
| | list | | List of survival endpoints, each with |
multistate
Multistate transition modelling with transition-specific Cox PH models, hazard ratios, and Monte Carlo trajectory simulation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable multistate analysis |
| | list | | State definitions, each with |
| | list | | Allowed transitions between states, each with |
| | list[str] | | Column names of baseline covariates to adjust for in the Cox models |
| | int | | Minimum observed events required to fit a model for a given transition |
| | float | | Maximum follow-up horizon (in the same time unit as your data) for Monte Carlo simulation |
| | int | | Number of Monte Carlo patient trajectories simulated per phenotype |
| | list[float] | | Time points at which state occupation probabilities are evaluated |
| | int | | Safety limit on the maximum number of transitions in a single simulated trajectory |
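The roles of the simulation horizon, the number of trajectories, the evaluation time points, and the per-trajectory transition limit can be seen in a toy simulator. The sketch below swaps the fitted Cox models for constant transition hazards (exponential waiting times), which is a deliberate simplification; the state names, rates, and function names are all hypothetical.

```python
import random


def simulate_trajectory(rates, start, horizon, max_transitions, rng):
    """rates: dict state -> {next_state: constant hazard}. Returns [(time, state), ...]."""
    state, time = start, 0.0
    path = [(0.0, start)]
    for _ in range(max_transitions):       # safety limit on transitions
        exits = rates.get(state, {})
        total = sum(exits.values())
        if total <= 0:
            break                          # absorbing state
        time += rng.expovariate(total)     # waiting time until the next transition
        if time > horizon:
            break                          # follow-up ends at the horizon
        threshold = rng.random() * total   # pick destination proportionally to hazard
        cumulative = 0.0
        for next_state, rate in exits.items():
            cumulative += rate
            if threshold < cumulative:
                break
        state = next_state
        path.append((time, state))
    return path


def occupation_probabilities(paths, time_point):
    """Share of trajectories occupying each state at the given time point."""
    counts = {}
    for path in paths:
        current = path[0][1]
        for time, state in path:
            if time <= time_point:
                current = state
        counts[current] = counts.get(current, 0) + 1
    return {state: count / len(paths) for state, count in counts.items()}


rates = {"healthy": {"ill": 0.2, "dead": 0.05}, "ill": {"healthy": 0.1, "dead": 0.3}}
rng = random.Random(0)
paths = [simulate_trajectory(rates, "healthy", 10.0, 5, rng) for _ in range(200)]
print(occupation_probabilities(paths, 5.0))
```

More trajectories reduce Monte Carlo noise in the occupation probabilities, and the transition limit only matters for models with cycles, such as the healthy/ill loop in this toy example.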
inference
Statistical inference settings for outcome, survival, and multistate analyses.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable statistical inference (logistic regression, Cox PH, log-rank tests) |
| | float | | Width of confidence intervals (e.g., 0.95 for 95% CI) |
| | bool | | Apply Benjamini-Hochberg FDR correction for multiple comparisons |
| | str | | Test for binary outcomes: |
| | float | | L2 penalizer for Cox PH models (survival and multistate analyses); helps with convergence when events are sparse |
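The Benjamini-Hochberg correction mentioned in the table can be written in a few lines. The sketch below is the standard step-up procedure returning adjusted p-values in the original order; it is a reference implementation for illustration, not the code PhenoCluster calls, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A comparison is declared significant at false discovery rate q when its adjusted p-value is at most q.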
reference_phenotype
Strategy for selecting the reference phenotype against which all other phenotypes are compared in outcome and survival analyses.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Selection strategy: |
| | int \| null | | Phenotype ID to use as reference; required when |
| | str \| null | | Outcome column name used to determine the healthiest phenotype; required when |
external_validation
External validation on an independent cohort. When enabled, the fitted model is applied to an external dataset to assess phenotype reproducibility.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable external validation on an independent cohort |
| | str \| null | | Path to the external cohort CSV file |
cache
Artifact caching for incremental re-runs. Cached artifacts allow skipping completed pipeline steps when re-running with the same data and config.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable caching of intermediate pipeline results to |
| | int | | Gzip compression level for cached files (0 = no compression, 9 = maximum) |
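The effect of the compression level can be tried with Python's standard `gzip` module. The sketch below serialises an object with `pickle` and compresses it at a chosen level; the use of `pickle` and the helper names are assumptions for illustration, since the on-disk format of PhenoCluster's cache is not described here.

```python
import gzip
import pickle


def dump_artifact(obj, level):
    """Serialise an object and gzip it at the given level (0-9)."""
    return gzip.compress(pickle.dumps(obj), compresslevel=level)


def load_artifact(blob):
    """Inverse of dump_artifact."""
    return pickle.loads(gzip.decompress(blob))


artifact = {"labels": [0, 1] * 5000}
stored = dump_artifact(artifact, 0)   # level 0: gzip container, no compression
packed = dump_artifact(artifact, 9)   # level 9: smallest output, slowest
print(len(stored), len(packed))
```

Higher levels trade CPU time for disk space; the artifact read back is identical at every level.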
visualization
Plot output settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Save generated plots to the output directory |
| | int | | Resolution in dots per inch for raster plot formats |
logging
Logging configuration.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | str | | Minimum log level: |
| | str | | Log message format: |
| | bool | | Write log messages to a file in the output directory |
| | str | | Name of the log file |
| | bool | | Suppress all console output (log file is still written if enabled) |
data_quality
Automated data quality assessment run before preprocessing.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable data quality checks |
| | float | | Flag columns with more than this fraction of missing values in the report |
| | float | | Flag feature pairs with Pearson correlation exceeding this value |
| | float | | Flag features with variance below this value as near-constant |
| | bool | | Include a data quality summary section in the HTML report |
categorical_flow
Categorical variable flow visualisation settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Group variables by name prefix for cleaner visualisation |
| | str | | Character used to split variable names into prefix groups |
| | dict | | Manual variable groupings (e.g., |
| | bool | | Show Sankey diagrams (can be cluttered with many variables) |
| | bool | | Show proportion heatmap (recommended) |
| | float | | Group categories below this proportion as "Other" |
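Collapsing rare categories into "Other" can be sketched as follows. The function name is hypothetical, and the strict "below" comparison follows the wording of the table; the library's exact boundary handling is not confirmed here.

```python
from collections import Counter


def lump_rare_categories(values, min_proportion, other_label="Other"):
    """Replace categories whose share is below min_proportion with other_label."""
    counts = Counter(values)
    total = len(values)
    rare = {category for category, n in counts.items() if n / total < min_proportion}
    return [other_label if value in rare else value for value in values]


values = ["A"] * 90 + ["B"] * 7 + ["C"] * 2 + ["D"] * 1
print(Counter(lump_rare_categories(values, 0.05)))
```

Here C and D (2% and 1%) are merged, while B at 7% keeps its own band in the plot.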
feature_characterization
Descriptive feature characterisation settings for the report.
| Parameter | Type | Default | Description |
|---|---|---|---|
| | bool | | Group features by name prefix |
| | str | | Character used to split feature names into prefix groups |
| | dict | | Manual feature groupings (e.g., |
| | int | | Number of top features to show per group per cluster |
| | int | | Number of top features overall (when grouping is disabled) |
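Prefix-based grouping, as described for both categorical_flow and feature_characterization, amounts to splitting each name at the first separator. The sketch below assumes an underscore separator and hypothetical feature names; how the library treats names without a separator is also an assumption (here each forms its own group).

```python
def group_by_prefix(names, separator="_"):
    """Group names by the text before the first separator."""
    groups = {}
    for name in names:
        prefix = name.split(separator, 1)[0]
        groups.setdefault(prefix, []).append(name)
    return groups


features = ["lab_crp", "lab_hb", "vital_hr", "age"]
print(group_by_prefix(features))
```

When prefixes do not reflect the intended grouping, the manual grouping dictionary in the table above is the alternative.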