Quick Start
This guide walks through a typical PhenoCluster workflow in four steps.
1. Generate a configuration file
phenocluster create-config -p complete -o config.yaml
This creates a fully commented YAML file with sensible defaults from the
complete profile. See Configuration Profiles for all available profiles.
2. Edit the configuration
Open config.yaml and fill in your dataset-specific parameters:
global:
  project_name: "My Study"
  output_dir: "results"
  random_state: 42
data:
  continuous_columns:
    - age
    - bmi
    - lab_value_1
  categorical_columns:
    - sex
    - smoking_status
    - disease_stage
  split:
    test_size: 0.2
outcome:
  enabled: true
  outcome_columns:
    - mortality_30d
    - readmission_30d
survival:
  enabled: true
  targets:
    - name: "overall_survival"
      time_column: "time_to_death"
      event_column: "death_indicator"
You can validate the config before running:
phenocluster validate-config -c config.yaml -d data.csv
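Independently of `validate-config`, a quick check that every column named in the configuration appears in the CSV header can catch typos early. The following is a minimal sketch, not part of PhenoCluster: the helper names `referenced_columns` and `missing_columns` are hypothetical, the config is assumed to be already loaded into a plain dict shaped like the YAML in step 2, and the sample data is invented for illustration.

```python
import csv
import io


def referenced_columns(config):
    """Collect every column name the example config refers to."""
    cols = []
    data = config.get("data", {})
    cols += data.get("continuous_columns", [])
    cols += data.get("categorical_columns", [])
    cols += config.get("outcome", {}).get("outcome_columns", [])
    for target in config.get("survival", {}).get("targets", []):
        cols += [target["time_column"], target["event_column"]]
    return cols


def missing_columns(config, csv_file):
    """Return configured columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = set(header)
    return [c for c in referenced_columns(config) if c not in present]


# Hypothetical config and data, for illustration only.
example_config = {
    "data": {
        "continuous_columns": ["age", "bmi"],
        "categorical_columns": ["sex"],
    },
    "outcome": {"enabled": True, "outcome_columns": ["mortality_30d"]},
    "survival": {
        "enabled": True,
        "targets": [
            {
                "name": "overall_survival",
                "time_column": "time_to_death",
                "event_column": "death_indicator",
            }
        ],
    },
}
example_csv = io.StringIO(
    "age,bmi,sex,mortality_30d,time_to_death\n63,27.1,F,0,412\n"
)

# Lists the configured columns that the CSV header does not contain.
print(missing_columns(example_config, example_csv))
```

With a real project, the dict could come from a YAML parser (for example `yaml.safe_load` from PyYAML) and `csv_file` from `open("data.csv", newline="")`.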
3. Run the pipeline
phenocluster run -d data.csv -c config.yaml
Use --force-rerun to ignore cached intermediate results.
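When the pipeline is driven from a script, the validate and run steps can be chained. The sketch below uses only the subcommands and flags shown in this guide. The function names `build_commands` and `run_pipeline` are hypothetical, and stopping on a failed validation relies on the assumption that `phenocluster validate-config` exits with a non-zero status for an invalid config, which this guide does not confirm.

```python
import shlex
import subprocess


def build_commands(data_path, config_path, force_rerun=False):
    """Build the validate and run command lines as argument lists."""
    validate = [
        "phenocluster", "validate-config", "-c", config_path, "-d", data_path,
    ]
    run = ["phenocluster", "run", "-d", data_path, "-c", config_path]
    if force_rerun:
        # Ignore cached intermediate results.
        run.append("--force-rerun")
    return [validate, run]


def run_pipeline(data_path, config_path, force_rerun=False):
    """Run validation, then the pipeline; raise if either command fails.

    Assumption: a failed validation is signalled by a non-zero exit code.
    """
    for cmd in build_commands(data_path, config_path, force_rerun):
        subprocess.run(cmd, check=True)


# Print the commands without executing them.
for cmd in build_commands("data.csv", "config.yaml", force_rerun=True):
    print(shlex.join(cmd))
```

Calling `run_pipeline("data.csv", "config.yaml")` executes both commands in order. Passing argument lists to `subprocess.run` avoids shell quoting issues with paths that contain spaces.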
4. Inspect results
Results are written to the output directory (default results/):
| File | Description |
|---|---|
| | Comprehensive HTML report with all results and visualisations |
| | Phenotype sizes, feature distributions, and classification quality |
| | Odds ratios with confidence intervals and p-values |
| | Kaplan-Meier estimates and Cox PH hazard ratios |
| | Transition hazard ratios, pathways, and state occupation |
| | Information criteria, entropy, and posterior probabilities |
| | Original data augmented with phenotype assignments |
| | Posterior class membership probabilities per patient |
| | Model selection comparison table and best model info |
| | Feature characterisation per phenotype |
| | Internal validation metrics (train/test comparison) |
| | Consensus clustering stability metrics |
| | Train/test split details |
| | External validation results (when enabled) |
| | Pipeline execution log |
| | Cached intermediate results for incremental re-runs |
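To see what a particular run actually produced, the output directory can be listed from Python. This is a minimal sketch using only the standard library. The helper name `list_outputs` is hypothetical, it assumes the default `results` directory from the example configuration, and it makes no assumption about individual file names.

```python
from pathlib import Path


def list_outputs(output_dir="results"):
    """Return (relative path, size in bytes) for every file under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing to list if the pipeline has not been run yet.
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Print one line per output file, sizes right-aligned.
for name, size in list_outputs("results"):
    print(f"{size:>10}  {name}")
```

If `output_dir` was changed in the `global` section of the configuration, pass that path instead of `"results"`.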