Quick Start

This guide walks through a typical PhenoCluster workflow in four steps.

1. Generate a configuration file

phenocluster create-config -p complete -o config.yaml

This creates a fully commented YAML file with sensible defaults from the complete profile. See Configuration Profiles for all available profiles.

2. Edit the configuration

Open config.yaml and fill in your dataset-specific parameters:

global:
  project_name: "My Study"
  output_dir: "results"
  random_state: 42

data:
  continuous_columns:
    - age
    - bmi
    - lab_value_1
  categorical_columns:
    - sex
    - smoking_status
    - disease_stage
  split:
    test_size: 0.2

outcome:
  enabled: true
  outcome_columns:
    - mortality_30d
    - readmission_30d

survival:
  enabled: true
  targets:
    - name: "overall_survival"
      time_column: "time_to_death"
      event_column: "death_indicator"

You can validate the config before running:

phenocluster validate-config -c config.yaml -d data.csv

3. Run the pipeline

phenocluster run -d data.csv -c config.yaml

Use --force-rerun to ignore cached intermediate results.

4. Inspect results

Results are written to the output directory (default results/):

File

Description

analysis_report.html

Comprehensive HTML report with all results and visualisations

cluster_statistics.json

Phenotype sizes, feature distributions, and classification quality

outcome_results.json

Odds ratios with confidence intervals and p-values

survival_results.json

Kaplan-Meier estimates and Cox PH hazard ratios

multistate_results.json

Transition hazard ratios, pathways, and state occupation

data/model_fit_metrics.csv

Information criteria, entropy, and posterior probabilities

data/phenotypes_data.csv

Original data augmented with phenotype assignments

data/posterior_probabilities.csv

Posterior class membership probabilities per patient

results/model_selection_summary.json

Model selection comparison table and best model info

results/feature_importance.json

Feature characterisation per phenotype

results/validation_report.json

Internal validation metrics (train/test comparison)

results/stability_results.json

Consensus clustering stability metrics

results/split_info.json

Train/test split details

results/external_validation_results.json

External validation results (when enabled)

phenocluster.log

Pipeline execution log

artifacts/

Cached intermediate results for incremental re-runs