7. Evaluation
Ground truth
Two sources, joined on run accession (sample_id):
- AMRFinder+ (assets/amrfinder_gt_expansion.tsv): assembly-derived gene presence calls for the plasmid panel (blaKPC, blaCTX-M, blaNDM, qnrB1, blaOXA-48, aac6-Ib-cr). Also provides a baseline blaSHV presence call.
- Kleborate v3 (assets/kpsc_expansion_kleborate_gt_runlevel.tsv): overrides AMRFinder+ for blaSHV with per-sample BLAST copy counts from the assembly, so the chromosomal-CNV metric is “extra copy ≥ 2” rather than “presence vs absence”.
Both tables are bridged from BioSample-level to run-level identifiers using
assets/kpsc_expansion_metadata.tsv (the sample_id column may be
comma-separated for multi-run BioSamples; we keep the first run).
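The first-run rule can be sketched as follows. This is a minimal illustration, not the project's actual bridging code. The `biosample` column name, the function name, and the toy accessions are assumptions; only `sample_id` and the comma-separated convention come from the text above.

```python
import csv
import io


def first_run_by_biosample(handle, biosample_col="biosample", run_col="sample_id"):
    """Map each BioSample to the first run accession listed for it.

    `biosample_col` is a hypothetical column name; `run_col` may hold
    several comma-separated run accessions for multi-run BioSamples.
    """
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        runs = [r.strip() for r in row[run_col].split(",") if r.strip()]
        if runs:
            # Keep only the first run, as described in the text.
            mapping[row[biosample_col]] = runs[0]
    return mapping


# Toy metadata with made-up accessions (illustration only).
example = io.StringIO(
    "biosample\tsample_id\n"
    "SAMN00000001\tSRR0000001,SRR0000002\n"
    "SAMN00000002\tSRR0000003\n"
)
bridge = first_run_by_biosample(example)
print(bridge)
```

With a mapping like this, each BioSample-level ground-truth row can be re-keyed to a single run accession before the two sources are joined.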
Metrics
models/evaluation/04_kpsc_evaluation.py computes per-gene:
| Metric | Definition |
|---|---|
| MCC | Matthews correlation coefficient. The primary metric. |
| FNR | False-negative rate: fraction of true events called as normal. Primary optimisation target (minimising missed AMR calls). |
| PPV | Precision. Apparent FPs may be real (assembly-derived ground truth is imperfect), so PPV is interpreted with caution. |
| call_rate | Fraction of GT-callable samples for which CNVRock produced a non-missing call. |
| n_eval | Number of samples in the evaluation. |
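As a rough sketch of how the per-gene metrics defined above relate to confusion-matrix counts, the following uses the standard formulas for MCC, FNR, and PPV. It is not taken from 04_kpsc_evaluation.py. The function name and arguments are hypothetical, and it assumes that n_eval counts the samples with a non-missing call while `n_callable` counts the GT-callable samples.

```python
import math


def binary_metrics(tp, fp, tn, fn, n_callable):
    """Compute MCC, FNR, PPV, call_rate and n_eval from confusion counts.

    Assumption: n_eval is the number of samples that received a call
    (tp + fp + tn + fn); n_callable is the number of GT-callable samples.
    """
    nan = float("nan")
    n_eval = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # MCC is undefined when any marginal total is zero.
        "MCC": (tp * tn - fp * fn) / denom if denom else nan,
        # FNR: fraction of true events that were called as normal.
        "FNR": fn / (tp + fn) if (tp + fn) else nan,
        # PPV: fraction of positive calls that match the ground truth.
        "PPV": tp / (tp + fp) if (tp + fp) else nan,
        "call_rate": n_eval / n_callable if n_callable else nan,
        "n_eval": n_eval,
    }


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=40, fp=5, tn=50, fn=5, n_callable=110)
print(metrics)
```

Returning NaN for undefined ratios keeps a gene with no true events or no positive calls from silently reporting a perfect or zero score.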
For each gene, the eval also reports:
| Diagnostic | What it tells you |
|---|---|
| | Whether the underlying signal ( |
| | Samples where ground truth is callable but the model failed HMM sanity. High delta = HMM convergence problems. |
| Stratification by ST | Per-ST MCC, FNR, PPV. Reveals if any clone is systematically mis-called. |
| Segment-level callability | Percentage of samples that produced ≥ 1 non-CN1 segment. Low callability flags model-collapse modes. |
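The segment-level callability diagnostic can be illustrated with a small sketch. It assumes each sample's segmentation is available as a list of integer copy-number states, with CN1 represented as `1`. The input structure and the function name are hypothetical, not CNVRock's actual output format.

```python
def segment_callability(segments_by_sample):
    """Percentage of samples with at least one non-CN1 segment.

    `segments_by_sample` maps a sample ID to a list of integer
    copy-number states, one per segment (hypothetical representation).
    """
    if not segments_by_sample:
        return float("nan")
    n_with_event = sum(
        1
        for states in segments_by_sample.values()
        if any(cn != 1 for cn in states)
    )
    return 100.0 * n_with_event / len(segments_by_sample)


# Toy segmentations, for illustration only.
toy_segments = {
    "run_a": [1, 1, 2],  # one gained segment
    "run_b": [1, 1],     # all CN1
    "run_c": [0, 1],     # one lost segment
    "run_d": [1],        # all CN1
}
callability = segment_callability(toy_segments)
print(callability)
```

A value near zero across a whole batch would suggest the model is collapsing every sample to a single CN1 state.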
Two-sided evaluation
The eval runs separately for chromosomal and plasmid genes:
- Chromosomal: uses gene_calls.tsv and chromosomal-flank CRR. Currently scored: blaSHV (extra-copy mode).
- Plasmid: uses plasmid_gene_calls.tsv and absolute PCN. Currently scored: blaKPC, blaCTX-M, blaNDM, qnrB1, blaOXA-48, aac6-Ib-cr.
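The split between the two sides can be summarised as a small routing table. The sketch below is illustrative only. The function names are hypothetical, and it assumes that “extra copy ≥ 2” means a ground-truth BLAST copy count of at least 2 for blaSHV, while plasmid genes use plain presence.

```python
CHROMOSOMAL_GENES = ("blaSHV",)
PLASMID_GENES = ("blaKPC", "blaCTX-M", "blaNDM", "qnrB1", "blaOXA-48", "aac6-Ib-cr")


def evaluation_side(gene):
    """Return which calls file and signal are used to score a gene."""
    if gene in CHROMOSOMAL_GENES:
        return {"side": "chromosomal", "calls_file": "gene_calls.tsv", "signal": "CRR"}
    if gene in PLASMID_GENES:
        return {"side": "plasmid", "calls_file": "plasmid_gene_calls.tsv", "signal": "PCN"}
    raise KeyError(f"{gene} is not currently scored")


def ground_truth_positive(gene, present=False, copies=None):
    """Binary ground-truth label for one sample and gene.

    Assumption: chromosomal genes are positive when the copy count is
    at least 2; plasmid genes are positive when the gene is present.
    Returns None when a chromosomal copy count is unavailable.
    """
    if evaluation_side(gene)["side"] == "chromosomal":
        return None if copies is None else copies >= 2
    return bool(present)


print(evaluation_side("blaSHV"))
print(ground_truth_positive("blaSHV", copies=2))
print(ground_truth_positive("blaKPC", present=True))
```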
Hold-out subset
If cnv_eval_holdout_tsv is set in the config, the evaluation restricts to
the listed run accessions. This is how we report out-of-distribution
performance for the manuscript (Phase 1 holdout never seen at any tier of
the scaling study).
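A minimal sketch of the hold-out restriction is shown below. It assumes the file named by cnv_eval_holdout_tsv lists one run accession per line in its first tab-separated column, with no header row. That layout and both helper names are assumptions rather than the documented format.

```python
import io


def read_holdout_accessions(handle):
    """Collect run accessions from the first column of a TSV handle.

    Assumption: no header row; blank lines and '#' comments are skipped.
    """
    accessions = set()
    for line in handle:
        field = line.split("\t")[0].strip()
        if field and not field.startswith("#"):
            accessions.add(field)
    return accessions


def restrict_to_holdout(rows, holdout, key="sample_id"):
    """Keep only rows whose run accession is in the hold-out set.

    If `holdout` is None (config key unset), all rows are kept.
    """
    if holdout is None:
        return list(rows)
    return [row for row in rows if row[key] in holdout]


# Toy data with made-up accessions (illustration only).
holdout = read_holdout_accessions(io.StringIO("SRR0000001\nSRR0000003\n"))
rows = [{"sample_id": "SRR0000001"}, {"sample_id": "SRR0000002"}]
kept = restrict_to_holdout(rows, holdout)
print(kept)
```

Filtering on run accession before computing any metric keeps n_eval and call_rate specific to the hold-out subset.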
Output
The headline summary evaluation.txt follows a fixed format with three blocks:
- OVERALL: MCC / FNR / PPV / call_rate / n_eval per gene
- MISSINGNESS: per-gene gt_miss / pred_miss / delta
- CRR/PCN BY LABEL: signal-conditioned-on-predicted-label percentiles
Plus optional per-ST and segment diagnostics sections.
Tip
For the manuscript, the most important rows are the OVERALL MCC values for each gene at each scaling tier (5K → 10K → 20K → 40K). These populate the table in 8. Scaling study.