7. Evaluation

Ground truth

Two sources, joined on run accession (sample_id):

AMRFinder+ (assets/amrfinder_gt_expansion.tsv) — assembly-derived gene presence calls for the plasmid panel: blaKPC, blaCTX-M, blaNDM, qnrB1, blaOXA-48, aac6-Ib-cr. Also provides a baseline blaSHV presence call.
Kleborate v3 (assets/kpsc_expansion_kleborate_gt_runlevel.tsv) — overrides AMRFinder+ for blaSHV with per-sample BLAST copy counts from the assembly, so the chromosomal-CNV metric is “extra copy ≥ 2” rather than “presence vs absence”.
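
The blaSHV override can be pictured as a small labelling rule. The sketch below is illustrative only: the function name `shv_truth_label` and its arguments are hypothetical, "extra copy ≥ 2" is read as "assembly copy count of 2 or more", and treating a missing Kleborate count as a missing label (rather than falling back to the AMRFinder+ baseline) is an assumption, not something the pipeline is confirmed to do.

```python
from typing import Optional


def shv_truth_label(kleborate_copies: Optional[int]) -> Optional[bool]:
    """Hypothetical extra-copy truth label for blaSHV.

    Returns True for an extra-copy event, False for normal, and None when
    no Kleborate copy count is available (assumed to mean "not callable").
    """
    if kleborate_copies is None:
        return None
    # Assumption: "extra copy >= 2" means two or more assembly copies.
    return kleborate_copies >= 2


# One label per sample, keyed by run accession (made-up accessions).
copies = {"SRR100": 1, "SRR200": 3, "SRR300": None}
shv_labels = {run: shv_truth_label(n) for run, n in copies.items()}
```

Under this reading, a sample carrying the usual single chromosomal copy is labelled normal even though a presence/absence call would mark it positive.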

Both tables are bridged from BioSample-level to run-level identifiers using assets/kpsc_expansion_metadata.tsv. The sample_id column may hold a comma-separated list for multi-run BioSamples; in that case we keep the first run.
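
A minimal sketch of the bridging step, assuming the metadata rows expose a BioSample key next to sample_id. The `biosample` column name, both function names, and the accessions are hypothetical; only the "keep the first run" rule comes from the text above.

```python
def first_run(sample_id_field: str) -> str:
    """Return the first run accession from a possibly comma-separated field."""
    return sample_id_field.split(",")[0].strip()


def bridge_to_runs(metadata_rows, biosample_calls):
    """Re-key BioSample-level calls by run accession, keeping the first run.

    metadata_rows: iterable of dicts with keys 'biosample' (hypothetical
    column name) and 'sample_id'.
    biosample_calls: dict mapping BioSample accession -> per-gene calls.
    """
    run_calls = {}
    for row in metadata_rows:
        run = first_run(row["sample_id"])
        if run and row["biosample"] in biosample_calls:
            run_calls[run] = biosample_calls[row["biosample"]]
    return run_calls


metadata = [
    {"biosample": "SAMN0001", "sample_id": "SRR100,SRR101"},
    {"biosample": "SAMN0002", "sample_id": "SRR200"},
]
calls = {"SAMN0001": {"blaKPC": 1}, "SAMN0002": {"blaKPC": 0}}
run_level = bridge_to_runs(metadata, calls)
```

Here SRR101 is dropped, so each BioSample contributes exactly one row to the run-level join.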

Metrics

models/evaluation/04_kpsc_evaluation.py computes per-gene:

Metric	Definition
MCC	Matthews correlation coefficient. The primary reporting metric.
FNR	False-negative rate: fraction of true events called as normal. The primary optimisation target, since the goal is to minimise missed AMR calls.
PPV	Positive predictive value (precision). Apparent FPs may be real, because assembly-derived ground truth is imperfect, so PPV is interpreted with caution.
call_rate	Fraction of ground-truth-callable samples for which CNVRock produced a non-missing call.
n_eval	Number of samples in the evaluation.
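
The definitions in the table can be written out as a short sketch. This is not the code in 04_kpsc_evaluation.py: the function names are hypothetical, labels are modelled as True (event), False (normal) or None (missing), and counting n_eval as the samples with both a truth label and a prediction is an assumption.

```python
import math


def confusion(truth, pred):
    """Count TP/FP/TN/FN over samples where both labels are non-missing."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, pred):
        if t is None or p is None:
            continue
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif not t and p:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def gene_metrics(truth, pred):
    """Per-gene MCC, FNR, PPV, call_rate and n_eval (NaN when undefined)."""
    tp, fp, tn, fn = confusion(truth, pred)
    nan = float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    callable_idx = [i for i, t in enumerate(truth) if t is not None]
    called = sum(1 for i in callable_idx if pred[i] is not None)
    return {
        "MCC": (tp * tn - fp * fn) / denom if denom else nan,
        "FNR": fn / (fn + tp) if (fn + tp) else nan,
        "PPV": tp / (tp + fp) if (tp + fp) else nan,
        "call_rate": called / len(callable_idx) if callable_idx else nan,
        "n_eval": tp + fp + tn + fn,  # assumption: both labels present
    }


truth = [True, True, True, False, False, False, True, None]
pred = [True, True, False, False, False, True, None, True]
m = gene_metrics(truth, pred)
```

In this toy input, the seventh sample lowers call_rate (truth callable, prediction missing) but does not enter the confusion matrix, and the eighth is ignored entirely because it has no ground truth.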

For each gene, the eval also reports:

Diagnostic	What it tells you
`CRR/PCN by predicted label`	Whether the underlying signal (`p10/p25/p50/p75/p90` of CRR for chromosomal genes, PCN for plasmid genes) separates cleanly by predicted label. If the `pred_normal` and `pred_event` distributions overlap substantially, the calling threshold is probably misplaced.
`delta` (model-added missingness)	Samples where ground truth is callable but the model failed the HMM sanity check. High delta = HMM convergence problems.
Stratification by ST	Per-ST MCC, FNR, PPV. Reveals if any clone is systematically mis-called.
Segment-level callability	What % of samples produced ≥ 1 non-CN1 segment. Low callability flags model-collapse modes.
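
Two of these diagnostics are easy to sketch: the signal percentiles conditioned on predicted label, and delta. The helper names below are hypothetical, the percentile uses plain linear interpolation (the real script may use a different convention), and delta is counted as "truth present, prediction missing", which is one reading of the table above.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def signal_by_label(signal, pred):
    """p10..p90 of the signal (CRR or PCN), grouped by predicted label."""
    groups = {"pred_normal": [], "pred_event": []}
    for s, p in zip(signal, pred):
        if s is None or p is None:
            continue
        groups["pred_event" if p else "pred_normal"].append(s)
    return {
        label: {f"p{q}": percentile(vals, q) for q in (10, 25, 50, 75, 90)}
        for label, vals in groups.items()
        if vals
    }


def delta_missing(truth, pred):
    """Samples with a ground-truth label but no model call."""
    return sum(1 for t, p in zip(truth, pred) if t is not None and p is None)


signal = [1.0, 1.1, 0.9, 1.0, 2.0, 2.2, 1.9, 2.1]
labels = [False, False, False, False, True, True, True, True]
table = signal_by_label(signal, labels)
# Well-separated groups: upper tail of normal sits below lower tail of event.
separated = table["pred_normal"]["p90"] < table["pred_event"]["p10"]
```

Comparing the p90 of `pred_normal` with the p10 of `pred_event` is one simple way to turn the "do the groups overlap" question into a single check.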

Two-sided evaluation

The eval runs separately for chromosomal and plasmid genes:

Chromosomal: uses gene_calls.tsv and chromosomal-flank CRR. Currently scored: blaSHV (extra-copy mode).
Plasmid: uses plasmid_gene_calls.tsv and absolute PCN. Currently scored: blaKPC, blaCTX-M, blaNDM, qnrB1, blaOXA-48, aac6-Ib-cr.

Hold-out subset

If cnv_eval_holdout_tsv is set in the config, the evaluation is restricted to the run accessions listed in that file. This is how we report out-of-distribution performance for the manuscript: the Phase 1 holdout is never seen at any tier of the scaling study.
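
The restriction amounts to a set-membership filter on run accession. The sketch below assumes the hold-out TSV has a header row with a sample_id column; that layout, the function names, and the accessions are assumptions for illustration, not the documented file format.

```python
import csv
import io


def read_holdout(handle, column="sample_id"):
    """Read the set of hold-out run accessions from a TSV with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column].strip() for row in reader if row.get(column)}


def restrict(calls, holdout):
    """Keep only hold-out runs; with no hold-out configured, keep everything."""
    if holdout is None:
        return dict(calls)
    return {run: value for run, value in calls.items() if run in holdout}


# In-memory stand-in for the file named by cnv_eval_holdout_tsv.
holdout_file = io.StringIO("sample_id\nSRR100\nSRR300\n")
holdout = read_holdout(holdout_file)

calls = {"SRR100": 1, "SRR200": 0}
kept = restrict(calls, holdout)
```

Accessions that appear in the hold-out list but have no call (SRR300 here) simply do not appear in the restricted set.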

Output

The headline summary evaluation.txt follows a fixed format with three blocks:

OVERALL                  MCC / FNR / PPV / call_rate / n_eval per gene
MISSINGNESS              per-gene gt_miss / pred_miss / delta
CRR/PCN BY LABEL         signal-conditioned-on-predicted-label percentiles

Optional per-ST and segment-diagnostic sections follow these blocks.

Tip

For the manuscript, the most important rows are the OVERALL MCC values for each gene at each scaling tier (5K → 10K → 20K → 40K). These populate the table in 8. Scaling study.