7. Evaluation

Ground truth

Two sources, joined on run accession (sample_id):

  1. AMRFinder+ (assets/amrfinder_gt_expansion.tsv) — assembly-derived gene presence calls for the plasmid panel: blaKPC, blaCTX-M, blaNDM, qnrB1, blaOXA-48, aac6-Ib-cr. Also provides a baseline blaSHV presence call.

  2. Kleborate v3 (assets/kpsc_expansion_kleborate_gt_runlevel.tsv) — overrides AMRFinder+ for blaSHV with per-sample BLAST copy counts from the assembly, so the chromosomal-CNV metric is “extra copy ≥ 2” rather than “presence vs absence”.

Both tables are bridged from BioSample-level to run-level identifiers using assets/kpsc_expansion_metadata.tsv. The sample_id column may hold a comma-separated list for multi-run BioSamples; in that case we keep the first run.
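The bridging and override logic can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the column names biosample, gene, present and shv_copies are hypothetical, and only sample_id and the "copy count ≥ 2" rule come from the text above.

```python
import csv


def read_tsv(path):
    """Load a tab-separated file into a list of dicts keyed by header name."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


def first_run(sample_id_field):
    """Keep the first run accession of a possibly comma-separated field."""
    return sample_id_field.split(",")[0].strip()


def build_ground_truth(metadata_rows, amrfinder_rows, kleborate_rows):
    """Return {run_accession: {gene: 0 or 1}}.

    Column names other than sample_id are assumptions made for this sketch.
    """
    # BioSample -> first run accession listed in the metadata table.
    bridge = {r["biosample"]: first_run(r["sample_id"]) for r in metadata_rows}

    gt = {}
    # AMRFinder+ supplies presence/absence, including a baseline blaSHV call.
    for r in amrfinder_rows:
        run = bridge.get(r["biosample"])
        if run is not None:
            gt.setdefault(run, {})[r["gene"]] = int(r["present"])

    # Kleborate overrides blaSHV: an event means two or more copies.
    for r in kleborate_rows:
        run = bridge.get(r["biosample"])
        if run is not None:
            gt.setdefault(run, {})["blaSHV"] = int(int(r["shv_copies"]) >= 2)
    return gt
```

Processing Kleborate last means its blaSHV value replaces the AMRFinder+ baseline wherever both tables cover the same run.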

Metrics

models/evaluation/04_kpsc_evaluation.py computes per-gene:

Metric      Definition
MCC         Matthews correlation coefficient. The primary metric.
FNR         False-negative rate: fraction of true events called as normal. Primary optimisation target: minimising missed AMR calls.
PPV         Precision. Apparent FPs may be real (assembly-derived ground truth is imperfect), so PPV is interpreted with caution.
call_rate   Fraction of GT-callable samples for which CNVRock produced a non-missing call.
n_eval      Number of samples in the evaluation.
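As an illustration of these definitions, the per-gene metrics could be computed along the following lines. This is a sketch under stated assumptions rather than the contents of 04_kpsc_evaluation.py: labels are 0 (normal), 1 (event) or None (missing), n_eval is taken to be the number of samples with both a ground-truth label and a call, and undefined ratios are returned as NaN.

```python
import math


def gene_metrics(truth, pred):
    """Per-gene metrics from {run: 0/1/None} dicts (hypothetical layout)."""
    nan = float("nan")
    # GT-callable: ground truth is present for the run.
    callable_runs = [r for r, t in truth.items() if t is not None]
    # Evaluated: GT-callable and the model produced a non-missing call.
    called = [r for r in callable_runs if pred.get(r) is not None]

    tp = sum(1 for r in called if truth[r] == 1 and pred[r] == 1)
    tn = sum(1 for r in called if truth[r] == 0 and pred[r] == 0)
    fp = sum(1 for r in called if truth[r] == 0 and pred[r] == 1)
    fn = sum(1 for r in called if truth[r] == 1 and pred[r] == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": (tp * tn - fp * fn) / denom if denom else nan,
        "FNR": fn / (fn + tp) if (fn + tp) else nan,
        "PPV": tp / (tp + fp) if (tp + fp) else nan,
        "call_rate": len(called) / len(callable_runs) if callable_runs else nan,
        "n_eval": len(called),
    }
```

Returning NaN when a denominator is zero keeps a gene with no true events, or no positive calls, from being reported as a misleading 0.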

For each gene, the eval also reports:

Diagnostic                        What it tells you
CRR/PCN by predicted label        Whether the underlying signal (p10/p25/p50/p75/p90 of CRR for chromosomal genes, PCN for plasmid) is correctly stratified by the predicted label. If pred_normal and pred_event overlap, the threshold is wrong.
delta (model-added missingness)   Samples where ground truth is callable but the model failed HMM sanity. High delta = HMM convergence problems.
Stratification by ST              Per-ST MCC, FNR, PPV. Reveals if any clone is systematically mis-called.
Segment-level callability         What % of samples produced ≥ 1 non-CN1 segment. Low callability flags model-collapse modes.
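Two of these diagnostics are easy to sketch: the signal percentiles by predicted label, and delta. The code below is illustrative only. The input dict layout, the linear-interpolation percentile and the labels_overlap heuristic (p90 of pred_normal reaching p10 of pred_event) are assumptions, not the pipeline's documented rules.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def signal_by_label(signal, pred, qs=(10, 25, 50, 75, 90)):
    """Summarise CRR or PCN per predicted label.

    signal: {run: float}; pred: {run: 0/1/None} (hypothetical layout).
    """
    groups = {"pred_normal": [], "pred_event": []}
    for run, label in pred.items():
        if label is None or run not in signal:
            continue  # missing calls carry no label to stratify on
        key = "pred_event" if label == 1 else "pred_normal"
        groups[key].append(signal[run])
    return {
        name: {f"p{q}": percentile(vals, q) for q in qs}
        for name, vals in groups.items()
        if vals
    }


def labels_overlap(summary):
    """Assumed heuristic: the groups overlap if normal p90 reaches event p10."""
    if "pred_normal" not in summary or "pred_event" not in summary:
        return False
    return summary["pred_normal"]["p90"] >= summary["pred_event"]["p10"]


def delta_missingness(truth, pred):
    """Count runs with callable ground truth but a missing model call."""
    return sum(
        1 for run, t in truth.items() if t is not None and pred.get(run) is None
    )
```

A summary for which labels_overlap returns True would be a prompt to look at the threshold, in line with the note in the table above.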

Two-sided evaluation

The eval runs separately for chromosomal and plasmid genes:

  • Chromosomal: uses gene_calls.tsv and chromosomal-flank CRR. Currently scored: blaSHV (extra-copy mode).

  • Plasmid: uses plasmid_gene_calls.tsv and absolute PCN. Currently scored: blaKPC, blaCTX-M, blaNDM, qnrB1, blaOXA-48, aac6-Ib-cr.

Hold-out subset

If cnv_eval_holdout_tsv is set in the config, the evaluation restricts to the listed run accessions. This is how we report out-of-distribution performance for the manuscript: the Phase 1 holdout is never seen at any tier of the scaling study.
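The restriction amounts to a set filter on run accessions. A minimal sketch follows, assuming the hold-out TSV lists one run accession per line in its first column with an optional sample_id header; the file layout and both helper names are hypothetical, and only the cnv_eval_holdout_tsv key comes from the config.

```python
def parse_holdout(lines, header="sample_id"):
    """Collect run accessions from the first tab-separated column.

    Blank lines and a header cell equal to `header` are skipped (assumed layout).
    """
    runs = set()
    for line in lines:
        first = line.rstrip("\n").split("\t")[0].strip()
        if first and first != header:
            runs.add(first)
    return runs


def restrict_to_holdout(calls, config):
    """Filter {run: value} to hold-out runs if the config key is set."""
    path = config.get("cnv_eval_holdout_tsv")
    if not path:
        return calls  # no hold-out configured: evaluate every run
    with open(path) as fh:
        holdout = parse_holdout(fh)
    return {run: value for run, value in calls.items() if run in holdout}
```

Applying the same filter to both the ground-truth table and the model calls keeps call_rate and n_eval consistent within the subset.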

Output

The headline summary evaluation.txt follows a fixed format with three blocks:

OVERALL                  MCC / FNR / PPV / call_rate / n_eval per gene
MISSINGNESS              per-gene gt_miss / pred_miss / delta
CRR/PCN BY LABEL         signal-conditioned-on-predicted-label percentiles

Plus optional per-ST and segment diagnostics sections.

Tip

For the manuscript, the most important rows are the OVERALL MCC values for each gene at each scaling tier (5K → 10K → 20K → 40K). These populate the table in section 8, Scaling study.