# 7. Evaluation

## Ground truth

Two sources, joined on **run accession** (`sample_id`):

1. **AMRFinder+** (`assets/amrfinder_gt_expansion.tsv`): assembly-derived gene presence calls for the plasmid panel: `blaKPC`, `blaCTX-M`, `blaNDM`, `qnrB1`, `blaOXA-48`, `aac6-Ib-cr`. Also provides a baseline `blaSHV` presence call.
2. **Kleborate v3** (`assets/kpsc_expansion_kleborate_gt_runlevel.tsv`): overrides AMRFinder+ for `blaSHV` with **per-sample BLAST copy counts** from the assembly, so the chromosomal-CNV metric is "extra copy ≥ 2" rather than "presence vs absence".

Both tables are bridged from BioSample-level to run-level identifiers using `assets/kpsc_expansion_metadata.tsv` (the `sample_id` column may be comma-separated for multi-run BioSamples; we keep the first run).

## Metrics

`models/evaluation/04_kpsc_evaluation.py` computes per-gene:

| Metric | Definition |
|---|---|
| **MCC** | Matthews correlation coefficient. The primary metric. |
| **FNR** | False-negative rate: fraction of true events called as normal. **Primary optimisation target**: minimising missed AMR calls. |
| **PPV** | Precision. Apparent FPs may be real (assembly-derived ground truth is imperfect), so PPV is interpreted with caution. |
| **call_rate** | Fraction of GT-callable samples for which CNVRock produced a non-missing call. |
| **n_eval** | Number of samples in the evaluation. |

For each gene, the eval also reports:

| Diagnostic | What it tells you |
|---|---|
| `CRR/PCN by predicted label` | Whether the underlying signal (`p10/p25/p50/p75/p90` of CRR for chromosomal genes, PCN for plasmid) is correctly stratified by the predicted label. If `pred_normal` and `pred_event` overlap, the threshold is wrong. |
| `delta` (model-added missingness) | Samples where ground truth is callable but the model failed HMM sanity. High delta = HMM convergence problems. |
| Stratification by ST | Per-ST MCC, FNR, PPV. Reveals if any clone is systematically mis-called. |
| Segment-level callability | What % of samples produced ≥ 1 non-CN1 segment. Low callability flags model-collapse modes. |

## Two-sided evaluation

The eval runs **separately** for chromosomal and plasmid genes:

- **Chromosomal:** uses `gene_calls.tsv` and chromosomal-flank CRR. Currently scored: `blaSHV` (extra-copy mode).
- **Plasmid:** uses `plasmid_gene_calls.tsv` and absolute PCN. Currently scored: `blaKPC`, `blaCTX-M`, `blaNDM`, `qnrB1`, `blaOXA-48`, `aac6-Ib-cr`.

## Hold-out subset

If `cnv_eval_holdout_tsv` is set in the config, the evaluation restricts to the listed run accessions. This is how we report **out-of-distribution performance** for the manuscript (Phase 1 holdout never seen at any tier of the scaling study).

## Output

The headline summary `evaluation.txt` follows a fixed format with three blocks:

```
OVERALL            MCC / FNR / PPV / call_rate / n_eval per gene
MISSINGNESS        per-gene gt_miss / pred_miss / delta
CRR/PCN BY LABEL   signal-conditioned-on-predicted-label percentiles
```

Plus optional **per-ST** and **segment diagnostics** sections.

```{tip}
For the manuscript, the most important rows are the OVERALL MCC values for each gene at each scaling tier (5K → 10K → 20K → 40K). These populate the {doc}`08_scaling_study` table.
```
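
To make the per-gene metric definitions from the Metrics section concrete, here is a minimal sketch of how MCC, FNR, PPV, `call_rate` and `n_eval` could be derived from paired ground-truth and predicted labels. It is not the code in `models/evaluation/04_kpsc_evaluation.py`: the function name `score_gene` and the `True`/`False`/`None` encoding (event / normal / missing) are hypothetical, and it assumes `n_eval` counts samples where both the ground truth and the prediction are non-missing.

```python
import math
from typing import Optional, Sequence


def score_gene(
    gt: Sequence[Optional[bool]],
    pred: Sequence[Optional[bool]],
) -> dict:
    """Score one gene. True = event, False = normal, None = missing (assumed encoding)."""
    # Samples where the ground truth is callable.
    gt_callable = [(g, p) for g, p in zip(gt, pred) if g is not None]
    # Of those, samples where the model also produced a non-missing call.
    scored = [(g, p) for g, p in gt_callable if p is not None]

    tp = sum(1 for g, p in scored if g and p)
    tn = sum(1 for g, p in scored if not g and not p)
    fp = sum(1 for g, p in scored if not g and p)
    fn = sum(1 for g, p in scored if g and not p)

    nan = float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Matthews correlation coefficient; undefined if any marginal is zero.
        "MCC": (tp * tn - fp * fn) / denom if denom else nan,
        # Fraction of true events called as normal.
        "FNR": fn / (tp + fn) if (tp + fn) else nan,
        # Precision over predicted events.
        "PPV": tp / (tp + fp) if (tp + fp) else nan,
        # Non-missing calls among GT-callable samples.
        "call_rate": len(scored) / len(gt_callable) if gt_callable else nan,
        # Assumption: samples with both GT and prediction present.
        "n_eval": len(scored),
    }


if __name__ == "__main__":
    gt = [True, True, False, False, True, None]
    pred = [True, False, False, True, None, True]
    print(score_gene(gt, pred))
```

In this toy input, the fifth sample is GT-callable but has no prediction, so it lowers `call_rate` without entering the confusion matrix; the sixth has no ground truth and is ignored entirely.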