# 1. Overview

## Problem

AMR in the *Klebsiella pneumoniae* species complex (KpSC) is driven by two mechanisms that conventional gene-presence callers (AMRFinder+, Kleborate, ResFinder) **don't fully capture**:

1. **Chromosomal amplification.** Genes such as `blaSHV` are encoded on the KpSC chromosome at single copy in the reference genome, but clinical isolates can carry **multiple chromosomal copies** that drive elevated resistance. Presence-callers see "blaSHV present" and stop.
2. **Plasmid copy number (PCN).** Plasmid-borne AMR genes (`blaKPC`, `blaNDM`, `blaCTX-M`, `blaOXA-48` …) show large per-plasmid PCN variation across isolates, again undetectable from presence calls alone.

Both effects compound: a strain may carry `blaSHV` with 3× chromosomal copies *and* `blaCTX-M-15` on a 5× plasmid. CNVRock recovers both jointly from a single short-read sequencing experiment.

## Approach

CNVRock is a three-stage pipeline:

1. **1D Conv-VAE** learns a latent representation of 1 kb-binned read-depth tensors across the KpSC chromosome (~5,334 bins on NC_016845.1).
2. **Gaussian HMM** segments the per-sample latent reconstructions into discrete copy-number states, producing per-bin CN calls.
3. **Gene-level CNV callers** translate per-bin CN segments into per-gene copy-number values, separately for chromosomal genes (using CRR = gene / flank copy ratio) and plasmid genes (using PCN = gene depth / chromosomal median).
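
The two gene-level ratios in stage 3 can be illustrated with a minimal sketch. This is not CNVRock's implementation: the function names, the use of medians for the gene signal, and the `flank_bins` window size are all assumptions made for illustration. The input is any per-bin signal (read depth or per-bin CN values).

```python
from statistics import median


def chromosomal_copy_ratio(values, gene_start, gene_stop, flank_bins=5):
    """CRR sketch: gene-level signal divided by the signal in flanking bins.

    values     -- per-bin signal along the chromosome (depth or CN)
    gene_start -- first bin covered by the gene (inclusive)
    gene_stop  -- bin after the last gene bin (exclusive)
    flank_bins -- hypothetical flank width on each side (an assumption)
    """
    values = list(values)
    gene = values[gene_start:gene_stop]
    left = values[max(0, gene_start - flank_bins):gene_start]
    right = values[gene_stop:gene_stop + flank_bins]
    flank = left + right
    if not gene or not flank:
        raise ValueError("gene or flank window is empty")
    flank_level = median(flank)
    if flank_level <= 0:
        raise ValueError("flank signal must be positive")
    return median(gene) / flank_level


def plasmid_copy_number(gene_depth, chromosomal_depth):
    """PCN sketch: plasmid gene depth divided by the chromosomal median depth."""
    chrom_median = median(chromosomal_depth)
    if chrom_median <= 0:
        raise ValueError("chromosomal median depth must be positive")
    return median(gene_depth) / chrom_median


# Toy data: 30x chromosome with a two-bin region amplified to 90x.
chrom = [30.0] * 20
chrom[8:10] = [90.0, 90.0]
crr = chromosomal_copy_ratio(chrom, 8, 10)
pcn = plasmid_copy_number([150.0, 148.0, 152.0], chrom)
print(crr, pcn)
```

On this toy input the chromosomal gene comes out at 3× its flanks and the plasmid gene at 5× the chromosomal median, mirroring the compound case described in the Problem section. Normalising against local flanks rather than the genome-wide median is one way to keep regional depth bias out of the chromosomal ratio.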

## Architecture

```
                            ┌─────────────────┐
FASTQ ──BWA──► BAM ──GATK──► 1 kb bin counts ─┐
                            └─────────────────┘
                                              │
                                              ▼
                                    ┌──────────────────┐
                                    │ NPY store        │
                                    │ (n_samples,      │
                                    │  n_bins)         │
                                    └──────────────────┘
                                              │
                                              ▼
                                    ┌──────────────────┐
                                    │ 1D Conv-VAE      │ ── reconstructions
                                    │ 10-dim latent    │
                                    └──────────────────┘
                                              │
                                              ▼
                                    ┌──────────────────┐
                                    │ Gaussian HMM     │ ── per-bin CN segments
                                    │ 6 states         │
                                    └──────────────────┘
                                              │
                                ┌─────────────┴─────────────┐
                                ▼                           ▼
                         ┌──────────────────┐    ┌──────────────────┐
                         │ Chrom CNV caller │    │ Plasmid CNV      │
                         │ CRR thresholding │    │ PCN thresholding │
                         └──────────────────┘    └──────────────────┘
                                │                           │
                                ▼                           ▼
                         ┌──────────────────────────────────────────┐
                         │    evaluation.txt (MCC, FNR, PPV, …)     │
                         └──────────────────────────────────────────┘
```

## Why the scaling study

Phase 1 trained on ~545 KpSC samples. The expansion cohort (this work) makes ~78,000 KpSC samples with FASTQs available. The manuscript reports a **5-point scaling study** (545 / 5K / 10K / 20K / 40K) showing how chromosomal and plasmid AMR detection MCC, FNR and PPV change with training-set size.

The hypothesis is **monotonic improvement with diminishing returns**. Finding the saturation point matters both scientifically (what's the cost-effective training-set size?) and operationally (do we keep scaling to all 78K, or stop earlier?).
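
For reference, the three metrics tracked across the scaling points follow directly from confusion counts. The sketch below assumes binary per-gene calls (for example amplified vs. not amplified) scored against a truth set; the function name and the handling of undefined cases are illustrative choices, not taken from the CNVRock evaluation code.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return MCC, FNR and PPV from confusion-matrix counts.

    Undefined ratios (zero denominator) are reported as NaN; MCC falls
    back to 0.0 when any marginal is zero, a common convention.
    """
    nan = float("nan")
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else nan  # missed positives
    ppv = tp / (tp + fp) if (tp + fp) else nan  # precision of positive calls
    return {"mcc": mcc, "fnr": fnr, "ppv": ppv}


# Hypothetical counts for one training-set size.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in metrics.items()})
```

Computing the same dictionary at each of the five training-set sizes gives the curves needed to judge where improvement flattens out.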