1. Overview

Problem

Antimicrobial resistance (AMR) in the Klebsiella pneumoniae species complex (KpSC) is driven by two mechanisms that conventional gene-presence callers (AMRFinder+, Kleborate, ResFinder) don’t fully capture:

Chromosomal amplification. Genes such as blaSHV are encoded on the KpSC chromosome at single copy in the reference genome, but clinical isolates can carry multiple chromosomal copies that drive elevated resistance. Presence-callers see “blaSHV present” and stop.
Plasmid copy number (PCN). Plasmid-borne AMR genes (blaKPC, blaNDM, blaCTX-M, blaOXA-48 …) show large per-plasmid PCN variation across isolates, again undetectable from presence calls alone.

The two effects can compound: a strain may carry three chromosomal copies of blaSHV and also carry blaCTX-M-15 on a plasmid present at 5× copy number. CNVRock recovers both jointly from a single short-read sequencing experiment.

Approach

CNVRock is a three-stage pipeline:

1D Conv-VAE learns a latent representation of 1 kb-binned read-depth tensors across the KpSC chromosome (~5,334 bins on NC_016845.1).
Gaussian HMM (6 states) segments the per-sample VAE reconstructions into discrete copy-number states, producing per-bin CN calls.
Gene-level CNV callers translate per-bin CN segments into per-gene copy-number values, separately for chromosomal genes (using CRR, the gene-to-flank copy ratio) and plasmid genes (using PCN = gene depth / chromosomal median depth).
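
The two gene-level ratios above can be illustrated with a minimal sketch. It assumes the median as the depth summary and computes CRR directly from depths rather than from HMM copy-number calls; both choices are assumptions for illustration, and the function names and toy depths are hypothetical, not CNVRock code.

```python
from statistics import median

def copy_ratio(gene_depths, flank_depths):
    """CRR sketch: median depth over the gene / median depth over its flanks."""
    flank = median(flank_depths)
    if flank <= 0:
        raise ValueError("flank depth must be positive")
    return median(gene_depths) / flank

def plasmid_copy_number(gene_depths, chromosome_depths):
    """PCN sketch: median gene depth / chromosomal median depth."""
    chrom = median(chromosome_depths)
    if chrom <= 0:
        raise ValueError("chromosomal median depth must be positive")
    return median(gene_depths) / chrom

# Toy per-bin depths, made up for illustration only.
chromosome = [30, 31, 29, 30, 32, 28, 30]   # chromosome-wide background
bla_shv = [90, 92, 88]                      # chromosomal gene, amplified
shv_flanks = [30, 29, 31, 30]               # bins flanking the gene
bla_ctx_m = [150, 148, 152]                 # plasmid-borne gene

crr = copy_ratio(bla_shv, shv_flanks)             # 3.0
pcn = plasmid_copy_number(bla_ctx_m, chromosome)  # 5.0
print(f"CRR={crr:.1f}  PCN={pcn:.1f}")
```

Normalising the chromosomal gene against its own flanks rather than against the whole chromosome keeps local coverage bias out of the ratio, which is presumably why the two gene classes are handled separately.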

Architecture

                                    ┌─────────────────┐
   FASTQ ──BWA──► BAM ──GATK──────► │ 1 kb bin counts │─┐
                                    └─────────────────┘ │
                                                        ▼
                                              ┌──────────────────┐
                                              │  NPY store       │
                                              │  (n_samples,     │
                                              │   n_bins)        │
                                              └──────────────────┘
                                                        │
                                                        ▼
                                              ┌──────────────────┐
                                              │  1D Conv-VAE     │ ── reconstructions
                                              │  10-dim latent   │
                                              └──────────────────┘
                                                        │
                                                        ▼
                                              ┌──────────────────┐
                                              │  Gaussian HMM    │ ── per-bin CN segments
                                              │  6 states        │
                                              └──────────────────┘
                                                        │
                                          ┌─────────────┴─────────────┐
                                          ▼                           ▼
                                ┌──────────────────┐        ┌──────────────────┐
                                │ Chrom CNV caller │        │ Plasmid CNV      │
                                │ CRR thresholding │        │ PCN thresholding │
                                └──────────────────┘        └──────────────────┘
                                          │                           │
                                          ▼                           ▼
                                ┌──────────────────────────────────────────┐
                                │   evaluation.txt (MCC, FNR, PPV, …)      │
                                └──────────────────────────────────────────┘
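
The segmentation step in the diagram can be sketched as Viterbi decoding over a Gaussian HMM. This is a decoding-only illustration under stated assumptions: the six state means are fixed at copy numbers 0 to 5, all states share one standard deviation, and the transition matrix is a simple "sticky" one. None of these parameters come from CNVRock, where the HMM is presumably fitted to data, and the toy input stands in for a VAE reconstruction.

```python
import math

def viterbi_gaussian(obs, means, sigma, stay_prob):
    """Most likely state path for a Gaussian HMM with a shared sigma and
    a sticky transition matrix (stay_prob on the diagonal)."""
    n = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n - 1))

    def log_emit(x, k):
        # Gaussian log-density, dropping the constant shared by all states.
        return -0.5 * ((x - means[k]) / sigma) ** 2

    def log_trans(j, k):
        return log_stay if j == k else log_move

    # Uniform initial state distribution (assumption).
    score = [log_emit(obs[0], k) - math.log(n) for k in range(n)]
    back = []
    for x in obs[1:]:
        prev, score, ptr = score, [], []
        for k in range(n):
            best_j = max(range(n), key=lambda j: prev[j] + log_trans(j, k))
            score.append(prev[best_j] + log_trans(best_j, k) + log_emit(x, k))
            ptr.append(best_j)
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def to_segments(path):
    """Collapse a per-bin state path into (start_bin, end_bin_exclusive, state)."""
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((start, i, path[start]))
            start = i
    return segments

# Toy normalised signal: single copy, a 4-bin amplification, single copy again.
signal = [1.0, 1.1, 0.9, 1.0, 3.1, 2.9, 3.0, 3.2, 1.0, 0.9, 1.1]
cn_means = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # assumed meaning of the 6 states
path = viterbi_gaussian(signal, cn_means, sigma=0.3, stay_prob=0.99)
segments = to_segments(path)
print(segments)
```

A high stay probability penalises state changes, so isolated noisy bins are absorbed into the surrounding segment while a sustained shift in depth still produces a new segment.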

Why the scaling study

Phase 1 trained on ~545 KpSC samples. The expansion cohort (this work) makes ~78,000 KpSC samples with FASTQs available. The manuscript reports a 5-point scaling study (545 / 5K / 10K / 20K / 40K training samples) showing how three metrics for chromosomal and plasmid AMR detection change with training-set size: MCC (Matthews correlation coefficient), FNR (false-negative rate) and PPV (positive predictive value).

The hypothesis is that performance improves monotonically with diminishing returns. Finding the saturation point matters both scientifically (what’s the cost-effective training-set size?) and operationally (do we keep scaling to all 78K, or stop earlier?).
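
For reference, the three metrics tracked in the scaling study follow from confusion-matrix counts using their standard definitions. The sketch below is illustrative only: the function name, the zero-denominator fallbacks and the example counts are assumptions, and how CNVRock defines a positive call is not specified here.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """MCC, FNR and PPV from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator is zero (a convention
    chosen for this sketch).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0   # missed positives
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    return {"MCC": mcc, "FNR": fnr, "PPV": ppv}

# Hypothetical counts for one training-set size.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in m.items()})
```

Computing the same dictionary at each of the five training-set sizes gives the curves whose flattening would indicate saturation.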