CNVRock
AMR gene copy-number variation in Klebsiella pneumoniae species complex (KpSC) using variational autoencoders.
Antimicrobial resistance (AMR) in KpSC is driven not just by gene presence but by gene copy number: amplification of chromosomal AMR loci and variable plasmid copy number (PCN) both have direct clinical consequences. CNVRock combines a 1D convolutional VAE, a Gaussian HMM segmenter, and gene-level CNV callers to recover both chromosomal and plasmid copy-number (CN) events from short-read sequencing.
This site documents the end-to-end pipeline used in the manuscript: sample selection, data acquisition, reference preparation, model training, and the scaling study (5K → 10K → 20K → 40K).
Pipeline
Manuscript
- 8. Scaling study
- 9. Methods (parameter choices)
- Hybrid mapping-quality thresholds: chromosome at MQ ≥ 20, plasmid at MQ = 0
- Earlier MQ choice (legacy, MQ ≥ 0 single-pass)
- Concurrency cap (10 simultaneous ascp)
- Stratification: species × Bla_Carb × ST cap
- VAE β-warmup
- CNV-pattern auxiliary loss
- HMM 6 states with self-transition 0.80
- Per-gene PCN thresholds
- Chromosomal CRR thresholds
- Reproducibility seed
- 10. Reproducibility
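Two of the parameter choices listed under 9. Methods can be illustrated with a minimal sketch: the 6-state HMM with self-transition 0.80, and the hybrid mapping-quality thresholds. This is not the CNVRock implementation. The function names are hypothetical, the even split of the remaining 0.20 transition probability across the other five states is an assumption, and "plasmid at MQ = 0" is read here as a floor of 0 (no mapping-quality filtering on plasmid contigs).

```python
import numpy as np

# Values taken from the Methods list above.
N_STATES = 6
SELF_TRANSITION = 0.80
MIN_MAPQ = {"chromosome": 20, "plasmid": 0}


def build_transition_matrix(n_states=N_STATES, self_p=SELF_TRANSITION):
    """Row-stochastic matrix with self_p on the diagonal.

    Assumption: the remaining 1 - self_p is split evenly across the other
    states; the documentation does not say how that mass is distributed.
    """
    off_p = (1.0 - self_p) / (n_states - 1)
    matrix = np.full((n_states, n_states), off_p)
    np.fill_diagonal(matrix, self_p)
    return matrix


def passes_mapq(replicon, mapq):
    """Hypothetical helper: keep a read if it meets its replicon's MQ floor.

    Assumption: "plasmid at MQ = 0" means a floor of 0, so every aligned
    plasmid read is kept, including multi-mapped ones.
    """
    return mapq >= MIN_MAPQ[replicon]


transmat = build_transition_matrix()
```

Under these assumptions each row of `transmat` sums to 1, with 0.80 on the diagonal and 0.04 in every off-diagonal entry. A high self-transition probability of this kind favours long runs of the same copy-number state over frequent state switches.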
Quick links
Repository: https://github.com/lcerdeira/CNVRock
Interactive demo: https://cnvrock.streamlit.app/ with sample-viewer (per-sample copy-number profile + latents), Monitor, and latent-coverage pages, on bundled 200-sample subsamples of all three organisms (K. pneumoniae 10K tier, A. baumannii, C. auris). No install required. Each dataset surfaces showcase isolates with the strongest signals (e.g. C. auris ERG11 15× amplification, chr5 aneuploidy; K. pneumoniae blaSHV 85× tandem; A. baumannii blaOXA-23 80×). The bundles embed the read-count inputs store so the copy-number viewer runs in the cloud; rebuild them with diagnostics/build_demo_bundle.py --store … --showcase ….
Manuscript reference cohort: Phase 2 expansion (~78,000 KpSC samples with FASTQ URLs on ENA)
Compute: LSHTM HPC, GPU partition (A40), Aspera SDK for high-throughput EBI downloads
At a glance
| Stage | Tool / Asset |
|---|---|
| Stratified sample selection | |
| FASTQ download | IBM Aspera ( |
| Alignment | BWA-MEM 0.7.18, paired-end |
| Read counts | GATK |
| Reference | |
| Training | 1D Conv-VAE + Gaussian HMM, |
| Plasmid CNV calls | |
| Chromosomal CNV calls | |
| Evaluation | AMRFinder+ and Kleborate ground truth |
Note
This documentation is built automatically by Read the Docs on every push to
main. The scaling-study tables and figures populate as the experiments
complete (see 8. Scaling study).