CNVRock
Pipeline
1. Overview
Problem
Approach
Architecture
Why the scaling study
2. Data acquisition
Why Aspera, not wget
Setup on HPC
Per-task pipeline
SLURM submission
Throughput
Compliance with HPC network policy
3. Reference and intervals
HS11286_extended.fasta
1 kb interval list
Plasmid gene coordinates
Mapping quality threshold
4. Subset selection
Eligibility pipeline
Stratification
Nesting trick
Anchoring to in-flight data
Composition of the 5K seed set
5. NPY stores
Chromosome NPY store
Plasmid-family NPY store
Build all four tiers
6. Training
Entry point
SLURM wrapper
Config schema
Architecture: 1D Conv-VAE
Segmenter: Gaussian HMM
Output artefacts
7. Evaluation
Ground truth
Metrics
Two-sided evaluation
Hold-out subset
Output
Manuscript
8. Scaling study
Tiers
Headline results
Chromosomal blaSHV (extra-copy)
Plasmid genes (presence)
Reading the curve
Early observation (MQ=40 baseline, since superseded)
Reproducibility note
9. Methods (parameter choices)
Hybrid mapping-quality thresholds: chromosome at MQ ≥ 20, plasmid at MQ = 0
Manuscript wording
Earlier MQ choice (legacy, MQ ≥ 0 single-pass)
Why MQ = 0
Why this is methodologically sound
Why gene-family aggregation, not individual genes
Concurrency cap (10 simultaneous ascp)
Stratification: species × Bla_Carb × ST cap
VAE β-warmup
CNV-pattern auxiliary loss
HMM 6 states with self-transition 0.80
Per-gene PCN thresholds
Chromosomal CRR thresholds
Reproducibility seed
10. Reproducibility
Software
HPC environment
End-to-end recipe
Commit hashes
Random seeds
Storage
Index