5. NPY stores
CNVRock’s training pipeline reads from numpy NPY stores, not the raw GATK count TSVs. Each store is a directory with:
data/inputs/KpSC-expansion-{N}-mq20-1000bp-npy/
├── contigs.npy structured array [(chrom, start, end)] per bin
├── counts.npy uint32 (n_samples, n_bins)
└── sample_ids.npy object array of run accessions, matches counts row order
Per scaling tier we materialise two stores:
Store |
Bins |
Source |
|---|---|---|
Chromosome |
5,334 (NC_016845.1 only) |
filter 1 kb bins by contig |
Plasmid family |
7 (one per AMR gene family) |
sum 1 kb bins overlapping all member-variant gene regions in each family |
Why gene families, not individual genes? Nearly-identical allele variants (NDM-1 vs NDM-5, OXA-48 vs OXA-181, CTX-M-15/14/65/27) sit on different reference contigs. A read from a sample carrying any one variant maps equally well to all of them; BWA places it at one position (MQ=0). Aggregating across all family members sums the bin counts back into a single per-family PCN signal — which also happens to match clinical reporting conventions (AMRFinder+, Kleborate report at the family level by default).
The combined aspera_subset_pipeline.sh produces a single count TSV per
sample with all 7,365 bins. Two separate builders then materialise the two
stores from those same files.
Chromosome NPY store
data/setup/readcounts_to_npy_kpsc.py keeps every 1 kb bin in
--keep-contigs. For the chromosome store:
python3 data/setup/readcounts_to_npy_kpsc.py \
--counts-dir data/raw/readcounts_subset_mq20 \
--manifest assets/kpsc_expansion_subset_5k.tsv \
--out-dir data/inputs/KpSC-expansion-5k-mq20-1000bp-npy \
--keep-contigs NC_016845.1
Validates that every sample has the same number of bins; raises if not.
Plasmid-family NPY store
data/setup/plasmid_genes_to_npy_kpsc.py reads plasmid_gene_coords.tsv and
sums the 1 kb bins overlapping each gene region, then (with --families)
aggregates by gene family per plasmid_gene_families.tsv:
Family |
Member variants |
|---|---|
|
blaKPC-2 |
|
blaCTX-M-15, blaCTX-M-14, blaCTX-M-65, blaCTX-M-27 |
|
blaNDM-1, blaNDM-5 |
|
blaOXA-48, blaOXA-181 |
|
blaTEM-1 |
|
qnrB1 |
|
aac6-Ib-cr |
The sum formula:
count[sample, gene] = Σ bin.count
where bin.CHROM == gene.contig
AND bin.END > gene.start
AND bin.START < gene.end
CLI:
python3 data/setup/plasmid_genes_to_npy_kpsc.py \
--counts-dir data/raw/readcounts_subset_mq0 \
--manifest assets/kpsc_expansion_subset_5k.tsv \
--gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \
--families assets/plasmid_refs/plasmid_gene_families.tsv \
--out-dir data/inputs/KpSC-expansion-5k-mq0-plasmid-1000bp-npy
Output is (n_samples, 7) — one column per gene family. Downstream callers
and evaluation code use the family names (blaKPC, blaCTX-M, blaNDM,
blaOXA-48-like, blaTEM, qnrB, aac6-Ib-cr) directly.
Build all four tiers
for N in 5k 10k 20k 40k; do
# Chromosome store
python3 data/setup/readcounts_to_npy_kpsc.py \
--counts-dir data/raw/readcounts_subset_mq0 \
--manifest assets/kpsc_expansion_subset_${N}.tsv \
--out-dir data/inputs/KpSC-expansion-${N}-mq0-1000bp-npy \
--keep-contigs NC_016845.1
# Plasmid-family store
python3 data/setup/plasmid_genes_to_npy_kpsc.py \
--counts-dir data/raw/readcounts_subset_mq0 \
--manifest assets/kpsc_expansion_subset_${N}.tsv \
--gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \
--families assets/plasmid_refs/plasmid_gene_families.tsv \
--out-dir data/inputs/KpSC-expansion-${N}-mq0-plasmid-1000bp-npy
done
Each builder takes 5–15 minutes depending on N and number of workers (the
TSV → numpy conversion is parallelised across --workers threads, default
16).