5. NPY stores

CNVRock’s training pipeline reads from numpy NPY stores, not the raw GATK count TSVs. Each store is a directory with:

data/inputs/KpSC-expansion-{N}-mq20-1000bp-npy/
├── contigs.npy      structured array [(chrom, start, end)] per bin
├── counts.npy       uint32 (n_samples, n_bins)
└── sample_ids.npy   object array of run accessions, matches counts row order

Per scaling tier we materialise two stores:

Store	Bins	Source
Chromosome	5,334 (NC_016845.1 only)	filter 1 kb bins by contig
Plasmid family	7 (one per AMR gene family)	sum 1 kb bins overlapping all member-variant gene regions in each family

Why gene families, not individual genes? Nearly-identical allele variants (NDM-1 vs NDM-5, OXA-48 vs OXA-181, CTX-M-15/14/65/27) sit on different reference contigs. A read from a sample carrying any one variant maps equally well to all of them; BWA places it at one position (MQ=0). Aggregating across all family members sums the bin counts back into a single per-family PCN signal — which also happens to match clinical reporting conventions (AMRFinder+, Kleborate report at the family level by default).

The combined aspera_subset_pipeline.sh produces a single count TSV per sample with all 7,365 bins. Two separate builders then materialise the two stores from those same files.

Chromosome NPY store

data/setup/readcounts_to_npy_kpsc.py keeps every 1 kb bin in --keep-contigs. For the chromosome store:

python3 data/setup/readcounts_to_npy_kpsc.py \
    --counts-dir   data/raw/readcounts_subset_mq20 \
    --manifest     assets/kpsc_expansion_subset_5k.tsv \
    --out-dir      data/inputs/KpSC-expansion-5k-mq20-1000bp-npy \
    --keep-contigs NC_016845.1

Validates that every sample has the same number of bins; raises if not.

Plasmid-family NPY store

data/setup/plasmid_genes_to_npy_kpsc.py reads plasmid_gene_coords.tsv and sums the 1 kb bins overlapping each gene region, then (with --families) aggregates by gene family per plasmid_gene_families.tsv:

Family	Member variants
`blaKPC`	blaKPC-2
`blaCTX-M`	blaCTX-M-15, blaCTX-M-14, blaCTX-M-65, blaCTX-M-27
`blaNDM`	blaNDM-1, blaNDM-5
`blaOXA-48-like`	blaOXA-48, blaOXA-181
`blaTEM`	blaTEM-1
`qnrB`	qnrB1
`aac6-Ib-cr`	aac6-Ib-cr

The sum formula:

count[sample, gene] = Σ bin.count
                      where bin.CHROM == gene.contig
                        AND bin.END    >  gene.start
                        AND bin.START  <  gene.end

CLI:

python3 data/setup/plasmid_genes_to_npy_kpsc.py \
    --counts-dir   data/raw/readcounts_subset_mq0 \
    --manifest     assets/kpsc_expansion_subset_5k.tsv \
    --gene-coords  assets/plasmid_refs/plasmid_gene_coords.tsv \
    --families     assets/plasmid_refs/plasmid_gene_families.tsv \
    --out-dir      data/inputs/KpSC-expansion-5k-mq0-plasmid-1000bp-npy

Output is (n_samples, 7) — one column per gene family. Downstream callers and evaluation code use the family names (blaKPC, blaCTX-M, blaNDM, blaOXA-48-like, blaTEM, qnrB, aac6-Ib-cr) directly.

Build all four tiers

for N in 5k 10k 20k 40k; do
    # Chromosome store
    python3 data/setup/readcounts_to_npy_kpsc.py \
        --counts-dir data/raw/readcounts_subset_mq0 \
        --manifest   assets/kpsc_expansion_subset_${N}.tsv \
        --out-dir    data/inputs/KpSC-expansion-${N}-mq0-1000bp-npy \
        --keep-contigs NC_016845.1

    # Plasmid-family store
    python3 data/setup/plasmid_genes_to_npy_kpsc.py \
        --counts-dir  data/raw/readcounts_subset_mq0 \
        --manifest    assets/kpsc_expansion_subset_${N}.tsv \
        --gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \
        --families    assets/plasmid_refs/plasmid_gene_families.tsv \
        --out-dir     data/inputs/KpSC-expansion-${N}-mq0-plasmid-1000bp-npy
done

Each builder takes 5–15 minutes depending on N and number of workers (the TSV → numpy conversion is parallelised across --workers threads, default 16).