# 5. NPY stores CNVRock's training pipeline reads from **numpy NPY stores**, not the raw GATK count TSVs. Each store is a directory with: ``` data/inputs/KpSC-expansion-{N}-mq20-1000bp-npy/ ├── contigs.npy structured array [(chrom, start, end)] per bin ├── counts.npy uint32 (n_samples, n_bins) └── sample_ids.npy object array of run accessions, matches counts row order ``` Per scaling tier we materialise **two stores**: | Store | Bins | Source | |---|---|---| | Chromosome | 5,334 (NC_016845.1 only) | filter 1 kb bins by contig | | Plasmid family | 7 (one per AMR gene family) | sum 1 kb bins overlapping all member-variant gene regions in each family | **Why gene families, not individual genes?** Nearly-identical allele variants (NDM-1 vs NDM-5, OXA-48 vs OXA-181, CTX-M-15/14/65/27) sit on different reference contigs. A read from a sample carrying any one variant maps equally well to all of them; BWA places it at one position (MQ=0). Aggregating across all family members sums the bin counts back into a single per-family PCN signal — which also happens to match clinical reporting conventions (AMRFinder+, Kleborate report at the family level by default). The combined `aspera_subset_pipeline.sh` produces a single count TSV per sample with all 7,365 bins. Two separate builders then materialise the two stores from those same files. ## Chromosome NPY store `data/setup/readcounts_to_npy_kpsc.py` keeps every 1 kb bin in `--keep-contigs`. For the chromosome store: ```bash python3 data/setup/readcounts_to_npy_kpsc.py \ --counts-dir data/raw/readcounts_subset_mq20 \ --manifest assets/kpsc_expansion_subset_5k.tsv \ --out-dir data/inputs/KpSC-expansion-5k-mq20-1000bp-npy \ --keep-contigs NC_016845.1 ``` Validates that every sample has the same number of bins; raises if not. ## Plasmid-family NPY store `data/setup/plasmid_genes_to_npy_kpsc.py` reads `plasmid_gene_coords.tsv` and **sums** the 1 kb bins overlapping each gene region, then (with `--families`) aggregates by gene family per `plasmid_gene_families.tsv`: | Family | Member variants | |---|---| | `blaKPC` | blaKPC-2 | | `blaCTX-M` | blaCTX-M-15, blaCTX-M-14, blaCTX-M-65, blaCTX-M-27 | | `blaNDM` | blaNDM-1, blaNDM-5 | | `blaOXA-48-like` | blaOXA-48, blaOXA-181 | | `blaTEM` | blaTEM-1 | | `qnrB` | qnrB1 | | `aac6-Ib-cr` | aac6-Ib-cr | The sum formula: ```python count[sample, gene] = Σ bin.count where bin.CHROM == gene.contig AND bin.END > gene.start AND bin.START < gene.end ``` CLI: ```bash python3 data/setup/plasmid_genes_to_npy_kpsc.py \ --counts-dir data/raw/readcounts_subset_mq0 \ --manifest assets/kpsc_expansion_subset_5k.tsv \ --gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \ --families assets/plasmid_refs/plasmid_gene_families.tsv \ --out-dir data/inputs/KpSC-expansion-5k-mq0-plasmid-1000bp-npy ``` Output is `(n_samples, 7)` — one column per gene family. Downstream callers and evaluation code use the family names (`blaKPC`, `blaCTX-M`, `blaNDM`, `blaOXA-48-like`, `blaTEM`, `qnrB`, `aac6-Ib-cr`) directly. ## Build all four tiers ```bash for N in 5k 10k 20k 40k; do # Chromosome store python3 data/setup/readcounts_to_npy_kpsc.py \ --counts-dir data/raw/readcounts_subset_mq0 \ --manifest assets/kpsc_expansion_subset_${N}.tsv \ --out-dir data/inputs/KpSC-expansion-${N}-mq0-1000bp-npy \ --keep-contigs NC_016845.1 # Plasmid-family store python3 data/setup/plasmid_genes_to_npy_kpsc.py \ --counts-dir data/raw/readcounts_subset_mq0 \ --manifest assets/kpsc_expansion_subset_${N}.tsv \ --gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \ --families assets/plasmid_refs/plasmid_gene_families.tsv \ --out-dir data/inputs/KpSC-expansion-${N}-mq0-plasmid-1000bp-npy done ``` Each builder takes **5–15 minutes** depending on N and number of workers (the TSV → numpy conversion is parallelised across `--workers` threads, default 16).