# 4. Subset selection

For the scaling study we generate **four strictly nested subset manifests**, plus a fifth manifest for the full eligible pool:

```
5,000 ⊂ 10,000 ⊂ 20,000 ⊂ 40,000 ⊂ 77,906 (full eligible)
```

A sample in a smaller subset is **always also in every larger one**. This is critical for the scaling study: any change in metrics is attributable to *more data*, not to a different sample mix.

Script: `data/setup/select_expansion_subset.py` (run once locally; it writes all manifests to `assets/`).

## Eligibility pipeline

Starting from 88,128 Kleborate-typed samples in our ground truth:

```
88,128  Kleborate-typed samples
   ↓ KpSC core species filter
85,339  K. pneumoniae / K. quasipneumoniae / K. variicola / K. africana
   ↓ BioSample → run-accession bridge (kpsc_expansion_metadata.tsv)
85,339
   ↓ inner join with ENA URL manifest (must have FASTQs)
77,906  samples eligible for download
```

## Stratification

Inside the eligible pool we **stratify by species × Bla_Carb_acquired** (carbapenemase carrier yes/no), and within each stratum **cap each ST at 150 samples** to avoid over-weighting common clones (ST258, ST11, …). The cap leaves 38,192 ST-controlled samples, which is fewer than the two largest tiers need; for the **40K and 80K tiers** we therefore draw additional samples without the ST cap to reach the target size. (The "80K" tier is the full eligible pool of 77,906 samples.)

Inside each stratum, samples are weighted **1.5× toward carbapenemase carriers** before drawing. They are the positive class CNVRock cares about, so we want them well represented in the smaller tiers.

## Nesting trick

We do **one** weighted shuffle of the full eligible pool with `seed=42`. The manifests are then **prefixes** of that single ordering:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed=42; the exact constructor is assumed
p = np.asarray(weights, dtype=float)
p /= p.sum()  # choice() requires probabilities that sum to 1
# One weighted draw without replacement over the whole pool
ordered = rng.choice(eligible, size=len(eligible), replace=False, p=p)
manifest_5k = ordered[:5_000]
manifest_10k = ordered[:10_000]
manifest_20k = ordered[:20_000]
manifest_40k = ordered[:40_000]
manifest_80k = ordered  # full pool (77,906 samples)
```

By construction, **every smaller manifest is a strict subset** of every larger one.
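
The subset property can also be checked directly on the written files rather than relied on by construction. The following is a minimal sketch, not part of the selection script: it assumes each manifest is a TSV with one header row and the sample ID in the first column (the column layout is an assumption), and the helper names `read_sample_ids` and `check_nested` are hypothetical.

```python
import csv
from pathlib import Path


def read_sample_ids(path):
    """Return sample IDs from a manifest TSV, in file order.

    Assumes a header row and the sample ID in the first column
    (an assumption; adjust to the real manifest layout).
    """
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row
        return [row[0] for row in reader if row]


def check_nested(tiers):
    """Check that each tier is a strict subset of the next larger one.

    `tiers` is a list of ID lists ordered from smallest to largest.
    Returns the (small_index, large_index) pairs that violate nesting;
    an empty list means every tier is strictly nested in the next.
    """
    violations = []
    for i in range(len(tiers) - 1):
        small, large = set(tiers[i]), set(tiers[i + 1])
        if not small < large:  # "<" on sets is the strict-subset test
            violations.append((i, i + 1))
    return violations
```

To use it, load the five `assets/kpsc_expansion_subset_*.tsv` files with `read_sample_ids`, pass them to `check_nested` in order of size, and expect an empty list back. Checking adjacent tiers is sufficient because the strict-subset relation is transitive.
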
Verified at write time:

```bash
$ wc -l assets/kpsc_expansion_subset_*.tsv
   5001 kpsc_expansion_subset_5k.tsv
  10001 kpsc_expansion_subset_10k.tsv
  20001 kpsc_expansion_subset_20k.tsv
  40001 kpsc_expansion_subset_40k.tsv
  77907 kpsc_expansion_subset_80k.tsv
```

Each line count is one more than the tier size, which is consistent with a single header row per file.

## Anchoring to in-flight data

If `assets/kpsc_expansion_subset_5k.tsv` **already exists** when the script is re-run (e.g. because we already started downloading those samples), the script **locks in** that 5K and only varies the *remainder* of the ordering. This guards against wasting count files when the script logic is updated mid-run.

## Composition of the 5K seed set

```
1,383 unique sequence types (STs)
2,112 carbapenemase carriers (42.2%)

Klebsiella pneumoniae                           False  2,443
                                                True   1,903
Klebsiella variicola subsp. variicola           False    231
                                                True      72
Klebsiella quasipneumoniae subsp. similipneum.  False    135
                                                True      94
Klebsiella quasipneumoniae subsp. quasipneum.   False     75
                                                True      43
Klebsiella africana                             False      3
Klebsiella variicola subsp. tropica             False      1
```

The selection script logs the full per-tier breakdown. For the seed set, the breakdown is also committed to `assets/kpsc_expansion_subset_5k_meta.tsv`.
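
The composition summary above could be recomputed from a per-sample table along the following lines. This is a minimal sketch under stated assumptions: the column names `species`, `ST`, and `carrier` are hypothetical, as is the encoding of the carrier flag as the strings `True` / `False`; the real metadata file may be laid out differently.

```python
import csv
from collections import Counter


def load_rows(path):
    """Yield one dict per sample from a tab-separated file with a header."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def summarise(rows):
    """Tally unique STs, carrier count, and species x carrier strata.

    `rows` is an iterable of dicts with the (assumed) keys
    'species', 'ST' and 'carrier', where 'carrier' is "True" or "False".
    """
    rows = list(rows)
    strata = Counter((r["species"], r["carrier"]) for r in rows)
    n_carriers = sum(1 for r in rows if r["carrier"] == "True")
    return {
        "n_samples": len(rows),
        "n_unique_sts": len({r["ST"] for r in rows}),
        "n_carriers": n_carriers,
        # Percentage of carriers, rounded to one decimal place
        "carrier_pct": round(100 * n_carriers / len(rows), 1) if rows else 0.0,
        "strata": strata,
    }
```

Calling `summarise(load_rows(path))` on a per-sample table for the seed set should reproduce the totals listed above if the assumed columns match. The `strata` counter gives the species × carrier counts in the same shape as the table.
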