4. Subset selection

For the scaling study we generate four strictly nested subset manifests, plus a fifth covering the full eligible pool:

5,000  ⊂  10,000  ⊂  20,000  ⊂  40,000  ⊂  77,906 (full eligible)

A sample in a smaller subset is always also in every larger one. This is critical for the scaling study: any change in metrics is attributable to more data, not to a different sample mix.

Script: data/setup/select_expansion_subset.py (run once locally; it writes every manifest to assets/).

Eligibility pipeline

Starting from 88,128 KpSC samples in our Kleborate ground truth:

88,128  Kleborate-typed samples
   ↓    KpSC core species filter
85,339  K. pneumoniae / K. quasipneumoniae / K. variicola / K. africana
   ↓    BioSample → run-accession bridge (kpsc_expansion_metadata.tsv)
85,339
   ↓    inner join with ENA URL manifest (must have FASTQs)
77,906  samples eligible for download
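
The three steps above can be sketched as a filter followed by two lookups. This is a minimal illustration, not the selection script itself: the field names "biosample" and "species", the function name, and the dict-based bridge and URL tables are all hypothetical stand-ins for whatever the real TSVs contain.

```python
# Species prefixes taken from the pipeline diagram above.
CORE_SPECIES_PREFIXES = (
    "Klebsiella pneumoniae",
    "Klebsiella quasipneumoniae",
    "Klebsiella variicola",
    "Klebsiella africana",
)


def eligible_samples(kleborate_rows, biosample_to_run, ena_urls):
    """Return the rows that survive all three steps, in input order.

    kleborate_rows   : list of dicts with (hypothetical) keys "biosample", "species"
    biosample_to_run : dict mapping BioSample -> run accession (the bridge table)
    ena_urls         : dict mapping run accession -> FASTQ URL (the ENA manifest)
    """
    kept = []
    for row in kleborate_rows:
        # Step 1: KpSC core species filter.
        if not row["species"].startswith(CORE_SPECIES_PREFIXES):
            continue
        # Step 2: BioSample -> run-accession bridge.
        run = biosample_to_run.get(row["biosample"])
        if run is None:
            continue
        # Step 3: inner join with the ENA URL manifest (must have FASTQs).
        if run not in ena_urls:
            continue
        kept.append({**row, "run": run, "fastq_url": ena_urls[run]})
    return kept
```

Logging len() after each step would reproduce the counts in the diagram and makes it obvious which step drops samples.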

Stratification

Inside the eligible pool we stratify by species × Bla_Carb_acquired (carbapenemase carrier yes/no), and within each stratum cap each ST at 150 samples to avoid over-weighting common clones (ST258, ST11, …). The cap leaves 38,192 ST-controlled samples; for the 40K and 80K tiers (the latter being the full 77,906-sample pool) we draw additional samples without the ST cap to reach the target size.
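
A per-stratum ST cap of this kind might look like the sketch below. The dict keys "species", "carb" and "st" and the function name are assumptions for illustration, and keeping the first 150 samples encountered is an arbitrary choice; the real script may pick which samples survive the cap differently.

```python
from collections import defaultdict


def cap_st_per_stratum(samples, cap=150):
    """Split samples into (kept, overflow) with at most `cap` per ST per stratum.

    A stratum is the pair (species, carbapenemase carrier yes/no).
    `samples` is an iterable of dicts with hypothetical keys
    "species", "carb" and "st".
    """
    seen = defaultdict(int)   # (species, carb, st) -> number kept so far
    kept, overflow = [], []
    for sample in samples:
        key = (sample["species"], sample["carb"], sample["st"])
        if seen[key] < cap:
            seen[key] += 1
            kept.append(sample)
        else:
            # Over the cap: only usable for tiers drawn without the ST cap.
            overflow.append(sample)
    return kept, overflow
```

Here `kept` corresponds to the ST-controlled pool, and `overflow` is what the larger tiers would draw from once the capped pool is exhausted.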

Inside each stratum, samples are weighted 1.5× toward carbapenemase carriers before drawing. Carriers are the positive class CNVRock cares about, so we want them well represented in the smaller tiers.
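
As a rough illustration of the 1.5× weighting, the sketch below builds normalised sampling weights from a list of carrier flags. The function name and the `boost` parameter are hypothetical, and how the script combines these weights with the stratified draw is not shown here.

```python
def carrier_weights(is_carrier, boost=1.5):
    """Return sampling probabilities that favour carriers by `boost`.

    is_carrier : list of booleans, one per sample
    The result sums to 1, so it can be used directly as a probability
    vector for a weighted draw.
    """
    raw = [boost if flag else 1.0 for flag in is_carrier]
    total = sum(raw)
    return [w / total for w in raw]
```

With one carrier and one non-carrier this gives probabilities 0.6 and 0.4, i.e. the carrier is 1.5 times as likely to be drawn first.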

Nesting trick

We do one weighted shuffle of the full eligible pool with seed=42. The manifests are then prefixes of that single ordering:

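import numpy as np

rng = np.random.default_rng(42)         # seed=42; the script may build its rng differently
p = np.asarray(weights, dtype=float)
p = p / p.sum()                         # choice() needs probabilities that sum to 1
ordered = rng.choice(eligible, size=len(eligible), replace=False, p=p)
manifest_5k  = ordered[:5_000]
manifest_10k = ordered[:10_000]
manifest_20k = ordered[:20_000]
manifest_40k = ordered[:40_000]
manifest_80k = ordered                  # full pool (77,906 samples)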

By construction, every smaller manifest is a strict subset of every larger one. Verified at write time:

$ wc -l assets/kpsc_expansion_subset_*.tsv
 kpsc_expansion_subset_5k.tsv
 kpsc_expansion_subset_10k.tsv
 kpsc_expansion_subset_20k.tsv
 kpsc_expansion_subset_40k.tsv
 kpsc_expansion_subset_80k.tsv
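
Line counts confirm the tier sizes but not the nesting itself. A direct check could look like the sketch below, which takes the sample IDs of each manifest as plain lists, ordered from smallest to largest tier. The function name is hypothetical, and reading the IDs out of the TSVs is left out because the column layout is not described here.

```python
def check_strictly_nested(manifests):
    """Return True if each manifest is a strict subset of the next one.

    manifests : list of lists of sample IDs, smallest tier first.
    Duplicate IDs inside a manifest also make the check fail.
    """
    for small, large in zip(manifests, manifests[1:]):
        small_set, large_set = set(small), set(large)
        if len(small_set) != len(small) or len(large_set) != len(large):
            return False          # duplicated sample ID
        if not small_set < large_set:
            return False          # not a strict subset
    return True
```

Because the check uses sets, it still works if a manifest file is sorted or otherwise reordered after being written.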

Anchoring to in-flight data

If assets/kpsc_expansion_subset_5k.tsv already exists when the script is re-run (e.g. because we already started downloading those samples), the script locks in that 5K and only varies the remainder of the ordering. This keeps count files that were already produced for those samples from going to waste when the script logic is updated mid-run.
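
One way to implement this anchoring is to put the locked samples first and run the weighted shuffle only over what is left. The sketch below assumes exactly that and is not taken from the script. To stay dependency-free it uses Python's random module with Efraimidis-Spirakis keys (u ** (1 / w), sorted descending) as a stand-in for the NumPy weighted draw; all function names are hypothetical.

```python
import random


def weighted_shuffle(items, weights, rng):
    """Weighted random order without replacement (Efraimidis-Spirakis keys).

    Each item gets the key u ** (1 / w) with u uniform in [0, 1); sorting
    by key, largest first, tends to place heavier items earlier.
    All weights must be > 0.
    """
    keyed = [(rng.random() ** (1.0 / w), item) for item, w in zip(items, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed]


def anchored_order(eligible, weights, locked_5k, seed=42):
    """Full ordering that starts with the already-written 5K manifest."""
    locked = list(locked_5k)
    locked_set = set(locked)
    # Only samples outside the locked seed set are reshuffled.
    rest = [(s, w) for s, w in zip(eligible, weights) if s not in locked_set]
    rng = random.Random(seed)
    tail = weighted_shuffle([s for s, _ in rest], [w for _, w in rest], rng)
    return locked + tail
```

Larger manifests are then prefixes of this ordering exactly as before, so nesting is preserved and the 5K tier stays byte-for-byte what was already being downloaded.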

Composition of the 5K seed set

1,383 unique sequence types (STs)
2,112 carbapenemase carriers (42.2%)

Species                                          Bla_Carb_acquired  Samples
Klebsiella pneumoniae                            False                2,443
                                                 True                 1,903
Klebsiella variicola subsp. variicola            False                  231
                                                 True                    72
Klebsiella quasipneumoniae subsp. similipneum.   False                  135
                                                 True                    94
Klebsiella quasipneumoniae subsp. quasipneum.    False                   75
                                                 True                    43
Klebsiella africana                              False                    3
Klebsiella variicola subsp. tropica              False                    1
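
The per-stratum counts can be cross-checked against the headline figures with a few lines of arithmetic. The snippet below simply re-enters the table by hand (species names abbreviated as in the table) and sums it.

```python
# (species, carbapenemase carrier) -> sample count, copied from the table.
counts = {
    ("Klebsiella pneumoniae", False): 2443,
    ("Klebsiella pneumoniae", True): 1903,
    ("Klebsiella variicola subsp. variicola", False): 231,
    ("Klebsiella variicola subsp. variicola", True): 72,
    ("Klebsiella quasipneumoniae subsp. similipneum.", False): 135,
    ("Klebsiella quasipneumoniae subsp. similipneum.", True): 94,
    ("Klebsiella quasipneumoniae subsp. quasipneum.", False): 75,
    ("Klebsiella quasipneumoniae subsp. quasipneum.", True): 43,
    ("Klebsiella africana", False): 3,
    ("Klebsiella variicola subsp. tropica", False): 1,
}

total = sum(counts.values())
carriers = sum(n for (_, is_carrier), n in counts.items() if is_carrier)
print(total, carriers, f"{100 * carriers / total:.1f}%")  # 5000 2112 42.2%
```

The strata sum to exactly 5,000 samples, of which 2,112 are carriers, matching the 42.2% quoted above.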

The selection script logs the full breakdown for every tier. For the seed set, the breakdown is also committed to assets/kpsc_expansion_subset_5k_meta.tsv.