# 9. Methods (parameter choices)

This page documents **non-obvious parameter choices** and **why** — the bits
a reviewer will ask about.

## Hybrid mapping-quality thresholds: chromosome at MQ ≥ 20, plasmid at MQ = 0

After diagnosing why MQ = 40 (Phase 1) made every variant-multi-mapped AMR
gene invisible, we initially ran the entire pipeline at **MQ = 0**. That
recovered all plasmid AMR genes but degraded the **chromosomal blaSHV**
detection (call rate 0.00 at 10K samples — the model produced no calls
because MQ = 0 admits noise from repetitive elements scattered along the
~5.3 Mb chromosome, which competes with the genuine *blaSHV* CNV signal).

The final pipeline runs **two GATK passes per sample** from the same BAM:

| Output | MQ filter | Used by |
|---|---|---|
| `readcounts_subset_mq20/{ACC}.counts.tsv` | **≥ 20** | Chromosome NPY store (`data/inputs/KpSC-expansion-{N}-1000bp-npy/`) |
| `readcounts_subset_mq0/{ACC}.counts.tsv` | **= 0** | Plasmid-family NPY store (`data/inputs/KpSC-expansion-{N}-mq0-plasmid-1000bp-npy/`) |

Rationale per component:

- **Chromosome (NC_016845.1, ~5.3 Mb)** — largely uniquely-mappable. MQ ≥ 20
  is the GATK / DELLY / samtools standard for variant calling, corresponding
  to ≤ 1 % mis-placement probability. Excluding the small fraction of
  repeat/transposon multi-mappers gives the 1D Conv-VAE a cleaner per-bin
  signal at the expense of small "blind spots" around those repeats —
  acceptable because *blaSHV* (the chromosomal CNV target) is not in such
  a region.
- **Plasmid families** — near-identical allele variants (NDM-1 vs NDM-5,
  OXA-48 vs OXA-181, CTX-M-15/14/65/27) sit on different reference contigs.
  Reads from a sample's variant map equally to all family members → MQ = 0
  from BWA. **MQ ≥ 1 would discard them entirely.** Keeping them at MQ = 0
  and summing bin counts across all member-variant CDSs in each family
  (`plasmid_genes_to_npy_kpsc.py --families plasmid_gene_families.tsv`)
  recovers the per-sample per-family signal we want — and matches clinical
  reporting (AMRFinder+, Kleborate report at the family level).

The two MQ passes share the same BWA-MEM BAM; only GATK `CollectReadCounts`
is run twice (~30 s of CPU per sample for the second pass), no extra
downloads or alignments. The `hpc/recount_mq20_from_bams.sh` script handles
the MQ=20 pass from persistent BAMs in `data/raw/bam_subset/`.

### Manuscript wording

> Read depths were quantified twice from each BAM: at MQ ≥ 20 across the
> KpSC chromosome (NC_016845.1) for chromosomal CNV detection, and at
> MQ ≥ 0 across the plasmid contigs to retain reads multi-mapping between
> near-identical AMR allele variants. Per-sample plasmid depths were
> aggregated by gene family before downstream analysis.

## Earlier MQ choice (legacy, MQ ≥ 0 single-pass)

This is the parameter choice that took the longest to get right. The
diagnostic narrative is below; the final settings are:

- GATK `CollectReadCounts --minimum-mapping-quality 0`
- Plasmid PCN evaluated by **gene family** (sum of bin counts across all
  allele-variant CDSs in each family), using
  `assets/plasmid_refs/plasmid_gene_families.tsv`.

### Why MQ = 0

We first ran the pipeline at **MQ = 40** (the Phase 1 value), which produced
**zero plasmid-gene counts** for `blaKPC-2`, `blaCTX-M-15`, `blaTEM-1`,
`aac6-Ib-cr` across all 5,000 samples — while `qnrB1`, `blaOXA-181`,
`blaNDM-5` were fine. We assumed multi-mapping with too-strict MQ and
relaxed to MQ = 20 (the GATK / DELLY / samtools default). It did not help.

Cross-checking the gene coordinates against the curated GenBank annotations
of every contig in `HS11286_extended.fasta` revealed that
`plasmid_gene_coords.tsv` **was correct all along** — the coordinates match
GenBank to within ≤ 3 bp for every gene.

The real cause is **multi-mapping between near-identical allele variants**:

| Family | Member CDSs (in our ref) | nt identity |
|---|---|---|
| `blaNDM` | NDM-1 (MZ606384.2), NDM-5 (CP034201.2) | ~99.8 % (2 aa / 6 nt difference) |
| `blaOXA-48-like` | OXA-48 (JN626286.1), OXA-181 (CP113224.1) | ~98 % (4 aa difference) |
| `blaCTX-M` | CTX-M-15 (MK552109.1), CTX-M-14 (NZ_CP031850.1), CTX-M-65 (CP030319.1), CTX-M-27 (CP138680.1) | ~95–99 % within each subgroup |

A read derived from any one of these variants maps **equally well** to all of
them in the same family. BWA assigns MQ = 0 to these multi-mappers and
places the read at one location at random. At any MQ ≥ 1 they are filtered
out entirely.

**MQ = 0 keeps them.** Because each multi-mapped read still ends up at *one*
location (chosen by BWA), individual variant CDSs receive a fraction of the
true coverage. Summing bin counts across all family members recovers the
full per-sample per-family signal.

### Why this is methodologically sound

1. The multi-mapping is **between annotated allele variants of the same gene
   family** — not between unrelated genomic regions. Counting these reads as
   "the family is present" is biologically correct.
2. **Clinical reporting works at family level.** AMRFinder+ output is
   variant-level but Kleborate-style summary tables report `Bla_Carb_acquired
   = NDM-1` or `NDM-5` interchangeably; treatment recommendations are
   family-level.
3. For chromosomal CNV detection (the `blaSHV` extra-copy story) the KpSC
   chromosome is largely uniquely-mappable. MQ = 0 vs MQ = 20 affects only
   repeat / transposon regions which the VAE absorbs as noise. Single
   pipeline, single count file per sample.
4. The cost of getting MQ = 0 wrong (over-counts from spurious mappings) is
   small: BWA's secondary-alignment quality scores still exclude very poor
   matches, and the 1 kb bin is large enough that occasional spurious
   placements don't dominate the per-sample family count.

### Why gene-family aggregation, not individual genes

Individual blaNDM-1 vs blaNDM-5 detection is fundamentally impossible from
short reads when both reference CDSs are in the alignment target: BWA
*cannot* distinguish them above MQ = 0. AMRFinder+ resolves this differently
— it works on **assemblies**, not raw reads, and exploits *the rest of the
plasmid backbone* as discriminating context. CNVRock works on read depth,
so we accept the limitation and report at family level. The variant call
(if needed) comes from the Kleborate column in the ground-truth join.

## Concurrency cap (10 simultaneous `ascp`)

EBI's Aspera SSH endpoint refuses more than ~10–15 concurrent connections
per source IP. We measured:

| Concurrency | Failure rate |
|---|---|
| 1 | 0 % |
| 10 | 0 % |
| 20 | ≈ 100 % R2 fail |
| 50 | ≈ 100 % fail |

`hpc/aspera_subset_pipeline.sh` is submitted with `--array=…%10` to cap
running tasks at 10. SLURM `MaxArraySize = 5000` is worked around by
chunking the manifest into four ≤ 5000-task arrays with `BATCH_OFFSET`.

## Stratification: species × Bla_Carb × ST cap

See {doc}`04_subset_selection`. ST cap is **150 samples per ST per stratum**
to prevent ST258 / ST11 from dominating the smaller tiers. The full-pool
("80K") tier bypasses the cap.

Sampling within strata is **weighted 1.5× toward carbapenemase carriers**.

## VAE β-warmup

Linear ramp from β = 0 to β = `max_beta` = 1.0 over the first 20 epochs,
then constant. Without β-warmup the KL term dominates early training and
the latent collapses to N(0,1) regardless of input — model produces flat
reconstructions.

## CNV-pattern auxiliary loss

Weight 1.0, warmup 30 epochs. This is a sine-wave-inspired loss that
penalises latent representations failing to track depth gradients across
copy-number boundaries. Empirically required to break the trivial solution
where the VAE encodes only average per-sample depth and reconstructs a flat
profile.

## HMM 6 states with self-transition 0.80

State means initialised at CN ∈ {0, 0.5, 1, 1.5, 2, 3}. Self-transition
0.80 ⇒ expected segment length ≈ 5 kb at 1 kb resolution. Lower
self-transitions over-segment the chromosome; higher values miss small
events (e.g. transposon-bounded gene amplifications).

`hmm_low_cov_threshold = 10` masks bins with < 10 reads — below that the
Gaussian assumption breaks down and the bin is re-imputed from its
neighbours.

## Per-gene PCN thresholds

`pcn_absent_threshold = 0.20`, `pcn_amp_threshold = 1.50` are defaults; the
**per-gene `absent_threshold`** in `plasmid_gene_coords.tsv` overrides
per-gene where the global default would mis-call (e.g. blaCTX-M-15 at 0.50,
blaCTX-M-14 at 1.00 because of higher cross-mapping).

## Chromosomal CRR thresholds

`cnv_crr_amp_threshold = 1.75`, `cnv_crr_gate_threshold = 1.75`. A
gene-region CRR > 1.75 is called as amplified. The gate threshold also
1.75 means a single threshold for both the amp call and the
"signal-strong-enough-to-call" gate. `cnv_min_cn1_proportion = 0.55`
requires the rest of the chromosome to mostly be CN=1 before trusting
the gene-region call (rejects whole-chromosome aneuploidy artefacts).

## Reproducibility seed

The subset selector uses `numpy.random.default_rng(seed=42)`. The same seed
produces identical manifests every time, so the scaling study is fully
deterministic from the same Kleborate ground-truth input.