3. Reference and intervals
HS11286_extended.fasta
The base reference is the HS11286 assembly of K. pneumoniae, augmented with 11 plasmid contigs covering the Phase D AMR panel:
Contig |
Length |
Genes of interest |
|---|---|---|
NC_016845.1 |
5,333,942 bp |
KpSC chromosome (carries |
NC_016846.1 |
111,195 bp |
|
MK552109.1 |
180,286 bp |
|
MZ606384.2 |
140,133 bp |
|
NZ_CP031850.1 |
239,020 bp |
|
NZ_KX236178.1 |
121,348 bp |
|
NC_016980.1 |
267,242 bp |
|
CP006662.2 |
85,161 bp |
|
JN626286.1 |
61,881 bp |
|
CP034201.2 |
372,826 bp |
|
CP113224.1 |
53,691 bp |
|
CP030319.1 |
169,824 bp |
|
CP138680.1 |
221,802 bp |
|
Total: ~7.2 Mb (chromosome + 11 plasmid backbones).
1 kb interval list
CNVRock trains on per-bin read counts. The interval list defines those bins;
this used to be a bug source — the original Phase 2 pipeline pointed at
kpsc-whole-chrom.interval_list which declared the chromosome as a single
interval. GATK CollectReadCounts does not subdivide intervals, so it
produced one-bin-per-sample count files unusable for CNVRock.
Fixed by regenerating with explicit 1 kb binning:
gatk PreprocessIntervals \
--reference HS11286_extended.fasta \
--bin-length 1000 \
--padding 0 \
--interval-merging-rule OVERLAPPING_ONLY \
--output HS11286_extended_1kb.interval_list
Output: 7,365 bins total = 5,334 on NC_016845.1 + 2,031 across the 12 other contigs.
Plasmid gene coordinates
assets/plasmid_refs/plasmid_gene_coords.tsv defines the 12 plasmid gene
regions used for gene-level PCN aggregation:
gene contig start end absent_threshold
blaKPC-2 NC_016846.1 20557 21435 0.20
blaCTX-M-15 MK552109.1 119392 120264 0.50
blaNDM-1 MZ606384.2 90937 91746 0.20
blaCTX-M-14 NZ_CP031850.1 113509 114381 1.00
blaTEM-1 NZ_KX236178.1 66790 67650 0.20
qnrB1 NC_016980.1 18421 19100 0.20
aac6-Ib-cr CP006662.2 38208 38689 0.10
blaOXA-48 JN626286.1 5445 6242 0.20
blaNDM-5 CP034201.2 69698 70263 0.20
blaOXA-181 CP113224.1 28840 29637 0.20
blaCTX-M-65 CP030319.1 6227 7049 0.20
blaCTX-M-27 CP138680.1 183201 184385 0.20
The absent_threshold is the per-gene PCN cutoff below which the plasmid
is considered absent (handles low-mappability/duplicate-sequence cases).
Mapping quality threshold
GATK’s --minimum-mapping-quality 20 keeps reads where BWA estimates ≤1%
mis-placement probability. See 9. Methods (parameter choices) for the rationale (the
standard GATK HaplotypeCaller / DELLY default), and why we lowered from
MQ=40 (the Phase 1 value) — the extended-reference plasmids share sequence
with each other, so MQ=40 was discarding the multi-mapping reads we need for
plasmid gene-level CNV calls.