3. Reference and intervals

HS11286_extended.fasta

The base reference is the HS11286 assembly of K. pneumoniae, augmented with 11 plasmid contigs covering the Phase D AMR panel:

Contig

Length

Genes of interest

NC_016845.1

5,333,942 bp

KpSC chromosome (carries blaSHV)

NC_016846.1

111,195 bp

blaKPC-2

MK552109.1

180,286 bp

blaCTX-M-15

MZ606384.2

140,133 bp

blaNDM-1

NZ_CP031850.1

239,020 bp

blaCTX-M-14

NZ_KX236178.1

121,348 bp

blaTEM-1

NC_016980.1

267,242 bp

qnrB1

CP006662.2

85,161 bp

aac(6')-Ib-cr

JN626286.1

61,881 bp

blaOXA-48

CP034201.2

372,826 bp

blaNDM-5

CP113224.1

53,691 bp

blaOXA-181

CP030319.1

169,824 bp

blaCTX-M-65

CP138680.1

221,802 bp

blaCTX-M-27

Total: ~7.2 Mb (chromosome + 11 plasmid backbones).

1 kb interval list

CNVRock trains on per-bin read counts. The interval list defines those bins; this used to be a bug source — the original Phase 2 pipeline pointed at kpsc-whole-chrom.interval_list which declared the chromosome as a single interval. GATK CollectReadCounts does not subdivide intervals, so it produced one-bin-per-sample count files unusable for CNVRock.

Fixed by regenerating with explicit 1 kb binning:

gatk PreprocessIntervals \
    --reference HS11286_extended.fasta \
    --bin-length 1000 \
    --padding 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output HS11286_extended_1kb.interval_list

Output: 7,365 bins total = 5,334 on NC_016845.1 + 2,031 across the 12 other contigs.

Plasmid gene coordinates

assets/plasmid_refs/plasmid_gene_coords.tsv defines the 12 plasmid gene regions used for gene-level PCN aggregation:

gene            contig         start    end     absent_threshold
blaKPC-2        NC_016846.1    20557    21435   0.20
blaCTX-M-15     MK552109.1     119392   120264  0.50
blaNDM-1        MZ606384.2     90937    91746   0.20
blaCTX-M-14     NZ_CP031850.1  113509   114381  1.00
blaTEM-1        NZ_KX236178.1  66790    67650   0.20
qnrB1           NC_016980.1    18421    19100   0.20
aac6-Ib-cr      CP006662.2     38208    38689   0.10
blaOXA-48       JN626286.1     5445     6242    0.20
blaNDM-5        CP034201.2     69698    70263   0.20
blaOXA-181      CP113224.1     28840    29637   0.20
blaCTX-M-65     CP030319.1     6227     7049    0.20
blaCTX-M-27     CP138680.1     183201   184385  0.20

The absent_threshold is the per-gene PCN cutoff below which the plasmid is considered absent (handles low-mappability/duplicate-sequence cases).

Mapping quality threshold

GATK’s --minimum-mapping-quality 20 keeps reads where BWA estimates ≤1% mis-placement probability. See 9. Methods (parameter choices) for the rationale (the standard GATK HaplotypeCaller / DELLY default), and why we lowered from MQ=40 (the Phase 1 value) — the extended-reference plasmids share sequence with each other, so MQ=40 was discarding the multi-mapping reads we need for plasmid gene-level CNV calls.