2. Data acquisition
KpSC short-read sequencing data lives on ENA, addressed by run accession
(ERR…, DRR…, SRR…). For ~78,000 samples this is multiple TB of FASTQ,
which precludes brittle HTTPS streaming.
Why Aspera, not wget
We tried both:
Mode |
Single-task |
10 concurrent |
50 concurrent |
|---|---|---|---|
|
✅ works |
❌ R2 fails — instant rejection |
❌ all fail |
|
✅ works |
✅ 10/10 succeed in ~46s |
❌ ~50% fail |
EBI imposes an undocumented per-source-IP concurrent connection limit on HTTPS that drops R2 (the second connection per task) almost instantly. Aspera’s FASP protocol multiplexes over a single SSH session and is the EBI-recommended bulk-download protocol, so we use it throughout.
For details, see EBI’s Bio Image Archive download
guide, which is where
the public-domain ed25519 SSH key used by era-fasp@fasp.sra.ebi.ac.uk is
distributed.
Setup on HPC
# In the cnvrock conda env:
conda install -y -c bioconda aspera-cli
# The bioconda package ships ascli (Ruby wrapper) but not ascp itself.
# Install ascp via:
ascli config ascp install
# Drops the binary at ~/.aspera/sdk/ascp
# Save the EBI public key (one-time):
curl -sL https://www.ebi.ac.uk/bioimage-archive/help-download/ \
| sed -n '/BEGIN OPENSSH PRIVATE KEY/,/END OPENSSH PRIVATE KEY/p' \
> ~/.aspera/sdk/ebi_aspera_key.openssh
chmod 600 ~/.aspera/sdk/ebi_aspera_key.openssh
Per-task pipeline
hpc/aspera_subset_pipeline.sh runs per SLURM array task:
Read one row of the subset manifest TSV (cols:
accession,layout,r1_url,r2_url)ascpdownload R1 (and R2 if PAIRED) todata/raw/fastq_subset/BWA-MEM paired-end align to
HS11286_extended.fasta, thensamtools sortandsamtools index. Output todata/raw/bam_subset/.GATK CollectReadCounts at 1 kb resolution across chrom + plasmid contigs, with
--minimum-mapping-quality 0. Output count TSV todata/raw/readcounts_subset_mq0/{ACC}.counts.tsv.
Important
MQ=0 keeps multi-mapped reads — which is essential for AMR detection because nearly-identical allele variants (NDM-1 vs NDM-5, OXA-48 vs OXA-181, CTX-M-15/14/65/27) live on different reference contigs and reads share homology across them. See 9. Methods (parameter choices) for the diagnosis.
FASTQs and BAMs are kept on disk so we can re-extract counts at different MQ thresholds without re-downloading.
SLURM submission
Concurrency is capped at 10 per array to respect Aspera’s per-IP limit:
# Submit the 20K manifest as four chunks of ~5,000 array tasks each
# (SLURM MaxArraySize=5000 on LSHTM HPC).
for chunk in 1 2 3 4; do
case $chunk in
1) OFFSET=0; ARR=1-4999 ;;
2) OFFSET=4999; ARR=1-4999 ;;
3) OFFSET=9998; ARR=1-4999 ;;
4) OFFSET=14997; ARR=1-5000 ;;
esac
sbatch --parsable \
--export=ALL,MANIFEST=$REPO/assets/kpsc_expansion_subset_20k.tsv,BATCH_OFFSET=$OFFSET \
--array=$ARR%10 hpc/aspera_subset_pipeline.sh
done
Throughput
Stage |
Wall time per sample (avg) |
|---|---|
|
30–60 s |
BWA mem align (4 CPUs, ~5 M reads, 7.2 Mb ref) |
2–3 min |
|
~30 s |
GATK CollectReadCounts |
~30 s |
Total per task |
~3–5 min median |
At concurrency 10, the pipeline sustains ~150–250 samples/hour. The 5K tier took ~30 hours; the 10K and 20K tiers add ~25 and ~50 hours respectively.
Compliance with HPC network policy
A single-IP HTTPS surge from the HPC earlier caused a brief outage. We coordinated with HPC support to limit concurrent connections; Aspera was explicitly endorsed as a low-risk alternative. The CONCUR=10 setting is well within EBI’s published guidance.