# 2. Data acquisition

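KpSC short-read sequencing data is hosted on the European Nucleotide Archive
(ENA) and addressed by run accession (`ERR…`, `DRR…`, `SRR…`). For ~78,000
samples this amounts to **tens of TB of FASTQ** (about 23 TB if the ~300 MB per
sample in the throughput table below is representative), too much to fetch
reliably over plain HTTPS.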

## Why Aspera, not wget

We tried both:

| Mode | Single-task | 10 concurrent | 50 concurrent |
|---|---|---|---|
| `wget` (HTTPS) | ✅ works | ❌ R2 fails — instant rejection | ❌ all fail |
| `ascp` (Aspera) | ✅ works | ✅ 10/10 succeed in ~46s | ❌ ~50% fail |

EBI imposes an undocumented per-source-IP **concurrent connection limit** on
HTTPS that drops R2 (the second connection per task) almost instantly. Aspera's
FASP protocol avoids this path: it authenticates over an SSH control connection
and carries the data over UDP. It is also the
**EBI-recommended bulk-download protocol**, so we use it throughout.
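
A comparison like the one in the table can be reproduced with a small probe that launches N copies of a download command in parallel and counts how many exit cleanly. The sketch below is one assumed way to run such a test, not the script that produced the table; `probe_concurrency` is a hypothetical helper, and the command passed to it is whichever `wget` or `ascp` invocation is being tested.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run the same command N times concurrently and
# print "<succeeded>/<launched>".
probe_concurrency() {
    local n=$1; shift
    local pids=() ok=0 pid i
    for ((i = 0; i < n; i++)); do
        # Silence the command so only the summary line is printed.
        "$@" >/dev/null 2>&1 &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        # wait returns the exit status of that specific background job.
        if wait "$pid"; then
            ok=$((ok + 1))
        fi
    done
    echo "$ok/$n"
}
```

Calling `probe_concurrency 10 <download command>` prints a single `succeeded/launched` line, which maps directly onto the cells of the table.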

For details, see EBI's [BioImage Archive download
guide](https://www.ebi.ac.uk/bioimage-archive/help-download/), which is where
the publicly distributed ed25519 SSH private key used by
`era-fasp@fasp.sra.ebi.ac.uk` is published.

## Setup on HPC

```bash
# In the cnvrock conda env:
conda install -y -c bioconda aspera-cli

# The bioconda package ships ascli (Ruby wrapper) but not ascp itself.
# Install ascp via:
ascli config ascp install
# Drops the binary at ~/.aspera/sdk/ascp

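# Save EBI's publicly distributed Aspera private key (one-time).
# ascp -i expects a private key, which is what the sed range below extracts.
curl -fsSL https://www.ebi.ac.uk/bioimage-archive/help-download/ \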
    | sed -n '/BEGIN OPENSSH PRIVATE KEY/,/END OPENSSH PRIVATE KEY/p' \
    > ~/.aspera/sdk/ebi_aspera_key.openssh
chmod 600 ~/.aspera/sdk/ebi_aspera_key.openssh
```
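
With the binary and key in place, a single file can be fetched as in the sketch below. It assumes the manifest URLs look like ENA's `fastq_ftp` or `fastq_aspera` fields (for example `ftp.sra.ebi.ac.uk/vol1/fastq/...`), which this page does not confirm. `to_fasp_source` and `fetch_fastq` are hypothetical helper names, and the `-l 300m` rate cap is an illustrative value.

```bash
#!/usr/bin/env bash
ASCP=${ASCP:-$HOME/.aspera/sdk/ascp}
ASPERA_KEY=${ASPERA_KEY:-$HOME/.aspera/sdk/ebi_aspera_key.openssh}

# Turn an ENA-style FASTQ URL into an ascp source spec.
# Assumed input forms: [ftp://]ftp.sra.ebi.ac.uk/vol1/... or
# fasp.sra.ebi.ac.uk:/vol1/...
to_fasp_source() {
    local url=$1
    url=${url#*://}                  # drop any scheme prefix
    url=${url#ftp.sra.ebi.ac.uk}     # drop the FTP host
    url=${url#fasp.sra.ebi.ac.uk}    # drop the Aspera host
    url=${url#:}
    url=${url#/}
    echo "era-fasp@fasp.sra.ebi.ac.uk:${url}"
}

# Download one file into a directory.
# -Q: fair transfer policy, -T: no encryption, -k 1: resume partial files,
# -l: maximum transfer rate, -P 33001: SSH port of the ENA Aspera endpoint.
fetch_fastq() {
    local url=$1 outdir=$2
    mkdir -p "$outdir"
    "$ASCP" -QT -k 1 -l 300m -P 33001 -i "$ASPERA_KEY" \
        "$(to_fasp_source "$url")" "$outdir/"
}
```

For example, `fetch_fastq "$r1_url" data/raw/fastq_subset` would place R1 in the FASTQ directory used by the pipeline.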

## Per-task pipeline

`hpc/aspera_subset_pipeline.sh` runs per SLURM array task:

1. Read one row of the **subset manifest** TSV (columns: `accession`, `layout`,
   `r1_url`, `r2_url`).
2. **`ascp` download** of R1 (and R2 if `layout` is PAIRED) to
   `data/raw/fastq_subset/`.
3. **BWA-MEM** paired-end alignment to `HS11286_extended.fasta`, then `samtools
   sort` and `samtools index`. Output goes to `data/raw/bam_subset/`.
4. **GATK CollectReadCounts** at 1 kb resolution across the chromosome and
   plasmid contigs, with `--minimum-mapping-quality 0`. The count TSV is
   written to `data/raw/readcounts_subset_mq0/{ACC}.counts.tsv`.
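
Row selection and alignment can be sketched as shell functions. This is a minimal illustration, not the contents of `hpc/aspera_subset_pipeline.sh`. It assumes that the manifest has a header line, that `BATCH_OFFSET + SLURM_ARRAY_TASK_ID` selects the row, and that the reference has already been indexed with `bwa index`. The names `manifest_row` and `align_sample` are hypothetical, and the read group is an added assumption so that downstream GATK tools can find a sample name.

```bash
#!/usr/bin/env bash
REF=${REF:-HS11286_extended.fasta}

# Step 1: print manifest row N (1-based; the assumed header line is skipped).
manifest_row() {
    local manifest=$1 n=$2
    awk -v n="$n" 'NR == n + 1 { print; exit }' "$manifest"
}

# Step 3: align, coordinate-sort, and index one sample.
# r2 may be empty for SINGLE-layout runs.
align_sample() {
    local acc=$1 r1=$2 r2=${3:-}
    local bam=data/raw/bam_subset/${acc}.bam
    mkdir -p "$(dirname "$bam")"
    # bwa converts the literal \t sequences in -R into tabs.
    bwa mem -t 4 -R "@RG\tID:${acc}\tSM:${acc}" "$REF" "$r1" ${r2:+"$r2"} \
        | samtools sort -@ 4 -o "$bam" -
    samtools index "$bam"
}

# Usage inside an array task (not executed here):
#   row=$(manifest_row "$MANIFEST" $((BATCH_OFFSET + SLURM_ARRAY_TASK_ID)))
#   IFS=$'\t' read -r acc layout r1_url r2_url <<< "$row"
#   align_sample "$acc" "data/raw/fastq_subset/${acc}_1.fastq.gz" \
#                       "data/raw/fastq_subset/${acc}_2.fastq.gz"
```

The FASTQ file names in the usage comment follow ENA's usual `_1`/`_2` suffixes and are likewise an assumption.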

```{important}
`--minimum-mapping-quality 0` (MQ=0) keeps **multi-mapped reads**, which is
essential for AMR detection: nearly identical allele variants (NDM-1 vs NDM-5,
OXA-48 vs OXA-181, CTX-M-15/14/65/27) live on different reference contigs, so
a read from one variant aligns almost equally well to several of them and is
assigned a low or zero mapping quality. See {doc}`09_methods` for the diagnosis.
```

FASTQs and BAMs are **kept on disk** so we can re-extract counts at different
MQ thresholds without re-downloading.
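
Because the BAMs stay on disk, counts at another threshold only need a second `CollectReadCounts` pass. The sketch below assumes output directories follow the `readcounts_subset_mq<N>` pattern suggested by the MQ=0 directory name, and that a 1 kb interval list is available (for instance from `gatk PreprocessIntervals --bin-length 1000 --padding 0`). The file name `bins_1kb.interval_list` and the helpers `counts_path` and `recount` are illustrative, not taken from the repository.

```bash
#!/usr/bin/env bash
BINS=${BINS:-bins_1kb.interval_list}   # hypothetical 1 kb interval list

# Output path for one accession at one MQ threshold (assumed naming scheme).
counts_path() {
    local acc=$1 mq=$2
    echo "data/raw/readcounts_subset_mq${mq}/${acc}.counts.tsv"
}

# Re-count an existing BAM at the given minimum mapping quality.
recount() {
    local acc=$1 mq=$2 out
    out=$(counts_path "$acc" "$mq")
    mkdir -p "$(dirname "$out")"
    gatk CollectReadCounts \
        -I "data/raw/bam_subset/${acc}.bam" \
        -L "$BINS" \
        --interval-merging-rule OVERLAPPING_ONLY \
        --minimum-mapping-quality "$mq" \
        --format TSV \
        -O "$out"
}
```

A loop such as `for mq in 0 20 30; do recount "$acc" "$mq"; done` would then produce one count table per threshold without touching the FASTQs.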

## SLURM submission

Each array is throttled to **10 concurrent tasks** (`%10`), the level at which
`ascp` succeeded 10/10 in the test above:

```bash
# Submit the 20K manifest in chunks of at most 4,999 array tasks. With
# MaxArraySize=5000 (LSHTM HPC) the highest valid array index is 4999, so
# the final row gets its own one-task chunk.
for chunk in 1 2 3 4 5; do
    case $chunk in
        1) OFFSET=0;     ARR=1-4999 ;;
        2) OFFSET=4999;  ARR=1-4999 ;;
        3) OFFSET=9998;  ARR=1-4999 ;;
        4) OFFSET=14997; ARR=1-4999 ;;
        5) OFFSET=19996; ARR=1-1 ;;
    esac
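    # %10 throttles each array separately, so chain the chunks to keep the
    # total at 10 concurrent tasks rather than 10 per chunk.
    PREV_JOB=$(sbatch --parsable \
        ${PREV_JOB:+--dependency=afterany:$PREV_JOB} \
        --export=ALL,MANIFEST=$REPO/assets/kpsc_expansion_subset_20k.tsv,BATCH_OFFSET=$OFFSET \
        --array=$ARR%10 hpc/aspera_subset_pipeline.sh)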
done
```
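
The hard-coded `case` table can also be generated from the manifest size. The following sketch is not part of the repository: `chunk_specs` is a hypothetical helper that prints one `OFFSET ARRAY_RANGE` pair per chunk, given the number of manifest rows and the largest array index the scheduler accepts.

```bash
#!/usr/bin/env bash
# Print "OFFSET 1-N" for each chunk needed to cover `total` rows when no
# array may use an index above `max`.
chunk_specs() {
    local total=$1 max=$2 offset=0 n
    while ((offset < total)); do
        n=$((total - offset))
        if ((n > max)); then
            n=$max
        fi
        echo "$offset 1-$n"
        offset=$((offset + n))
    done
}

# Usage (not executed here):
#   chunk_specs 20000 4999 | while read -r OFFSET ARR; do
#       echo "would submit --array=${ARR}%10 with BATCH_OFFSET=${OFFSET}"
#   done
```

Generating the ranges this way avoids off-by-one gaps between chunks when the manifest length changes.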

## Throughput

| Stage | Wall time per sample (avg) |
|---|---|
| `ascp` download (R1 + R2, ~300 MB total) | 30–60 s |
| BWA-MEM alignment (4 CPUs, ~5 M reads, 7.2 Mb ref) | 2–3 min |
| `samtools sort` + `index` | ~30 s |
| GATK CollectReadCounts | ~30 s |
| **Total per task** | **~3–5 min median** |

At concurrency 10, the pipeline sustains ~150–250 samples/hour. The 5K tier
took ~30 hours; the 10K and 20K tiers add ~25 and ~50 hours respectively.
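
These figures can be cross-checked with simple arithmetic: wall time in hours is roughly samples × minutes per task ÷ (concurrency × 60). The helper below is only a back-of-envelope sketch (the name `est_hours` is made up here), and it ignores queue wait and failed tasks.

```bash
#!/usr/bin/env bash
# Estimate wall-clock hours for a batch at a fixed concurrency.
est_hours() {
    local samples=$1 concurrency=$2 minutes_per_task=$3
    awk -v s="$samples" -v c="$concurrency" -v m="$minutes_per_task" \
        'BEGIN { printf "%.1f\n", s * m / (c * 60) }'
}
```

`est_hours 5000 10 3.6` gives `30.0`, so the ~30 hours reported for the 5K tier corresponds to about 3.6 min per task, which sits inside the 3–5 min range in the table.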

## Compliance with HPC network policy

An earlier burst of HTTPS downloads from a single HPC IP caused a brief outage.
We then coordinated with HPC support to limit concurrent connections, and
Aspera was explicitly endorsed as a low-risk alternative. The concurrency of 10
used here (`CONCUR=10`, the `%10` array throttle) is well within EBI's
published guidance.