10. Reproducibility

This page is the one-shot recipe to regenerate the manuscript from a fresh clone.

Software

Python 3.11
PyTorch 2.5.1 (CUDA 12.1)
pandas, numpy, scikit-learn, hmmlearn, tqdm, PyYAML
BWA 0.7.18, samtools 1.20, GATK 4.6.0.0 (HPC modules)
IBM Aspera SDK 4.4.7+ (ascp)

environment.yml at the repo root pins exact versions.

HPC environment

LSHTM compute cluster (Slurm), partition normal for CPU tasks and the GPU partition for training. Conda environment name: cnvrock.

# One-time setup (HPC):
conda env create -f environment.yml
conda activate cnvrock
conda install -y -c bioconda aspera-cli
ascli config ascp install   # installs ~/.aspera/sdk/ascp

# EBI Aspera SSH key (one-time):
curl -sL https://www.ebi.ac.uk/bioimage-archive/help-download/ \
    | sed -n '/BEGIN OPENSSH PRIVATE KEY/,/END OPENSSH PRIVATE KEY/p' \
    > ~/.aspera/sdk/ebi_aspera_key.openssh
chmod 600 ~/.aspera/sdk/ebi_aspera_key.openssh

End-to-end recipe

# ── 1. Subset manifests ───────────────────────────────────────────────────────
python3 data/setup/select_expansion_subset.py
# Writes assets/kpsc_expansion_subset_{5k,10k,20k,40k,80k}.tsv

# ── 2. Reference + 1 kb intervals ────────────────────────────────────────────
gatk CreateSequenceDictionary -R assets/HS11286_extended.fasta
gatk PreprocessIntervals \
    --reference assets/HS11286_extended.fasta \
    --bin-length 1000 \
    --padding 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output assets/HS11286_extended_1kb.interval_list

# ── 3. Ground truth files (BioSample → run accession bridge) ─────────────────
python3 - <<'EOF'
import pandas as pd
kleb = pd.read_csv('assets/kpsc_expansion_kleborate_gt.tsv',
                   sep='\t', dtype=str, low_memory=False)
meta = pd.read_csv('assets/kpsc_expansion_metadata.tsv', sep='\t', dtype=str)
meta['run_acc'] = meta['sample_id'].str.split(',').str[0]

bridged = kleb.merge(meta[['sample_accession', 'run_acc']],
                     left_on='strain', right_on='sample_accession',
                     how='inner')
bridged = bridged.drop(columns=['strain', 'sample_accession']) \
                 .rename(columns={'run_acc': 'sample_id'})
bridged.to_csv('assets/kpsc_expansion_kleborate_gt_runlevel.tsv',
               sep='\t', index=False)

run_meta = meta.copy()
run_meta['sample_id'] = run_meta['sample_id'].str.split(',')
run_meta = run_meta.explode('sample_id').reset_index(drop=True)
run_meta = run_meta.rename(columns={'species': 'Species'})
run_meta = run_meta.merge(
    kleb[['strain', 'ST']].rename(columns={'strain': 'sample_accession'}),
    on='sample_accession', how='left'
)
run_meta[['sample_id', 'Species', 'ST', 'sample_accession']].to_csv(
    'assets/kpsc_expansion_metadata_runlevel.tsv', sep='\t', index=False
)
EOF

# ── 4. Download + align + count (per tier; idempotent) ───────────────────────
# Each tier reuses count files from the previous tier — only run the largest.
for chunk in 1 2 3 4; do
    case $chunk in
        1) OFFSET=0;     ARR=1-4999 ;;
        2) OFFSET=4999;  ARR=1-4999 ;;
        3) OFFSET=9998;  ARR=1-4999 ;;
        4) OFFSET=14997; ARR=1-5000 ;;
    esac
    sbatch --parsable \
        --export=ALL,MANIFEST=$PWD/assets/kpsc_expansion_subset_20k.tsv,BATCH_OFFSET=$OFFSET \
        --array=$ARR%10 hpc/aspera_subset_pipeline.sh
done
# For 40k: scale to 8 chunks of ≤ 5000 each.

# ── 5. NPY stores (per tier) ─────────────────────────────────────────────────
for N in 5k 10k 20k 40k; do
    python3 data/setup/readcounts_to_npy_kpsc.py \
        --counts-dir data/raw/readcounts_subset_mq0 \
        --manifest   assets/kpsc_expansion_subset_${N}.tsv \
        --out-dir    data/inputs/KpSC-expansion-${N}-mq0-1000bp-npy \
        --keep-contigs NC_016845.1

    python3 data/setup/plasmid_genes_to_npy_kpsc.py \
        --counts-dir  data/raw/readcounts_subset_mq0 \
        --manifest    assets/kpsc_expansion_subset_${N}.tsv \
        --gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \
        --families    assets/plasmid_refs/plasmid_gene_families.tsv \
        --out-dir     data/inputs/KpSC-expansion-${N}-mq0-plasmid-1000bp-npy
done

# ── 6. Train + evaluate ──────────────────────────────────────────────────────
sbatch hpc/train_gpu.sh experiments/32/config.yaml   # 5K
sbatch hpc/train_gpu.sh experiments/33/config.yaml   # 10K
sbatch hpc/train_gpu.sh experiments/34/config.yaml   # 20K
sbatch hpc/train_gpu.sh experiments/36/config.yaml   # 40K

Commit hashes

Each scaling-study experiment’s evaluation.txt is committed at a known hash. To exactly reproduce a result, check out the repo at that hash and run the recipe.

Experiment	Final commit	Date
exp 32 (5K, MQ=20)	(pending)	(pending)
exp 33 (10K, MQ=20)	(pending)	(pending)
exp 34 (20K, MQ=20)	(pending)	(pending)
exp 36 (40K, MQ=20)	(pending)	(pending)

Random seeds

Subset selection: numpy.random.default_rng(42) → identical manifests every time.
VAE training: PyTorch default (non-deterministic on GPU). Run-to-run variability is small; for the manuscript we report a single run per tier and discuss variability in the supplementary.

Storage

Per 20K samples:

Asset	Size (≈)
FASTQs (paired, gzipped)	6–10 TB
BAMs (sorted + indexed)	18–24 TB
Count TSVs (7,365 bins)	1.5 GB
NPY stores (chrom + plasmid)	1.5 GB
Trained checkpoints + outputs	200 MB / exp

FASTQs and BAMs are kept on the HPC scratch until the 40K experiment is complete; then archived or deleted depending on storage pressure. Count files and NPY stores are kept indefinitely.