# 10. Reproducibility This page is the **one-shot recipe** to regenerate the manuscript from a fresh clone. ## Software - Python 3.11 - PyTorch 2.5.1 (CUDA 12.1) - pandas, numpy, scikit-learn, hmmlearn, tqdm, PyYAML - BWA 0.7.18, samtools 1.20, GATK 4.6.0.0 (HPC modules) - IBM Aspera SDK 4.4.7+ (`ascp`) `environment.yml` at the repo root pins exact versions. ## HPC environment LSHTM compute cluster (Slurm), partition `normal` for CPU tasks and the GPU partition for training. Conda environment name: `cnvrock`. ```bash # One-time setup (HPC): conda env create -f environment.yml conda activate cnvrock conda install -y -c bioconda aspera-cli ascli config ascp install # installs ~/.aspera/sdk/ascp # EBI Aspera SSH key (one-time): curl -sL https://www.ebi.ac.uk/bioimage-archive/help-download/ \ | sed -n '/BEGIN OPENSSH PRIVATE KEY/,/END OPENSSH PRIVATE KEY/p' \ > ~/.aspera/sdk/ebi_aspera_key.openssh chmod 600 ~/.aspera/sdk/ebi_aspera_key.openssh ``` ## End-to-end recipe ```bash # ── 1. Subset manifests ─────────────────────────────────────────────────────── python3 data/setup/select_expansion_subset.py # Writes assets/kpsc_expansion_subset_{5k,10k,20k,40k,80k}.tsv # ── 2. Reference + 1 kb intervals ──────────────────────────────────────────── gatk CreateSequenceDictionary -R assets/HS11286_extended.fasta gatk PreprocessIntervals \ --reference assets/HS11286_extended.fasta \ --bin-length 1000 \ --padding 0 \ --interval-merging-rule OVERLAPPING_ONLY \ --output assets/HS11286_extended_1kb.interval_list # ── 3. Ground truth files (BioSample → run accession bridge) ───────────────── python3 - <<'EOF' import pandas as pd kleb = pd.read_csv('assets/kpsc_expansion_kleborate_gt.tsv', sep='\t', dtype=str, low_memory=False) meta = pd.read_csv('assets/kpsc_expansion_metadata.tsv', sep='\t', dtype=str) meta['run_acc'] = meta['sample_id'].str.split(',').str[0] bridged = kleb.merge(meta[['sample_accession', 'run_acc']], left_on='strain', right_on='sample_accession', how='inner') bridged = bridged.drop(columns=['strain', 'sample_accession']) \ .rename(columns={'run_acc': 'sample_id'}) bridged.to_csv('assets/kpsc_expansion_kleborate_gt_runlevel.tsv', sep='\t', index=False) run_meta = meta.copy() run_meta['sample_id'] = run_meta['sample_id'].str.split(',') run_meta = run_meta.explode('sample_id').reset_index(drop=True) run_meta = run_meta.rename(columns={'species': 'Species'}) run_meta = run_meta.merge( kleb[['strain', 'ST']].rename(columns={'strain': 'sample_accession'}), on='sample_accession', how='left' ) run_meta[['sample_id', 'Species', 'ST', 'sample_accession']].to_csv( 'assets/kpsc_expansion_metadata_runlevel.tsv', sep='\t', index=False ) EOF # ── 4. Download + align + count (per tier; idempotent) ─────────────────────── # Each tier reuses count files from the previous tier — only run the largest. for chunk in 1 2 3 4; do case $chunk in 1) OFFSET=0; ARR=1-4999 ;; 2) OFFSET=4999; ARR=1-4999 ;; 3) OFFSET=9998; ARR=1-4999 ;; 4) OFFSET=14997; ARR=1-5000 ;; esac sbatch --parsable \ --export=ALL,MANIFEST=$PWD/assets/kpsc_expansion_subset_20k.tsv,BATCH_OFFSET=$OFFSET \ --array=$ARR%10 hpc/aspera_subset_pipeline.sh done # For 40k: scale to 8 chunks of ≤ 5000 each. # ── 5. NPY stores (per tier) ───────────────────────────────────────────────── for N in 5k 10k 20k 40k; do python3 data/setup/readcounts_to_npy_kpsc.py \ --counts-dir data/raw/readcounts_subset_mq0 \ --manifest assets/kpsc_expansion_subset_${N}.tsv \ --out-dir data/inputs/KpSC-expansion-${N}-mq0-1000bp-npy \ --keep-contigs NC_016845.1 python3 data/setup/plasmid_genes_to_npy_kpsc.py \ --counts-dir data/raw/readcounts_subset_mq0 \ --manifest assets/kpsc_expansion_subset_${N}.tsv \ --gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \ --families assets/plasmid_refs/plasmid_gene_families.tsv \ --out-dir data/inputs/KpSC-expansion-${N}-mq0-plasmid-1000bp-npy done # ── 6. Train + evaluate ────────────────────────────────────────────────────── sbatch hpc/train_gpu.sh experiments/32/config.yaml # 5K sbatch hpc/train_gpu.sh experiments/33/config.yaml # 10K sbatch hpc/train_gpu.sh experiments/34/config.yaml # 20K sbatch hpc/train_gpu.sh experiments/36/config.yaml # 40K ``` ## Commit hashes Each scaling-study experiment's `evaluation.txt` is committed at a known hash. To exactly reproduce a result, check out the repo at that hash and run the recipe. | Experiment | Final commit | Date | |---|---|---| | exp 32 (5K, MQ=20) | (pending) | (pending) | | exp 33 (10K, MQ=20) | (pending) | (pending) | | exp 34 (20K, MQ=20) | (pending) | (pending) | | exp 36 (40K, MQ=20) | (pending) | (pending) | ## Random seeds - Subset selection: `numpy.random.default_rng(42)` → identical manifests every time. - VAE training: PyTorch default (non-deterministic on GPU). Run-to-run variability is small; for the manuscript we report a single run per tier and discuss variability in the supplementary. ## Storage Per 20K samples: | Asset | Size (≈) | |---|---| | FASTQs (paired, gzipped) | 6–10 TB | | BAMs (sorted + indexed) | 18–24 TB | | Count TSVs (7,365 bins) | 1.5 GB | | NPY stores (chrom + plasmid) | 1.5 GB | | Trained checkpoints + outputs | 200 MB / exp | FASTQs and BAMs are kept on the HPC scratch until the 40K experiment is complete; then archived or deleted depending on storage pressure. Count files and NPY stores are kept indefinitely.