# 6. Training

CNVRock training has **one entry point**: `models/train.py`. It loads a YAML
config, builds the dataset, trains the VAE, runs inference and HMM
segmentation, calls per-gene CNVs (chromosomal and plasmid), and writes the
evaluation outputs.

## Entry point

```bash
python models/train.py models/experiments/32/config.yaml
```

## SLURM wrapper

`hpc/train_gpu.sh` requests a GPU node, activates the conda env, and `cd`s
into `models/` before invoking `train.py`. Because of that `cd`, the config
path is given relative to `models/`. Submit with:

```bash
sbatch hpc/train_gpu.sh experiments/32/config.yaml
```

Runtime on an A40: **~4 min for 5K samples, ~6 min for 10K samples**
(at the 5K tier: 150 epochs × ~40 batches × 128 samples/batch; the 10K tier
has roughly twice as many batches per epoch).
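The batch counts behind those figures follow directly from the tier size and `batch_size`. The sketch below works through the arithmetic, assuming the whole tier is used for training and the final partial batch is kept; the function name `optimizer_steps` is illustrative and not part of `train.py`. Early stopping (`patience: 20`) can end a run before all 150 epochs, so the totals are upper bounds.

```python
import math


def optimizer_steps(n_samples: int, batch_size: int = 128, epochs: int = 150) -> tuple[int, int]:
    """Return (batches per epoch, total steps), keeping the last partial batch."""
    batches = math.ceil(n_samples / batch_size)
    return batches, batches * epochs


for n in (5_000, 10_000):
    batches, steps = optimizer_steps(n)
    print(f"{n} samples: {batches} batches/epoch, at most {steps} steps")
```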

## Config schema

Every experiment lives at `models/experiments/{N}/config.yaml`. The configs
for experiments 32–36 share **identical architecture, HMM, CNV-caller, and
threshold parameters**; only `store_path`, `plasmid_store_path`, and
`out_dir` vary across the scaling tiers.

```yaml
# Modules (resolved via importlib at runtime)
architecture: "06_conv_vae"
hmm:          "02_gaussian_hmm"
cnv:          "06_gene_cnv_caller"
evaluation:   "04_kpsc_evaluation"

# Data (varies per tier)
store_path:           "../../../data/inputs/KpSC-expansion-5k-mq20-1000bp-npy"
plasmid_store_path:   "../../../data/inputs/KpSC-expansion-5k-mq20-plasmid-1000bp-npy"
out_dir:              "../../../data/results/32_kpsc_expansion_5k"

# Ground truth
kpsc_gt_path:           "../../../assets/amrfinder_gt_expansion.tsv"
kpsc_kleborate_gt_path: "../../../assets/kpsc_expansion_kleborate_gt_runlevel.tsv"
kpsc_meta_path:         "../../../assets/kpsc_expansion_metadata_runlevel.tsv"

# Plasmid genes
plasmid_gene_coords_path: "../../../assets/plasmid_refs/plasmid_gene_coords.tsv"
pcn_absent_threshold:     0.20
pcn_amp_threshold:        1.50

# VAE
latent_dim:    10
batch_size:    128
epochs:        150
lr:            1.0e-4
weight_decay:  1.0e-5
max_beta:      1.0
warmup_epochs: 20
patience:      20

# HMM
hmm_n_states:          6
hmm_self_transition:   0.80
hmm_low_cov_threshold: 10

# Chromosomal CNV caller
cnv_min_cn1_proportion:         0.55
cnv_min_confidence:             0.50
cnv_flank_padding:              100000
cnv_crr_amp_threshold:          1.75
cnv_crr_gate_threshold:         1.75
cnv_crr_min_bins_fallback:      3
cnv_min_gene_coverage_fraction: 0.50

# Evaluation
eval_min_group_n: 10
```
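The module names in the first config block start with a digit, so they are not valid Python identifiers and cannot appear in an ordinary `import` statement; they have to be resolved dynamically. The following is a minimal sketch of one way to do that, loading a module by file path with `importlib.util`. The helper name `load_component` and its `models_dir` argument are hypothetical, and `train.py` may resolve modules differently (for example via `importlib.import_module`).

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_component(models_dir: Path, package: str, name: str) -> ModuleType:
    """Load <models_dir>/<package>/<name>.py and return it as a module object."""
    path = models_dir / package / f"{name}.py"
    spec = importlib.util.spec_from_file_location(f"{package}.{name}", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

With a parsed config `cfg`, a call such as `load_component(Path("models"), "architecture", cfg["architecture"])` would return the module defined in `models/architecture/06_conv_vae.py`.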

## Architecture: 1D Conv-VAE

Defined in `models/architecture/06_conv_vae.py`. The encoder passes each
sample's 5,334-bin vector through three 1D convolutions and a dense head,
producing a 10-dimensional latent (`latent_dim`). The decoder mirrors the
encoder, using transposed convolutions to map back to bin space.

Training minimises a weighted ELBO with **β-warmup** (β=0 → β=1 over the
first 20 epochs, set by `warmup_epochs` and `max_beta`) plus a **CNV-pattern
alignment auxiliary loss** at weight 1.0 (warmup 30 epochs). The auxiliary
loss pulls the latent toward biologically meaningful structure, which
prevents the VAE from collapsing to a representation that encodes only
global depth.
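The two warmup schedules can be sketched as follows. This assumes a linear ramp and 0-indexed epochs, neither of which is stated above; `linear_warmup` and `total_loss` are illustrative names, and the loss terms are plain floats standing in for the reconstruction, KL, and auxiliary losses.

```python
def linear_warmup(epoch: int, warmup_epochs: int, max_value: float = 1.0) -> float:
    """Weight that ramps linearly from 0 to max_value over warmup_epochs."""
    if warmup_epochs <= 0:
        return max_value
    return max_value * min(1.0, epoch / warmup_epochs)


def total_loss(recon: float, kl: float, aux: float, epoch: int) -> float:
    """Weighted ELBO plus auxiliary term, each with its own warmup."""
    beta = linear_warmup(epoch, warmup_epochs=20, max_value=1.0)   # KL weight
    aux_w = linear_warmup(epoch, warmup_epochs=30, max_value=1.0)  # auxiliary weight
    return recon + beta * kl + aux_w * aux
```

Under these assumptions the KL term is switched off at epoch 0 and reaches full weight at epoch 20, while the auxiliary term reaches full weight at epoch 30.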

## Segmenter: Gaussian HMM

Defined in `models/hmm/02_gaussian_hmm.py`. The reconstructions produced at
inference are re-normalised per sample and segmented with a **6-state
Gaussian HMM** (state means initialised at CN ∈ {0, 0.5, 1, 1.5, 2, 3},
self-transition 0.8). Low-coverage bins (<10 reads, `hmm_low_cov_threshold`)
are masked and re-imputed from neighbours.
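To illustrate how a high self-transition probability smooths the segmentation, here is a small pure-Python Viterbi decoder over the six CN states. It is a sketch, not the code in `02_gaussian_hmm.py`: the uniform off-diagonal transitions, the uniform start distribution, the fixed emission standard deviation `sigma`, and all function names are assumptions, and re-normalisation and low-coverage masking are omitted.

```python
import math

CN_STATES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]  # state means from the text
SELF_TRANSITION = 0.80                      # hmm_self_transition


def transition_matrix(n_states: int, self_p: float) -> list[list[float]]:
    """self_p on the diagonal; the remainder spread uniformly (assumption)."""
    off = (1.0 - self_p) / (n_states - 1)
    return [[self_p if i == j else off for j in range(n_states)]
            for i in range(n_states)]


def log_gauss(x: float, mean: float, sigma: float) -> float:
    """Log density of a Gaussian emission."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def viterbi(obs: list[float], sigma: float = 0.15) -> list[float]:
    """Most likely CN per bin for a normalised depth profile."""
    if not obs:
        return []
    n = len(CN_STATES)
    log_a = [[math.log(p) for p in row]
             for row in transition_matrix(n, SELF_TRANSITION)]
    score = [-math.log(n) + log_gauss(obs[0], m, sigma) for m in CN_STATES]
    back: list[list[int]] = []
    for x in obs[1:]:
        prev = score
        # Best predecessor for each destination state j.
        ptr = [max(range(n), key=lambda i: prev[i] + log_a[i][j]) for j in range(n)]
        score = [prev[ptr[j]] + log_a[ptr[j]][j] + log_gauss(x, CN_STATES[j], sigma)
                 for j in range(n)]
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [CN_STATES[s] for s in path]
```

With these settings a sustained shift in depth is called as a new segment, whereas a single mildly elevated bin is absorbed into the surrounding CN 1 segment because two state changes cost more than one poorly fitting emission.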

## Output artefacts

After `train.py` completes, `out_dir/` contains:

```
checkpoint.pth                model + optimiser state at best epoch
training_log.tsv              per-epoch loss curves
reconstructions.npy           (n_samples, n_bins) imputed depth
latents.npy                   (n_samples, 10)
segments.parquet              per-sample CN segments
gene_calls.tsv                per-sample chromosomal-gene CN calls
plasmid_gene_calls.tsv        per-sample plasmid-gene CN calls
evaluation.txt                MCC/FNR/PPV by gene and by ST
```

`evaluation.txt` is the headline artefact for the manuscript; see
{doc}`07_evaluation`.
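A quick way to confirm that a run finished cleanly is to check `out_dir/` against the artefact list. The helper below is a hypothetical convenience, not part of the repository; it uses only the file names listed in this section.

```python
from pathlib import Path

EXPECTED_ARTEFACTS = (
    "checkpoint.pth",
    "training_log.tsv",
    "reconstructions.npy",
    "latents.npy",
    "segments.parquet",
    "gene_calls.tsv",
    "plasmid_gene_calls.tsv",
    "evaluation.txt",
)


def missing_artefacts(out_dir: Path) -> list[str]:
    """Return the expected output files that are absent from out_dir."""
    return [name for name in EXPECTED_ARTEFACTS if not (out_dir / name).is_file()]
```

An empty list from `missing_artefacts(Path(cfg["out_dir"]))` would indicate that every listed file was written, assuming `cfg` is the parsed config and the path is resolved from the right working directory.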