guides

tips, GPU training best practices, and troubleshooting — all verified on Rivanna.

tips & gotchas

argument ordering

rv options must come before the command. anything after is passed through. rv warns if it detects misplaced flags.

rv run -g 4 -t a100 python train.py ✓

rv run python train.py -g 4 -t a100 ✗
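
when launching jobs from a script, the same ordering rule can be enforced by building the argument list in one place. a minimal sketch, assuming only the -g and -t flags shown above; build_rv_run is a hypothetical helper, not part of rv:

```python
import shlex


def build_rv_run(command, gpus=None, gpu_type=None):
    """Return an `rv run` argv with rv options placed before the command.

    Hypothetical helper: only the -g and -t flags from this guide are used.
    """
    argv = ["rv", "run"]
    if gpus is not None:
        argv += ["-g", str(gpus)]
    if gpu_type is not None:
        argv += ["-t", gpu_type]
    # Everything appended from here on belongs to the job, not to rv.
    return argv + list(command)


argv = build_rv_run(["python", "train.py", "--lr", "3e-4"], gpus=4, gpu_type="a100")
print(shlex.join(argv))  # rv run -g 4 -t a100 python train.py --lr 3e-4
```

the script's own flags (here --lr, an illustrative name) stay after the command, so they are passed through untouched.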

file sync

rv run uploads your current directory. only git-tracked files sync. each job gets an immutable snapshot. use .rvignore to exclude extra files.

save outputs to persistent storage

your job runs inside an ephemeral snapshot (pruned after 7 days). use RV_OUTPUT_DIR (set automatically), the --output flag, or write to /scratch/ directly.

output buffering

rv auto-sets PYTHONUNBUFFERED=1. if you still see no output, check rv logs --err — the job may have crashed.

rv exec is login-node only

no GPU access. use for file checks and queries. for GPU utilization, use rv gpu.

backfill scheduling

jobs under 3 hours qualify for backfill — often near-instant allocation. the default walltime (2:59:00) is set just below this threshold.
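
the threshold check is simple arithmetic. a rough sketch, assuming walltimes written as H:MM:SS like the default above; other formats (such as 6h or a days prefix) are not handled, and both function names are illustrative:

```python
BACKFILL_LIMIT_SECONDS = 3 * 3600  # the 3-hour threshold quoted above


def walltime_seconds(walltime):
    """Parse an H:MM:SS (or MM:SS) string into seconds."""
    seconds = 0
    for part in walltime.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def qualifies_for_backfill(walltime):
    """True when the walltime is strictly below the backfill threshold."""
    return walltime_seconds(walltime) < BACKFILL_LIMIT_SECONDS


print(qualifies_for_backfill("2:59:00"))  # True
print(qualifies_for_backfill("3:00:00"))  # False
```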

use MIG as a pre-flight check

validate your full pipeline on a free MIG slice before requesting expensive GPUs. MIG has 10GB VRAM — enough to catch import errors, config bugs, and path issues:

rv run --mig python train.py                  # free, instant: catches 90% of bugs
rv run -t a100 --time 6h python train.py      # submit the real run after validation

queue times

GPU   | typical wait | SU/GPU-hr | VRAM
mig   | instant      | FREE      | 10 GB
v100  | ~3 days      | 21        | 32 GB
a6000 | ~18 hours    | 143       | 48 GB
a100  | ~10 hours    | 509       | 80 GB
h200  | varies       | 817       | 141 GB

check real-time availability with rv status.
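
the SU/GPU-hr column can be turned into a quick budget estimate before submitting. a sketch under the assumption that charges scale linearly with GPU count and hours; the actual billing rules may differ, and estimate_su is an illustrative name:

```python
# SU per GPU-hour, copied from the queue-times table above.
SU_PER_GPU_HOUR = {"mig": 0, "v100": 21, "a6000": 143, "a100": 509, "h200": 817}


def estimate_su(gpu_type, gpus, hours):
    """Estimate service units, assuming cost = rate x GPUs x hours."""
    return SU_PER_GPU_HOUR[gpu_type] * gpus * hours


print(estimate_su("a100", 4, 6))  # 12216
print(estimate_su("mig", 1, 3))   # 0
```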

system memory (--mem). rv auto-calculates --mem based on your GPU count and node specs. override with --mem 200G if you need more (e.g., large dataset loading, preprocessing). rule of thumb: 2-3x your total VRAM is safe for most training workloads.
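
the 2-3x rule of thumb works out as follows. a small sketch using the VRAM figures from the queue-times table; suggested_mem_gb is a hypothetical helper, and the result still has to fit within what the node actually offers:

```python
# VRAM per GPU in GB, copied from the queue-times table above.
VRAM_GB = {"mig": 10, "v100": 32, "a6000": 48, "a100": 80, "h200": 141}


def suggested_mem_gb(gpu_type, gpus):
    """Return the (low, high) system-memory range: 2x to 3x total VRAM."""
    total_vram = VRAM_GB[gpu_type] * gpus
    return 2 * total_vram, 3 * total_vram


low, high = suggested_mem_gb("a6000", 2)
print(f"--mem between {low}G and {high}G")  # --mem between 192G and 288G
```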

training overview

single GPU

rv run --mig python train.py             # free MIG slice
rv run -g 1 -t a6000 python train.py     # dedicated GPU

multi-GPU (DDP/FSDP)

rv run -g 2 -t a6000 -- torchrun --nproc_per_node=2 train.py

multi-node

rv run -g 4 -t a100 -- torchrun --nproc_per_node=2 train.py

rv handles srun + torchrun coordination automatically.

BF16 on A100/H200 (compute capability >= 8). FP16 + GradScaler on older GPUs.
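
the precision rule can be written as a small decision function. a sketch only: it takes the (major, minor) compute capability as a plain tuple so it runs without a GPU, and pick_autocast_dtype is an illustrative name:

```python
def pick_autocast_dtype(compute_capability):
    """Return (dtype_name, needs_grad_scaler) for a (major, minor) capability."""
    major, _minor = compute_capability
    if major >= 8:
        # Ampere and newer: BF16, no loss scaling required.
        return "bfloat16", False
    # Older GPUs: FP16 with a GradScaler to avoid gradient underflow.
    return "float16", True


print(pick_autocast_dtype((8, 0)))  # ('bfloat16', False)  e.g. A100
print(pick_autocast_dtype((7, 0)))  # ('float16', True)    e.g. V100
```

inside a job, the tuple would typically come from torch.cuda.get_device_capability().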

running inference

for large model inference (not training), use device_map="auto" to shard across GPUs on a single node.

rv run -g 4 -t a100 --single-node python generate.py

--single-node prevents multi-node strategies. device_map="auto" only shards within one node — multi-node would duplicate your workload. rv auto-detects inference scripts and skips multi-node automatically.

saving results

use RV_OUTPUT_DIR (set automatically in every job) or --output:

import os
output_dir = os.environ.get("RV_OUTPUT_DIR", "./results")
os.makedirs(output_dir, exist_ok=True)

rv run -g 4 --output ./results python generate.py
memory estimation: model_params x bytes_per_param x 1.1 (the extra 10% covers runtime overhead). a 70B bf16 model needs ~154 GB VRAM (70B x 2 bytes x 1.1).
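
the estimate is easy to script when choosing a GPU type. a minimal sketch of the formula above; the bytes-per-parameter table and both function names are assumptions for illustration, and parameter counts in billions give results directly in GB:

```python
import math

# Bytes per parameter for common dtypes (illustrative table).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def inference_vram_gb(params_billion, dtype, overhead=1.1):
    """model_params x bytes_per_param x overhead, in GB."""
    return params_billion * BYTES_PER_PARAM[dtype] * overhead


def gpus_needed(required_gb, vram_per_gpu_gb):
    """Smallest GPU count whose combined VRAM covers the requirement."""
    return math.ceil(required_gb / vram_per_gpu_gb)


need = inference_vram_gb(34, "bf16")
print(f"{need:.1f} GB")       # 74.8 GB
print(gpus_needed(need, 80))  # 1  (a100)
print(gpus_needed(need, 48))  # 2  (a6000)
```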

process groups

NCCL for GPU tensors, Gloo for CPU tensors. wrong backend = silent hang.

import torch.distributed as dist

dist.init_process_group("nccl")  # GPU default

# CPU collectives (strings, dicts, metadata):
cpu_group = dist.new_group(backend="gloo")
output_list = [None] * dist.get_world_size()  # one slot per rank
dist.all_gather_object(output_list, my_dict, group=cpu_group)

DDP

model = DDP(model, device_ids=[local_rank])
  • never call model.module.forward() directly — bypasses gradient sync
  • find_unused_parameters=True for conditional/multi-head models
  • save with model.module.state_dict() (unwrap DDP)
  • sampler.set_epoch(epoch) in every epoch for correct shuffling
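
why set_epoch matters: a distributed sampler derives its shuffle from seed + epoch, so without the call every epoch replays the same order. the sketch below is a simplified pure-Python stand-in for that idea, not PyTorch's implementation; shard_indices is an illustrative name:

```python
import random


def shard_indices(dataset_len, rank, world_size, epoch, seed=0):
    """Simplified stand-in for a distributed sampler's per-epoch shuffle."""
    indices = list(range(dataset_len))
    # Every rank seeds with (seed + epoch), so all ranks agree on one order.
    random.Random(seed + epoch).shuffle(indices)
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank::world_size]


for epoch in range(2):
    shards = [shard_indices(8, rank, 4, epoch) for rank in range(4)]
    print(f"epoch {epoch}: {shards}")
```

within one epoch the shards are disjoint and cover the dataset; passing a new epoch value is what produces a new permutation.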

FSDP

strategy      | what's sharded             | memory                | speed
FULL_SHARD    | params + grads + optimizer | lowest (63% savings)  | slowest
SHARD_GRAD_OP | grads + optimizer          | medium                | medium
NO_SHARD      | nothing (same as DDP)      | highest               | fastest

wrapping policy

import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# NEVER use always_wrap_policy
auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)

mixed precision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,  # gradient reduction in fp32
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp_policy)
CPU offload saves ~29% GPU memory but is 26x slower. last resort only.

mixed precision

# BF16 (A100/H200): no GradScaler needed
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(input), target)
loss.backward()
optimizer.step()

# FP16 (older GPUs): needs GradScaler
scaler = torch.amp.GradScaler("cuda")
with torch.amp.autocast("cuda", dtype=torch.float16):
    loss = loss_fn(model(input), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

checkpointing

RV_CHECKPOINT_DIR is set automatically in every job, keyed by job name (not job ID). jobs submitted with the same --name share the same checkpoint directory, so resuming across runs works automatically.

import os
ckpt_dir = os.environ.get("RV_CHECKPOINT_DIR", "./checkpoints")
path = f"{ckpt_dir}/latest.pt"

# save (rank 0 only)
checkpoint = {
    'model': model.module.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
    'rng_cpu': torch.random.get_rng_state(),
    'rng_cuda': torch.cuda.get_rng_state(device),
}
if rank == 0:
    torch.save(checkpoint, path)
dist.barrier()

# load: MUST use map_location='cpu' and weights_only=False
ckpt = torch.load(path, map_location='cpu', weights_only=False)
  • weights_only=True fails on optimizer/RNG states
  • map_location='cuda:0' breaks RNG restore
  • always save model.module.state_dict() (unwrapped)
  • skipping optimizer state means Adam forgets momentum after resume
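
since jobs with the same --name share RV_CHECKPOINT_DIR, a training script can decide at startup whether to resume. a minimal sketch, assuming the latest.pt filename used above; resume_point is a hypothetical helper:

```python
import os


def resume_point(default_dir="./checkpoints"):
    """Return (path, exists) for the shared latest.pt checkpoint."""
    ckpt_dir = os.environ.get("RV_CHECKPOINT_DIR", default_dir)
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, "latest.pt")
    # A missing file means this is the first run under this job name.
    return path, os.path.isfile(path)
```

a script would call resume_point() once, load the checkpoint when the second value is True, and continue from the saved epoch + 1; otherwise it starts at epoch 0.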

RLHF & GRPO

# GRPO: generate G completions, normalize rewards within group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -(log_probs * advantages).mean() + kl_coef * kl
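  • reference model must be frozen: ref_model.eval() and ref_model.requires_grad_(False) (eval() alone only switches dropout/batch-norm to inference mode)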
  • multi-GPU: sync rewards and advantages across ranks
  • OpenRLHF needs Gloo for CPU reward aggregation
  • memory: 4x model size minimum (actor + critic + reward + ref)
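
the group normalization in the GRPO snippet can be checked by hand on a tiny example. a pure-Python sketch of the same formula for one prompt's G completions; it uses the sample standard deviation (n-1), which matches the default of torch.Tensor.std(), and group_advantages is an illustrative name:

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std; needs at least 2 rewards
    return [(r - mean) / (std + eps) for r in rewards]


print(group_advantages([1.0, 0.0, 0.0, 1.0]))
print(group_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

the second call shows why the epsilon is there: when every completion gets the same reward the std is zero, and the advantages collapse to zero instead of dividing by zero.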

debugging

hangs

  • mismatched collectives across ranks → deadlock
  • missing dist.barrier() on some ranks
  • NCCL timeout — increase with NCCL_TIMEOUT=1800
  • data loader length mismatch — one rank finishes early
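
the data loader mismatch is easy to check offline. a sketch assuming naive strided sharding with no padding (each rank takes every world_size-th sample); samplers that pad to equal length avoid the problem, and batches_per_rank is an illustrative name:

```python
import math


def batches_per_rank(dataset_len, world_size, batch_size, drop_last=False):
    """Count batches each rank sees under unpadded strided sharding."""
    counts = []
    for rank in range(world_size):
        samples = len(range(rank, dataset_len, world_size))
        if drop_last:
            counts.append(samples // batch_size)
        else:
            counts.append(math.ceil(samples / batch_size))
    return counts


print(batches_per_rank(10, 4, 1))  # [3, 3, 2, 2]  ranks 2 and 3 finish early
print(batches_per_rank(12, 4, 1))  # [3, 3, 3, 3]  safe
```

any run where the counts differ will leave the longer ranks waiting in a collective that the shorter ranks never join.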

OOM

  • gradient accumulation → smaller micro-batch
  • mixed precision → halves activation memory
  • activation checkpointing → trades compute for memory
  • FSDP FULL_SHARD → 63% memory savings
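
for the gradient accumulation fix, the effective batch size is micro-batch x accumulation steps x number of GPUs. a small sketch for picking the step count after shrinking the micro-batch; accumulation_steps is an illustrative name:

```python
def accumulation_steps(target_global_batch, micro_batch, world_size):
    """Steps needed so micro_batch x steps x world_size hits the target."""
    per_step = micro_batch * world_size
    if target_global_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of micro_batch x world_size")
    return target_global_batch // per_step


print(accumulation_steps(256, 8, 4))  # 8
print(accumulation_steps(256, 2, 4))  # 32
```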

troubleshooting

job failed — where to start?

check stderr first — it's almost always where the real error is:

rv logs <jobId> --err   # stderr (errors, tracebacks)
rv logs <jobId>         # stdout (training output)

job stuck in PENDING

check rv status, try a different GPU type, use --mig for instant free allocation, or reduce --time below 3h for backfill.

no output files

check rv logs --err for errors. use RV_OUTPUT_DIR (set automatically), --output flag, or write to /scratch/. don't use /tmp/ (node-local). the job CWD is the snapshot, not your live code.

can't find results

rv exec "ls /scratch/USER/.rv/logs/"        # log files
rv exec "ls /scratch/USER/rv-workspaces/"   # synced code
rv sync pull /remote/path ./local/          # download