guides
tips, GPU training best practices, and troubleshooting — all verified on Rivanna.
tips & gotchas
argument ordering
rv options must come before the command. anything after is passed through. rv warns if it detects misplaced flags.
rv run -g 4 -t a100 python train.py ✓
rv run python train.py -g 4 -t a100 ✗
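to make the ordering rule concrete, here is a minimal sketch of a splitter that treats everything from the first non-option token onward as the command. this is not rv's actual parser: `split_args`, `VALUE_FLAGS`, and `BOOL_FLAGS` are hypothetical names, and the flag sets only cover options mentioned in this guide.

```python
# Hypothetical sketch of "options first, then command" parsing (not rv's real parser).
VALUE_FLAGS = {"-g", "-t", "--time", "--mem", "--output", "--name"}  # flags that take a value
BOOL_FLAGS = {"--mig", "--single-node"}                              # flags without a value


def split_args(argv):
    """Split argv into (options, command, misplaced_flags)."""
    options = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if token == "--":            # explicit separator: the rest is the command
            i += 1
            break
        if token in VALUE_FLAGS and i + 1 < len(argv):
            options[token] = argv[i + 1]
            i += 2
        elif token in BOOL_FLAGS:
            options[token] = True
            i += 1
        else:                        # first non-option token starts the command
            break
    command = argv[i:]
    # Known flags that show up inside the command were probably meant for the tool.
    misplaced = [t for t in command if t in VALUE_FLAGS | BOOL_FLAGS]
    return options, command, misplaced


good = split_args(["-g", "4", "-t", "a100", "python", "train.py"])
bad = split_args(["python", "train.py", "-g", "4", "-t", "a100"])
print(good)
print(bad)
```

in the second call the options dict stays empty and `-g`/`-t` end up inside the command, which is the situation a misplaced-flag warning is meant to catch.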
file sync
rv run uploads your current directory. only git-tracked files sync. each job gets an immutable snapshot. use .rvignore to exclude extra files.
save outputs to persistent storage
your job runs inside an ephemeral snapshot (pruned after 7 days). use RV_OUTPUT_DIR (set automatically), the --output flag, or write to /scratch/ directly.
output buffering
rv auto-sets PYTHONUNBUFFERED=1. if you still see no output, check rv logs --err — the job may have crashed.
rv exec is login-node only
no GPU access. use for file checks and queries. for GPU utilization, use rv gpu.
backfill scheduling
jobs under 3 hours qualify for backfill — often near-instant allocation. the default walltime (2:59:00) is set just below this threshold.
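a quick way to check whether a walltime stays under the backfill threshold. this is a sketch, assuming the `H:MM:SS` format used for the default walltime and a strict "under 3 hours" comparison; `parse_walltime` and `qualifies_for_backfill` are hypothetical helpers, not part of rv.

```python
from datetime import timedelta

BACKFILL_LIMIT = timedelta(hours=3)  # threshold stated above


def parse_walltime(text):
    """Parse 'H:MM:SS' (e.g. '2:59:00') into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def qualifies_for_backfill(walltime):
    """True when the requested walltime is strictly under the 3-hour limit."""
    return parse_walltime(walltime) < BACKFILL_LIMIT


print(qualifies_for_backfill("2:59:00"))  # the default walltime
print(qualifies_for_backfill("6:00:00"))
```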
use MIG as a pre-flight check
validate your full pipeline on a free MIG slice before requesting expensive GPUs. MIG has 10GB VRAM — enough to catch import errors, config bugs, and path issues:
rv run --mig python train.py # free, instant — catches 90% of bugs
rv run -t a100 --time 6h python train.py # submit the real run after validation
queue times
| GPU | typical wait | SU/GPU-hr | VRAM |
|---|---|---|---|
| mig | instant | FREE | 10 GB |
| v100 | ~3 days | 21 | 32 GB |
| a6000 | ~18 hours | 143 | 48 GB |
| a100 | ~10 hours | 509 | 80 GB |
| h200 | varies | 817 | 141 GB |
check real-time availability with rv status.
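the SU column can be turned into a rough cost estimate before you submit. the sketch below assumes billing is linear (rate x GPUs x hours); the actual charge may depend on elapsed rather than requested time, and `estimate_su` is a hypothetical helper, not an rv command.

```python
# SU rates per GPU-hour, copied from the table above.
SU_PER_GPU_HOUR = {"mig": 0, "v100": 21, "a6000": 143, "a100": 509, "h200": 817}


def estimate_su(gpu_type, num_gpus, hours):
    """Estimate SUs as rate x GPUs x hours (assumed linear billing)."""
    return SU_PER_GPU_HOUR[gpu_type] * num_gpus * hours


print(estimate_su("a100", 1, 6))   # one A100 for six hours
print(estimate_su("a6000", 2, 3))  # two A6000s for three hours
```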
rv sets --mem automatically based on your GPU count and node specs. override with --mem 200G if you need more (e.g., large dataset loading, preprocessing). rule of thumb: 2-3x your total VRAM is safe for most training workloads.
training overview
single GPU
rv run --mig python train.py # free MIG slice
rv run -g 1 -t a6000 python train.py # dedicated GPU
multi-GPU (DDP/FSDP)
rv run -g 2 -t a6000 -- torchrun --nproc_per_node=2 train.py
multi-node
rv run -g 4 -t a100 -- torchrun --nproc_per_node=2 train.py
rv handles srun + torchrun coordination automatically.
running inference
for large model inference (not training), use device_map="auto" to shard across GPUs on a single node.
rv run -g 4 -t a100 --single-node python generate.py
--single-node prevents multi-node strategies. device_map="auto" only shards within one node; multi-node would duplicate your workload. rv auto-detects inference scripts and skips multi-node automatically.
saving results
use RV_OUTPUT_DIR (set automatically in every job) or --output:
import os
output_dir = os.environ.get("RV_OUTPUT_DIR", "./results")
os.makedirs(output_dir, exist_ok=True)
rv run -g 4 --output ./results python generate.py
process groups
NCCL for GPU tensors, Gloo for CPU tensors. wrong backend = silent hang.
import torch.distributed as dist
dist.init_process_group("nccl") # GPU default
# CPU collectives (strings, dicts, metadata):
cpu_group = dist.new_group(backend="gloo")
output_list = [None] * dist.get_world_size() # one slot per rank
dist.all_gather_object(output_list, my_dict, group=cpu_group)
DDP
model = DDP(model, device_ids=[local_rank])
- never call model.module.forward() directly: it bypasses gradient sync
- find_unused_parameters=True for conditional/multi-head models
- save with model.module.state_dict() (unwrap DDP)
- sampler.set_epoch(epoch) in every epoch for correct shuffling
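why set_epoch matters: a distributed sampler derives its shuffle from a seed plus the epoch number, so leaving the epoch fixed replays the same order every epoch. the sketch below is a simplified pure-Python stand-in for that idea, not PyTorch's implementation; `shard_indices` is a hypothetical name.

```python
import random


def shard_indices(num_samples, rank, world_size, epoch, seed=0):
    """Return the sample indices one rank would see in one epoch."""
    order = list(range(num_samples))
    # Every rank builds the same permutation because the seed depends only on
    # (seed, epoch); skipping set_epoch is like keeping epoch fixed at 0.
    random.Random(seed + epoch).shuffle(order)
    return order[rank::world_size]  # strided split across ranks


epoch0 = [shard_indices(8, r, 2, epoch=0) for r in range(2)]
epoch1 = [shard_indices(8, r, 2, epoch=1) for r in range(2)]
print(epoch0)
print(epoch1)
```

within an epoch the ranks get disjoint shards that together cover the dataset; across epochs the order only changes if the epoch value changes.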
FSDP
| strategy | what's sharded | memory | speed |
|---|---|---|---|
| FULL_SHARD | params + grads + optimizer | lowest (63% savings) | slowest |
| SHARD_GRAD_OP | grads + optimizer | medium | medium |
| NO_SHARD | nothing (same as DDP) | highest | fastest |
wrapping policy
# NEVER use always_wrap_policy
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
mixed precision
import torch
from torch.distributed.fsdp import MixedPrecision
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32, # gradient reduction in fp32
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp_policy)
mixed precision
# BF16 (A100/H200): no GradScaler needed
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(input), target)
loss.backward()
optimizer.step()
# FP16 (older GPUs): needs GradScaler
scaler = torch.amp.GradScaler("cuda")
with torch.amp.autocast("cuda", dtype=torch.float16):
    loss = loss_fn(model(input), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
checkpointing
RV_CHECKPOINT_DIR is set automatically in every job, keyed by job name (not job ID). jobs submitted with the same --name share the same checkpoint directory, so resuming across runs works automatically.
import os
ckpt_dir = os.environ.get("RV_CHECKPOINT_DIR", "./checkpoints")
path = f"{ckpt_dir}/latest.pt"
# save (rank 0 only)
checkpoint = {
    'model': model.module.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
    'rng_cpu': torch.random.get_rng_state(),
    'rng_cuda': torch.cuda.get_rng_state(device),
}
if rank == 0: torch.save(checkpoint, path)
dist.barrier()
# load: MUST use map_location='cpu' and weights_only=False
ckpt = torch.load(path, map_location='cpu', weights_only=False)
- weights_only=True fails on optimizer/RNG states
- map_location='cuda:0' breaks RNG restore
- always save model.module.state_dict() (unwrapped)
- skipping optimizer state means Adam forgets momentum after resume
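one more thing worth guarding against: a job killed mid-write (for example at the walltime limit) can leave a truncated latest.pt. a common pattern is to write to a temporary file and rename it into place. the sketch below assumes the checkpoint directory is on a single filesystem; `checkpoint_path`, `atomic_save`, and `write_demo` are hypothetical helpers, and `write_demo` stands in for torch.save so the example runs without PyTorch.

```python
import os
import tempfile


def checkpoint_path(default_dir="./checkpoints", filename="latest.pt"):
    """Resolve the checkpoint file inside RV_CHECKPOINT_DIR (or a local fallback)."""
    ckpt_dir = os.environ.get("RV_CHECKPOINT_DIR", default_dir)
    os.makedirs(ckpt_dir, exist_ok=True)
    return os.path.join(ckpt_dir, filename)


def atomic_save(write_fn, path):
    """Write to a temp file in the same directory, then rename it over `path`."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)          # e.g. lambda p: torch.save(checkpoint, p)
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # never leave a partial file behind
        raise


def write_demo(p):
    # Stand-in for torch.save so the sketch runs without PyTorch.
    with open(p, "wb") as f:
        f.write(b"demo checkpoint")


with tempfile.TemporaryDirectory() as demo_dir:
    path = checkpoint_path(default_dir=demo_dir)
    atomic_save(write_demo, path)
    print(os.path.exists(path))
```

in a real job you would call this on rank 0 only, with `write_fn` wrapping torch.save, and keep the dist.barrier() afterwards.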
RLHF & GRPO
# GRPO: generate G completions, normalize rewards within group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -(log_probs * advantages).mean() + kl_coef * kl
- reference model must be frozen: ref_model.eval() plus ref_model.requires_grad_(False), since eval() alone does not stop gradient tracking
- multi-GPU: sync rewards and advantages across ranks
- OpenRLHF needs Gloo for CPU reward aggregation
- memory: 4x model size minimum (actor + critic + reward + ref)
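a small worked example of the group normalization in the GRPO snippet above, written in plain Python so it runs without a GPU. the reward values are made up, and `group_advantages` is a hypothetical helper; it uses the sample standard deviation to match the default behaviour of torch's `.std()`.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of G completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std (n-1), like torch.Tensor.std() by default
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]  # made-up rewards for G = 4 completions
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])
```

completions above the group mean get a positive advantage and those below get a negative one. if every completion in a group earns the same reward, all advantages are zero and that group contributes no policy gradient.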
debugging
hangs
- mismatched collectives across ranks → deadlock
- missing dist.barrier() on some ranks
- NCCL timeout: increase with NCCL_TIMEOUT=1800
- data loader length mismatch: one rank finishes early
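the data loader mismatch is easy to check ahead of time. the sketch below counts batches per rank for a naive strided split without padding, which is an assumption about how the data is sharded: PyTorch's DistributedSampler pads shards to equal length by default, so this mostly applies to custom sharding or iterable datasets. `batches_per_rank` is a hypothetical helper.

```python
def batches_per_rank(num_samples, world_size, batch_size, drop_last=False):
    """Batches each rank sees when samples are split by stride without padding."""
    counts = []
    for rank in range(world_size):
        samples = len(range(rank, num_samples, world_size))
        full, rest = divmod(samples, batch_size)
        counts.append(full if drop_last or rest == 0 else full + 1)
    return counts


counts = batches_per_rank(num_samples=10, world_size=4, batch_size=1)
print(counts, "mismatch" if len(set(counts)) > 1 else "ok")
```

if the counts differ, the ranks with fewer batches leave the training loop first while the others wait inside a collective, which shows up as a hang.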
OOM
- gradient accumulation → smaller micro-batch
- mixed precision → halves activation memory
- activation checkpointing → trades compute for memory
- FSDP FULL_SHARD → 63% memory savings
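for the gradient accumulation option, the arithmetic is: global batch = micro-batch x GPUs x accumulation steps. a minimal sketch with made-up numbers follows; `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(global_batch, micro_batch, world_size):
    """Steps to accumulate so micro_batch x world_size x steps == global_batch."""
    per_step = micro_batch * world_size
    if global_batch % per_step != 0:
        raise ValueError("global_batch must be a multiple of micro_batch * world_size")
    return global_batch // per_step


# Made-up numbers: keep a global batch of 256 on 4 GPUs while shrinking the micro-batch.
print(accumulation_steps(256, 32, 4))
print(accumulation_steps(256, 8, 4))
```

in the training loop this usually means dividing the loss by the step count and calling optimizer.step() only once per accumulated group of micro-batches.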
troubleshooting
job failed — where to start?
check stderr first — it's almost always where the real error is:
rv logs <jobId> --err # stderr (errors, tracebacks)
rv logs <jobId> # stdout (training output)
job stuck in PENDING
check rv status, try a different GPU type, use --mig for instant free allocation, or reduce --time below 3h for backfill.
no output files
check rv logs --err for errors. use RV_OUTPUT_DIR (set automatically), --output flag, or write to /scratch/. don't use /tmp/ (node-local). the job CWD is the snapshot, not your live code.
can't find results
rv exec "ls /scratch/USER/.rv/logs/" # log files
rv exec "ls /scratch/USER/rv-workspaces/" # synced code
rv sync pull /remote/path ./local/ # download