rivanna.dev

guides

tips, GPU training best practices, and troubleshooting — all verified on Rivanna.

tips & gotchas

argument ordering

rv options must come before the command. anything after the command is passed through to it unchanged. rv warns if it detects misplaced flags.

rv run -g 4 -t a100 python train.py ✓

rv run python train.py -g 4 -t a100 ✗
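the ordering rule can be pictured with a small sketch. this is a hypothetical model, not rv's real parser: split_rv_args and RV_FLAGS_WITH_VALUE are made-up names, and only the two flags from the examples above are handled.

```python
# Hypothetical sketch: NOT rv's real parser. It only models the rule
# "rv options first, then the command; everything after is passed through".
RV_FLAGS_WITH_VALUE = {"-g", "-t"}  # the two flags used in the examples above


def split_rv_args(argv):
    """Split argv into (rv_options, command) at the first non-rv token."""
    options = {}
    i = 0
    while i + 1 < len(argv) and argv[i] in RV_FLAGS_WITH_VALUE:
        options[argv[i]] = argv[i + 1]
        i += 2
    return options, argv[i:]


good = split_rv_args(["-g", "4", "-t", "a100", "python", "train.py"])
bad = split_rv_args(["python", "train.py", "-g", "4", "-t", "a100"])
print(good)  # rv sees -g and -t; the command is just "python train.py"
print(bad)   # rv sees no options; -g and -t become arguments to train.py
```

in the second case the GPU flags never reach rv, which is why the misplaced form is marked ✗.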

file sync

rv run uploads your current directory. only git-tracked files sync. each job gets an immutable snapshot. use .rvignore to exclude extra files.

save outputs to persistent storage

your job runs inside an ephemeral snapshot (pruned after 7 days). use RV_OUTPUT_DIR (set automatically), the --output flag, or write to /scratch/ directly.

output buffering

rv auto-sets PYTHONUNBUFFERED=1. if you still see no output, check rv logs --err — the job may have crashed.

rv exec is login-node only

no GPU access. use for file checks and queries. for GPU utilization, use rv gpu.

backfill scheduling

jobs under 3 hours qualify for backfill — often near-instant allocation. the default walltime (2:59:00) is set just below this threshold.
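a quick way to check a walltime against the threshold, as a sketch. the helper names are hypothetical, and the H:MM:SS format is assumed from the default shown above.

```python
BACKFILL_LIMIT_S = 3 * 3600  # "under 3 hours", per the text above


def walltime_to_seconds(walltime):
    """Convert an H:MM:SS string such as "2:59:00" to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def qualifies_for_backfill(walltime):
    """True when the requested walltime is strictly below the 3-hour limit."""
    return walltime_to_seconds(walltime) < BACKFILL_LIMIT_S


print(qualifies_for_backfill("2:59:00"))  # True
print(qualifies_for_backfill("3:00:00"))  # False
print(qualifies_for_backfill("6:00:00"))  # False
```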

use MIG as a pre-flight check

validate your full pipeline on a free MIG slice before requesting expensive GPUs. a MIG slice has 10 GB VRAM, enough to catch import errors, config bugs, and path issues:

rv run --mig python train.py              # free, instant — catches 90% of bugs
rv run -t a100 --time 6h python train.py  # submit the real run after validation

queue times

GPU	typical wait	SU/GPU-hr	VRAM
mig	instant	FREE	10 GB
v100	~3 days	21	32 GB
a6000	~18 hours	143	48 GB
a100	~10 hours	509	80 GB
h200	varies	817	141 GB

check real-time availability with rv status.
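the SU column can be turned into a rough cost estimate. this sketch assumes SUs scale linearly with GPU count and hours, which the SU/GPU-hr unit suggests but the table does not spell out; job_cost_su is a hypothetical helper.

```python
# SU rates copied from the table above.
SU_PER_GPU_HOUR = {"mig": 0, "v100": 21, "a6000": 143, "a100": 509, "h200": 817}


def job_cost_su(gpu_type, gpus, hours):
    """Estimated SUs for a job, assuming linear billing per GPU-hour."""
    return SU_PER_GPU_HOUR[gpu_type] * gpus * hours


print(job_cost_su("a100", 4, 6))   # 12216
print(job_cost_su("a6000", 2, 3))  # 858
print(job_cost_su("mig", 1, 1))    # 0
```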

system memory (--mem)

rv auto-calculates --mem based on your GPU count and node specs. override with --mem 200G if you need more (e.g., large dataset loading, preprocessing). rule of thumb: 2-3x your total VRAM is safe for most training workloads.
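the 2-3x rule of thumb as arithmetic, using the VRAM column from the queue times table. suggested_mem_gb is a hypothetical helper, and the result can exceed what a node actually has, so treat it as a starting point only.

```python
# VRAM per GPU in GB, copied from the queue times table.
VRAM_GB = {"mig": 10, "v100": 32, "a6000": 48, "a100": 80, "h200": 141}


def suggested_mem_gb(gpu_type, gpus, factor=2):
    """System RAM suggestion: factor x total VRAM (rule of thumb: factor 2 to 3)."""
    total_vram = VRAM_GB[gpu_type] * gpus
    return total_vram * factor


low = suggested_mem_gb("a6000", 2, factor=2)
high = suggested_mem_gb("a6000", 2, factor=3)
print(f"--mem {low}G to --mem {high}G")  # --mem 192G to --mem 288G
```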

training overview

single GPU

rv run --mig python train.py              # free MIG slice
rv run -g 1 -t a6000 python train.py      # dedicated GPU

multi-GPU (DDP/FSDP)

rv run -g 2 -t a6000 -- torchrun --nproc_per_node=2 train.py

multi-node

rv run -g 4 -t a100 -- torchrun --nproc_per_node=2 train.py

rv handles srun + torchrun coordination automatically.

precision: use BF16 on GPUs with compute capability >= 8 (A100, H200). use FP16 + GradScaler on older GPUs such as V100.
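the precision choice can be made at runtime. a minimal sketch: pick_amp_dtype_name is a hypothetical helper that takes the (major, minor) tuple returned by torch.cuda.get_device_capability() and returns a dtype name, so the logic itself runs without a GPU.

```python
def pick_amp_dtype_name(capability):
    """Return "bfloat16" for compute capability >= 8, else "float16"."""
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


# Capability tuples for GPUs listed in the queue times table.
for name, cap in [("v100", (7, 0)), ("a100", (8, 0)), ("h200", (9, 0))]:
    print(name, pick_amp_dtype_name(cap))
# v100 float16
# a100 bfloat16
# h200 bfloat16
```

in a training script this would look like dtype = getattr(torch, pick_amp_dtype_name(torch.cuda.get_device_capability())), with a GradScaler created only in the float16 case.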

running inference

for large model inference (not training), use device_map="auto" to shard across GPUs on a single node.

rv run -g 4 -t a100 --single-node python generate.py

--single-node prevents multi-node strategies. device_map="auto" only shards within one node — multi-node would duplicate your workload. rv auto-detects inference scripts and skips multi-node automatically.

saving results

use RV_OUTPUT_DIR (set automatically in every job) or --output:

import os
output_dir = os.environ.get("RV_OUTPUT_DIR", "./results")
os.makedirs(output_dir, exist_ok=True)

rv run -g 4 --output ./results python generate.py
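building on the snippet above, a small sketch that writes a metrics file into the output directory. save_results and the metrics.json filename are hypothetical; only RV_OUTPUT_DIR comes from rv.

```python
import json
import os


def save_results(metrics, filename="metrics.json"):
    """Write a dict as JSON under RV_OUTPUT_DIR (falls back to ./results)."""
    output_dir = os.environ.get("RV_OUTPUT_DIR", "./results")
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return path
```

calling save_results({"loss": 0.42}) at the end of a run returns the path it wrote, which is handy to print so it shows up in rv logs.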

memory estimation: model_params x bytes_per_param x 1.1 (about 10% overhead). a 70B bf16 model needs ~154 GB VRAM (70B x 2 bytes x 1.1).
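the estimate as a helper, plus a GPU count derived from it. both function names are hypothetical, and the count ignores per-GPU overhead such as the KV cache, so read it as a lower bound.

```python
import math


def estimate_vram_gb(params, bytes_per_param, overhead=1.1):
    """model_params x bytes_per_param x overhead, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param * overhead / 1e9


def gpus_needed(params, bytes_per_param, vram_per_gpu_gb):
    """Minimum number of GPUs whose combined VRAM covers the estimate."""
    return math.ceil(estimate_vram_gb(params, bytes_per_param) / vram_per_gpu_gb)


print(round(estimate_vram_gb(7e9, 2), 1))   # 15.4  (7B model in bf16)
print(round(estimate_vram_gb(13e9, 4), 1))  # 57.2  (13B model in fp32)
print(gpus_needed(34e9, 2, 48))             # 2     (34B bf16 on 48 GB cards)
```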

process groups

NCCL for GPU tensors, Gloo for CPU tensors. wrong backend = silent hang.

import torch.distributed as dist

dist.init_process_group("nccl")  # GPU default

# CPU collectives (strings, dicts, metadata):
cpu_group = dist.new_group(backend="gloo")
output_list = [None] * dist.get_world_size()  # one slot per rank
dist.all_gather_object(output_list, my_dict, group=cpu_group)

DDP

from torch.nn.parallel import DistributedDataParallel as DDP
model = DDP(model, device_ids=[local_rank])

never call model.module.forward() directly — bypasses gradient sync
find_unused_parameters=True for conditional/multi-head models
save with model.module.state_dict() (unwrap DDP)
sampler.set_epoch(epoch) in every epoch for correct shuffling
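why set_epoch matters, as a simplified pure-Python model. this is not the real DistributedSampler code; shard_indices is a hypothetical stand-in that mimics its idea of seeding the shuffle with seed + epoch and giving each rank a strided slice.

```python
import random


def shard_indices(dataset_len, rank, world_size, epoch, seed=0):
    """Simplified model of a distributed sampler's per-epoch shuffle."""
    indices = list(range(dataset_len))
    random.Random(seed + epoch).shuffle(indices)  # same permutation on every rank
    return indices[rank::world_size]              # each rank takes a strided slice


# Forgetting set_epoch behaves like passing epoch=0 forever:
stale = [shard_indices(1000, rank=0, world_size=2, epoch=0) for _ in range(3)]
# Calling set_epoch(epoch) changes the shuffle seed every epoch:
fresh = [shard_indices(1000, rank=0, world_size=2, epoch=e) for e in range(3)]

print([s[:5] for s in stale])  # identical order in all three epochs
print([s[:5] for s in fresh])  # shuffle seed changes with the epoch
```

within one epoch the ranks still see disjoint slices of the same permutation, so no sample is duplicated across GPUs.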

FSDP

strategy	what's sharded	memory	speed
FULL_SHARD	params + grads + optimizer	lowest (63% savings)	slowest
SHARD_GRAD_OP	grads + optimizer	medium	medium
NO_SHARD	nothing (same as DDP)	highest	fastest
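a back-of-the-envelope view of the table, as an idealized sketch. assumptions (not from the table): fp32 params and grads at 4 bytes each, Adam state at 8 bytes per parameter, no activations and no communication buffers. measured savings, such as the 63% figure above, include those extra terms and will differ. per_gpu_state_gb is a hypothetical helper.

```python
def per_gpu_state_gb(params, world_size, strategy,
                     param_bytes=4, grad_bytes=4, optim_bytes=8):
    """Idealized per-GPU memory (GB) for params + grads + optimizer state."""
    p = params * param_bytes
    g = params * grad_bytes
    o = params * optim_bytes
    if strategy == "FULL_SHARD":       # everything is split across ranks
        total = (p + g + o) / world_size
    elif strategy == "SHARD_GRAD_OP":  # params counted as replicated; grads + optimizer split
        total = p + (g + o) / world_size
    elif strategy == "NO_SHARD":       # full replica on every rank, like DDP
        total = p + g + o
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return total / 1e9


for s in ("NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD"):
    print(s, per_gpu_state_gb(1e9, world_size=4, strategy=s))
# NO_SHARD 16.0
# SHARD_GRAD_OP 7.0
# FULL_SHARD 4.0
```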

wrapping policy

# NEVER use always_wrap_policy
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
model = FSDP(model, auto_wrap_policy=auto_wrap_policy)  # plus any other FSDP arguments

mixed precision

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,  # gradient reduction in fp32
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp_policy)

CPU offload saves ~29% GPU memory but is 26x slower. last resort only.

mixed precision

import torch

# BF16 (A100/H200): no GradScaler needed
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(inputs), target)
loss.backward()
optimizer.step()

# FP16 (older GPUs): needs GradScaler
scaler = torch.amp.GradScaler("cuda")
with torch.amp.autocast("cuda", dtype=torch.float16):
    loss = loss_fn(model(inputs), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

checkpointing

RV_CHECKPOINT_DIR is set automatically in every job, keyed by job name (not job ID). jobs submitted with the same --name share the same checkpoint directory, so resuming across runs works automatically.

import os
import torch
import torch.distributed as dist

ckpt_dir = os.environ.get("RV_CHECKPOINT_DIR", "./checkpoints")
os.makedirs(ckpt_dir, exist_ok=True)
path = os.path.join(ckpt_dir, "latest.pt")

# save (rank 0 only)
checkpoint = {
    'model': model.module.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch,
    'rng_cpu': torch.random.get_rng_state(),
    'rng_cuda': torch.cuda.get_rng_state(device),
}
if rank == 0: torch.save(checkpoint, path)
dist.barrier()

# load — MUST use map_location='cpu' and weights_only=False
ckpt = torch.load(path, map_location='cpu', weights_only=False)

weights_only=True fails on optimizer/RNG states
map_location='cuda:0' breaks RNG restore
always save model.module.state_dict() (unwrapped)
skipping optimizer state means Adam forgets momentum after resume
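a sketch of the full save/resume round trip implied by the notes above. save_checkpoint and load_checkpoint are hypothetical helper names; pass the unwrapped model (model.module when using DDP), and in a multi-GPU job call the save on rank 0 only. the CUDA RNG branch is skipped on machines without a GPU.

```python
import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Save model, optimizer, epoch, and RNG state in one file."""
    ckpt = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "rng_cpu": torch.random.get_rng_state(),
    }
    if torch.cuda.is_available():
        ckpt["rng_cuda"] = torch.cuda.get_rng_state()
    torch.save(ckpt, path)


def load_checkpoint(path, model, optimizer):
    """Restore everything saved above and return the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.random.set_rng_state(ckpt["rng_cpu"])  # expects a CPU byte tensor
    if "rng_cuda" in ckpt and torch.cuda.is_available():
        torch.cuda.set_rng_state(ckpt["rng_cuda"])
    return ckpt["epoch"] + 1
```

restoring the RNG state last means any random draws after the resume continue the sequence from the moment of the save.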

RLHF & GRPO

# GRPO: generate G completions, normalize rewards within group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = -(log_probs * advantages).mean() + kl_coef * kl

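reference model must be frozen: call ref_model.eval() and ref_model.requires_grad_(False), and run it under torch.no_grad(). eval() alone only changes dropout/batch-norm behavior; it does not stop gradients.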
multi-GPU: sync rewards and advantages across ranks
OpenRLHF needs Gloo for CPU reward aggregation
memory: 4x model size minimum for PPO-style RLHF (actor + critic + reward + ref). GRPO drops the critic, so it needs fewer copies.
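the group normalization from the snippet above, in plain Python so it runs without a GPU. group_advantages is a hypothetical helper; it uses the sample standard deviation, which matches the default of rewards.std() on a torch tensor.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of G completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std (n - 1 in the denominator)
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed)  # positive for rewarded completions, negative for the rest
print(flat)   # [0.0, 0.0, 0.0, 0.0]
```

a group where every completion gets the same reward yields zero advantages, so it contributes nothing to the policy gradient term.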

debugging

hangs

mismatched collectives across ranks → deadlock
missing dist.barrier() on some ranks
NCCL timeout — increase with NCCL_TIMEOUT=1800
data loader length mismatch — one rank finishes early
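the data loader mismatch is easy to check before launching. a sketch with hypothetical helper names: given how many samples each rank will see, it computes the number of batches per rank and flags any disagreement.

```python
import math


def steps_per_rank(samples_per_rank, batch_size, drop_last=False):
    """Number of batches each rank will iterate over."""
    round_fn = math.floor if drop_last else math.ceil
    return [round_fn(n / batch_size) for n in samples_per_rank]


def check_equal_steps(samples_per_rank, batch_size, drop_last=False):
    """Return (ok, steps); ok is False when ranks would run different step counts."""
    steps = steps_per_rank(samples_per_rank, batch_size, drop_last)
    return len(set(steps)) == 1, steps


print(check_equal_steps([1000, 1000, 1000, 1000], batch_size=32))  # (True, [32, 32, 32, 32])
print(check_equal_steps([1000, 1000, 1000, 990], batch_size=32))   # (False, [32, 32, 32, 31])
```

in the second case the last rank leaves its loop after 31 steps while the others wait in the 32nd gradient all-reduce, which shows up as a hang rather than an error.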

OOM

gradient accumulation → smaller micro-batch
mixed precision → halves activation memory
activation checkpointing → trades compute for memory
FSDP FULL_SHARD → 63% memory savings
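for the gradient accumulation option, the arithmetic is: global batch = micro-batch x accumulation steps x number of GPUs. a sketch with a hypothetical helper that solves for the accumulation steps:

```python
def accumulation_steps(global_batch, micro_batch, world_size):
    """Steps to accumulate so micro_batch x steps x world_size == global_batch."""
    per_step = micro_batch * world_size
    if global_batch % per_step != 0:
        raise ValueError("global_batch must be a multiple of micro_batch * world_size")
    return global_batch // per_step


print(accumulation_steps(256, micro_batch=8, world_size=4))  # 8
print(accumulation_steps(256, micro_batch=4, world_size=4))  # 16
```

halving the micro-batch doubles the accumulation steps and keeps the effective batch size, and therefore the optimization behavior, roughly unchanged while lowering peak activation memory.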

troubleshooting

job failed — where to start?

check stderr first — it's almost always where the real error is:

rv logs <jobId> --err    # stderr (errors, tracebacks)
rv logs <jobId>          # stdout (training output)

job stuck in PENDING

check rv status, try a different GPU type, use --mig for instant free allocation, or reduce --time below 3h for backfill.

no output files

check rv logs --err for errors. use RV_OUTPUT_DIR (set automatically), --output flag, or write to /scratch/. don't use /tmp/ (node-local). the job CWD is the snapshot, not your live code.

can't find results

rv exec "ls /scratch/USER/.rv/logs/"       # log files
rv exec "ls /scratch/USER/rv-workspaces/"  # synced code
rv sync pull /remote/path ./local/         # download