smart allocator

rv doesn't just submit to one partition and wait. it probes the cluster, generates every compatible strategy, and submits them all in parallel. the first one to start running wins — the rest get cancelled.

how it works

  1. probe cluster — rv queries Slurm for current GPU availability, queue depth, and backfill windows across all partitions
  2. generate strategies — for your requested GPU count, rv generates all compatible combinations: GPU type, partition, single-node vs multi-node topology, direct vs backfill vs checkpoint-restart
  3. rank and prune — strategies are ranked by estimated wait time and SU cost. dominated strategies (same GPU type and topology but worse on all metrics) are pruned
  4. fan-out submit — all surviving strategies are submitted to Slurm simultaneously
  5. first wins — rv monitors all submissions. the first job to reach RUNNING state wins; all other jobs are cancelled (even if multiple strategies started running simultaneously)

rv up --dry-run   # see all strategies without submitting
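
the rank-and-prune step can be sketched as follows. this is a minimal illustration, not rv's actual implementation: the Strategy class, its field names, and the sort order (wait time first, SU cost as tie-breaker) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    # Hypothetical fields; rv's internal representation is not documented here.
    gpu_type: str
    topology: str        # e.g. "1x4" (single node) or "2x2" (multi-node)
    est_wait_min: float  # estimated queue wait, in minutes
    su_cost: float       # estimated SU cost


def dominates(a: Strategy, b: Strategy) -> bool:
    """True if a and b share GPU type and topology and b is worse on every metric."""
    same_group = a.gpu_type == b.gpu_type and a.topology == b.topology
    return same_group and a.est_wait_min < b.est_wait_min and a.su_cost < b.su_cost


def rank_and_prune(strategies: list[Strategy]) -> list[Strategy]:
    # Drop every strategy that some other strategy dominates.
    survivors = [s for s in strategies
                 if not any(dominates(other, s) for other in strategies)]
    # Assumed ranking: shortest estimated wait first, cheaper cost breaks ties.
    return sorted(survivors, key=lambda s: (s.est_wait_min, s.su_cost))
```

a strategy that is slower but cheaper than its peer is not dominated, so it survives pruning and still gets submitted.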

fan-out strategy

when you request GPUs without specifying a type, rv submits to every compatible partition at once. for example, requesting 4 GPUs might generate strategies for A6000, A40, A100 (40GB), A100 (80GB), V100, and multi-node variants (2x2) for each.

this works because Slurm allows multiple pending jobs. whichever partition has resources first wins. rv cancels the losers automatically — you're never charged for jobs that don't run.
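
the first-wins rule can be sketched as a small polling loop. this is an illustration under assumptions, not rv's code: first_to_run, get_state, and cancel are hypothetical names, and ties are broken by submission order only because the example needs some rule.

```python
import time
from typing import Callable, Iterable


def first_to_run(job_ids: Iterable[str],
                 get_state: Callable[[str], str],
                 cancel: Callable[[str], None],
                 poll_seconds: float = 5.0) -> str:
    """Wait until one job is RUNNING, cancel every other job, return the winner."""
    jobs = list(job_ids)
    while True:
        running = [j for j in jobs if get_state(j) == "RUNNING"]
        if running:
            winner = running[0]  # assumption: ties go to the earliest submission
            for job in jobs:
                if job != winner:
                    cancel(job)  # losers are cancelled even if they also started
            return winner
        time.sleep(poll_seconds)
```

on a Slurm cluster, get_state could wrap `squeue -h -j <id> -o %T` and cancel could wrap `scancel <id>`. a real monitor would also need to handle the case where every job fails before any of them starts.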

if you specify a GPU type with --type, rv only generates strategies for that type (but still explores single-node, multi-node, direct, and backfill variants).

backfill detection

Slurm's backfill scheduler can start smaller or shorter jobs ahead of higher-priority ones when they fit into gaps in the schedule without delaying those jobs. rv detects these windows using sbatch --test-only and generates backfill strategies with --time-min set to the detected window.

this is why the default walltime of 2:59:00 is recommended — jobs under 3 hours are most likely to find backfill opportunities.
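
as a rough illustration of reading a scheduling estimate, the sketch below parses an estimated start time out of a line of `sbatch --test-only` output. the message shape in the regex and the sample line used below are assumptions, so check them against the Slurm version on your cluster. how rv turns such estimates into a backfill window is not described here, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed message shape: "... to start at YYYY-MM-DDTHH:MM:SS ..."
_START_RE = re.compile(r"to start at (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def parse_test_only(message: str):
    """Return the estimated start time found in the message, or None."""
    match = _START_RE.search(message)
    return datetime.fromisoformat(match.group(1)) if match else None


def wait_minutes(message: str, now: datetime):
    """Minutes from `now` until the estimated start (never negative), or None."""
    start = parse_test_only(message)
    if start is None:
        return None
    return max(0.0, (start - now).total_seconds() / 60)


# Illustrative sample line, not captured from a real cluster.
SAMPLE = ("sbatch: Job 123456 to start at 2025-01-15T14:30:00 "
          "using 4 processors on nodes node01 in partition gpu")
```

for example, wait_minutes(SAMPLE, datetime(2025, 1, 15, 14, 0)) gives a 30-minute estimated wait.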

checkpoint-restart

for long-running jobs (e.g. 24h training), rv can break the work into segments that fit within backfill windows. each segment:

  1. runs your command with a timeout set to the walltime minus a 10-minute buffer
  2. sends SIGUSR1 to your process before time expires (your code should save a checkpoint)
  3. auto-resubmits the same script with RV_TOTAL_ELAPSED tracking cumulative time
  4. stops resubmitting once the total requested time has been reached
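
the timing arithmetic behind these steps can be sketched as below. the 10-minute buffer comes from step 1. the segment count relies on an assumption that only the time spent running your command counts toward the total, which may not match how rv tracks RV_TOTAL_ELAPSED. the function names are hypothetical.

```python
import math
from datetime import timedelta

BUFFER = timedelta(minutes=10)  # buffer subtracted from each segment's walltime


def segment_timeout(walltime: timedelta) -> timedelta:
    """Time the command is allowed to run inside one segment."""
    if walltime <= BUFFER:
        raise ValueError("walltime must exceed the 10-minute buffer")
    return walltime - BUFFER


def segments_needed(total: timedelta, walltime: timedelta) -> int:
    """Assumption: only command run time counts toward the requested total."""
    return math.ceil(total / segment_timeout(walltime))
```

with the default 2:59:00 walltime, each segment runs the command for 2:49:00, so under this assumption a 24h job needs 9 segments.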

checkpoint strategies only appear when backfill windows are available but shorter than your total requested time. your training code needs to handle SIGUSR1 by saving state and resuming from the latest checkpoint on restart.

gpu types

available GPU types on Rivanna. MIG slices are free and don't consume SUs.

type       VRAM     SU/GPU-hr   max/user   max/job   per node
mig        10 GB    free        28         1         56
v100       32 GB    20.96       32         4         4
rtx3090    24 GB    113.23      2          2         4
a6000      48 GB    142.73      32         8         8
a40        48 GB    186.69      32         8         8
a100_40    40 GB    463.81      32         8         8
a100_80    80 GB    508.89      32         8         8
h200       141 GB   816.67      4          4         8

A100 (80GB) nodes have InfiniBand and NVLink interconnects. use rv cost to estimate SU costs for your job configuration.
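
for a quick back-of-the-envelope estimate, the sketch below multiplies the SU/GPU-hr rate by GPU count and hours. the linear formula and the estimate_su name are assumptions; rv cost remains the authoritative figure and may account for things this does not.

```python
# SU per GPU-hour, copied from the gpu types table; MIG slices are free.
SU_PER_GPU_HOUR = {
    "mig": 0.0,
    "v100": 20.96,
    "rtx3090": 113.23,
    "a6000": 142.73,
    "a40": 186.69,
    "a100_40": 463.81,
    "a100_80": 508.89,
    "h200": 816.67,
}


def estimate_su(gpu_type: str, gpus: int, hours: float) -> float:
    """Assumed formula: rate x GPU count x hours of walltime."""
    return SU_PER_GPU_HOUR[gpu_type] * gpus * hours
```

for example, 4 v100 GPUs for 3 hours comes to roughly 4 x 3 x 20.96 = 251.52 SUs under this formula.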