rivanna.dev

configuration

rv stores its configuration in ~/.rv/config.toml, which is created automatically by rv init.

config file

full example of ~/.rv/config.toml:

[connection]
host = "rivanna"
user = "abc1de"
hostname = "rivanna.hpc.virginia.edu"

[defaults]
account = "mygroup"
gpu_type = "a6000"
time = "2:59:00"
partition = "gpu-a6000"
ai_naming = true
ai_api_key = "sk-..."

[paths]
scratch = "/scratch/abc1de"
home = "/home/abc1de"

[notifications]
enabled = true
email = "abc1de@virginia.edu"
token = "..."  # auto-generated by rv init

[scratch_keepalive]
enabled = true  # default: true

[shared]
hf_cache = "/standard/mygroup/.cache/huggingface"  # optional

defaults

default values used when flags are not specified on the command line:

key	description	example
account	Slurm account (allocation group)	mygroup
gpu_type	default GPU type when --type is omitted	a6000
time	default walltime	2:59:00
partition	default Slurm partition	gpu-a6000
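as an illustration of the fallback described above, here is a minimal sketch of how defaults could fill in for omitted flags. the helper name resolve_options and the dict shapes are hypothetical and are not rv's actual internals; the override value a100 is only an example.

```python
from typing import Dict, Optional

# Values mirroring the [defaults] table in the example config.
DEFAULTS: Dict[str, str] = {
    "account": "mygroup",
    "gpu_type": "a6000",
    "time": "2:59:00",
    "partition": "gpu-a6000",
}


def resolve_options(
    flags: Dict[str, Optional[str]], defaults: Dict[str, str]
) -> Dict[str, str]:
    """Return the defaults, overridden by every flag that was actually given.

    Hypothetical helper: a flag whose value is None counts as "not specified".
    """
    resolved = dict(defaults)
    for key, value in flags.items():
        if value is not None:
            resolved[key] = value
    return resolved


# Example: only the GPU type is given on the command line.
options = resolve_options({"gpu_type": "a100", "time": None}, DEFAULTS)
```

in this sketch, options keeps the configured time and partition and only swaps the GPU type.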

notifications

rv can email you when jobs complete or fail. notifications are sent to your @virginia.edu address via HMAC-signed webhooks.

enable during rv init or set manually in config.toml:

[notifications]
enabled = true
email = "abc1de@virginia.edu"

notification events: COMPLETED, FAILED, TIMEOUT, RESUBMITTED (checkpoint-restart).
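to show what "HMAC-signed" means in practice, here is a minimal sketch of signing and verifying a webhook body. everything specific is an assumption: the SHA-256 digest, the JSON payload shape, the field names, and the use of the [notifications] token as the shared secret are illustrative, not rv's documented wire format.

```python
import hashlib
import hmac
import json
from typing import Tuple


def sign_payload(token: str, payload: dict) -> Tuple[bytes, str]:
    """Serialize the payload and return (body, hex signature).

    Assumption: HMAC-SHA256 over a compact, key-sorted JSON body.
    """
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(token.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_payload(token: str, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(token.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Hypothetical event payload using one of the event names listed above.
body, signature = sign_payload("example-token", {"event": "COMPLETED", "job_id": "12345"})
```

a receiver that holds the same token can call verify_payload to reject bodies that were altered or signed with a different secret.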

ai naming

rv can auto-generate descriptive job names using an AI model. set your API key in config.toml — rv auto-detects the provider from the key prefix:

key prefix	provider
sk-ant-	Anthropic
sk-	OpenAI

[defaults]
ai_naming = true
ai_api_key = "sk-ant-..."
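the prefix table implies an ordering detail: sk-ant- also begins with sk-, so the longer prefix has to be checked first. a minimal sketch of that lookup, with a hypothetical function name (rv's real detection code may differ):

```python
from typing import Optional

# Longest prefix first: an Anthropic key also matches the shorter "sk-" prefix.
PROVIDER_PREFIXES = [
    ("sk-ant-", "Anthropic"),
    ("sk-", "OpenAI"),
]


def detect_provider(api_key: str) -> Optional[str]:
    """Return the provider for a key, or None if no known prefix matches."""
    for prefix, provider in PROVIDER_PREFIXES:
        if api_key.startswith(prefix):
            return provider
    return None
```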

environment variables

use rv env to manage variables that are injected into every job. useful for API keys and other secrets. env vars are global, so keep experiment-specific settings in config files or command-line arguments (e.g., Hydra configs, argparse flags).

rv env import .env
rv env set HF_TOKEN hf_abc123...

rv env list

rv env rm HF_TOKEN

rv also auto-sets these variables in every job: OMP_NUM_THREADS, TOKENIZERS_PARALLELISM, HF_HOME, VLLM_CACHE_DIR, WANDB_DIR, WANDB_DATA_DIR, TRITON_CACHE_DIR, TORCH_HOME, RV_CHECKPOINT_DIR, RV_OUTPUT_DIR.
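a training script can read the auto-set variables directly. the sketch below picks up RV_CHECKPOINT_DIR and RV_OUTPUT_DIR and falls back to local folders when the script runs outside rv; the helper name and the fallback paths are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import Tuple


def job_dirs() -> Tuple[Path, Path]:
    """Return (checkpoint_dir, output_dir) for the current job.

    Inside an rv job both variables are set; the fallbacks (an assumption)
    keep the script usable on a laptop.
    """
    checkpoint_dir = Path(os.environ.get("RV_CHECKPOINT_DIR", "./checkpoints"))
    output_dir = Path(os.environ.get("RV_OUTPUT_DIR", "./outputs"))
    return checkpoint_dir, output_dir
```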

dependencies

rv auto-detects requirements.txt or pyproject.toml in your project and manages dependencies automatically. you don't need to install packages manually or manage virtual environments.

for a comprehensive guide covering the venv lifecycle, relative vs absolute paths, two-phase install, torchrun integration, and troubleshooting, see dependencies & environment.

how it works

rv creates a persistent venv at /scratch/user/.rv/envs/{project}/{branch}/
installs dependencies using uv pip install (not uv sync — these are different uv workflows)
the venv is activated automatically in every job via source .../bin/activate

your script runs with the venv already active — use bare python, not uv run python.

two-phase install

most packages install on the login node (fast). packages needing CUDA compilation (flash-attn, auto-gptq) are automatically deferred to the compute node where GPU + gcc are available. this happens transparently — no configuration needed.
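conceptually, the two-phase install splits the requirement list into a login-node batch and a compute-node batch. the sketch below does this with a fixed set of package names, which is an assumption: how rv actually decides which packages need CUDA compilation is not described here, and the set holds only the two examples named above.

```python
import re
from typing import List, Tuple

# Assumption: a hard-coded set; rv's real detection logic is not documented here.
CUDA_BUILD_PACKAGES = {"flash-attn", "auto-gptq"}


def package_name(requirement: str) -> str:
    """Extract a normalized name from a requirement line such as 'torch>=2.1'."""
    name = re.split(r"[<>=!~\[;@ ]", requirement.strip(), maxsplit=1)[0]
    # Treat '-', '_' and '.' as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", name).lower()


def split_install_phases(requirements: List[str]) -> Tuple[List[str], List[str]]:
    """Return (login_node_requirements, compute_node_requirements)."""
    login: List[str] = []
    compute: List[str] = []
    for req in requirements:
        if package_name(req) in CUDA_BUILD_PACKAGES:
            compute.append(req)
        else:
            login.append(req)
    return login, compute
```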

additional packages

if you need packages beyond your deps file (e.g., optional extras), add pip install to your script. it installs into rv's active venv and persists across runs:

# in your job script
pip install sae-lens bitsandbytes    # installs into rv's venv, cached for next run
python train.py

what NOT to do

don't use uv sync or uv run — these create a separate .venv and conflict with rv's venv
don't unset VIRTUAL_ENV — rv's activation is correct
don't create your own venv — rv manages this per-branch

shell-string commands (e.g., rv run "make train") skip dependency management entirely: rv treats them as opaque and you own your environment. however, commands starting with python or python3 (e.g., rv run python -c "...") still get deps installed and the venv activated.

cache symlinks

rv init automatically creates symlinks from ~/.cache/{name} to /scratch/user/.cache/{name} for all managed cache directories (huggingface, uv, pip, wandb, triton, torch). this ensures that any tool — not just rv-managed jobs — writes caches to scratch instead of filling your home directory quota.

if ~/.cache/{name} already exists as a real directory, rv migrates the contents to scratch and replaces it with a symlink. re-running rv init updates symlinks without data loss.

this prevents "Disk quota exceeded" errors from ~/.cache/ bloat. the scratch keepalive protects these directories from the 90-day purge.
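the migrate-then-symlink behavior can be sketched as follows. this is an illustration under assumptions, not rv's implementation: the function name is hypothetical, and a name collision between the home and scratch copies simply raises instead of being merged.

```python
import os
import shutil
from pathlib import Path


def link_cache_to_scratch(home_cache: Path, scratch_cache: Path) -> None:
    """Make home_cache a symlink to scratch_cache, migrating existing contents."""
    scratch_cache.mkdir(parents=True, exist_ok=True)

    if home_cache.is_symlink():
        # Re-run case: drop the old link and recreate it below.
        home_cache.unlink()
    elif home_cache.is_dir():
        # Real directory: move every entry to scratch before removing it.
        for entry in list(home_cache.iterdir()):
            target = scratch_cache / entry.name
            if target.exists():
                raise FileExistsError(f"refusing to overwrite {target}")
            shutil.move(str(entry), str(target))
        home_cache.rmdir()  # empty now, so rmdir is enough

    home_cache.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(scratch_cache, home_cache, target_is_directory=True)
```

calling the function a second time only recreates the link, which matches the idea that re-running rv init is safe.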

scratch keepalive

Rivanna's scratch filesystem has a 90-day purge policy: files not accessed in 90 days are automatically deleted. left unchecked, this would destroy your venvs, stored environment variable files, and caches.

rv automatically prevents this by touching all files under /scratch/user/.rv/ and cache directories once per day (on the first rv command of the day). this runs in the background and adds no latency to your commands.

enabled by default. to disable:

[scratch_keepalive]
enabled = false
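the keepalive idea (refresh timestamps at most once a day) can be sketched with a stamp file. the stamp file, the 24-hour window standing in for "first rv command of the day", and the function name are all assumptions for illustration.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 24 * 60 * 60


def keepalive(roots: List[Path], stamp: Path, now: Optional[float] = None) -> bool:
    """Touch every file under roots unless a run happened within the last day.

    Returns True if files were touched, False if the run was skipped.
    """
    now = time.time() if now is None else now
    if stamp.exists() and now - stamp.stat().st_mtime < SECONDS_PER_DAY:
        return False

    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    # Refresh both access and modification times.
                    os.utime(os.path.join(dirpath, name), (now, now))
                except OSError:
                    pass  # file vanished or is a dangling symlink

    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.touch()
    os.utime(stamp, (now, now))  # record when this run happened
    return True
```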

shared storage

if you're part of a lab group with access to persistent group storage (/standard/ or /project/), rv can use it as a shared HuggingFace model cache. this avoids every lab member downloading the same large models to their own scratch.

rv init automatically detects group directories via hdquota and offers to set this up. if you already have models in /scratch/user/.cache/huggingface, rv will offer to migrate them to the shared location. rv also checks the shared filesystem's capacity — if it's over 80% full, you'll see a warning before proceeding.

you can also configure it manually:

[shared]
hf_cache = "/standard/mygroup/.cache/huggingface"

the shared directory is created with group-writable setgid permissions (chmod g+rwxs) so all lab members can read and write models. group storage is persistent — not subject to the 90-day scratch purge. when active, both ~/.cache/huggingface and /scratch/user/.cache/huggingface are symlinked to the shared location.
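the 80% capacity warning mentioned above amounts to a simple threshold check. the sketch uses shutil.disk_usage as a stand-in; rv itself is described as detecting group storage via hdquota, so the measurement method and the function names here are assumptions.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def should_warn(fraction_used: float, threshold: float = 0.80) -> bool:
    """True when the filesystem is more than 80% full (the default threshold)."""
    return fraction_used > threshold
```

for example, should_warn(usage_fraction("/standard/mygroup")) would be evaluated before offering a migration.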

paths

rv organizes remote files under your scratch directory:

path	purpose
/scratch/user/.rv/	rv home directory
/scratch/user/.rv/logs/	job output logs
.../{jobName}-{jobId}.{out,err}	single-node log files
.../{jobName}-{jobId}.node{N}.{out,err}	per-node log files (multi-node jobs)
/scratch/user/.rv/outputs/	persistent job output files (RV_OUTPUT_DIR)
/scratch/user/.rv/checkpoints/	persistent checkpoint files (RV_CHECKPOINT_DIR)
.../{jobName}/	per-name checkpoint directory (shared across runs)
/scratch/user/.rv/env/	environment variable files
/scratch/user/.rv/envs/{project}/{branch}/	per-project, per-branch Python venv
/scratch/user/rv-workspaces/{project}/{branch}/	per-project, per-branch workspace root
.../code/	mutable workspace (sync target)
.../snapshots/	per-job immutable snapshots
/scratch/user/.cache/huggingface/	HuggingFace model cache (HF_HOME)
/scratch/user/.cache/uv/	uv package cache (UV_CACHE_DIR)
/scratch/user/.cache/pip/	pip package cache (PIP_CACHE_DIR)
/scratch/user/.cache/wandb/	Weights & Biases cache (WANDB_DIR, WANDB_DATA_DIR)
/scratch/user/.cache/triton/	Triton kernel cache (TRITON_CACHE_DIR)
/scratch/user/.cache/torch/	PyTorch hub cache (TORCH_HOME)

scratch storage is high-performance (Weka filesystem, ~1.5 GB/s write). files are not backed up and are subject to a 90-day purge policy. ~/.cache/{huggingface,uv,pip,wandb,triton,torch} are symlinked to their scratch equivalents by rv init, so all tools share one cache location on scratch.

workspace isolation

rv workspaces are git-aware. when you run rv run or rv sync push from a git repository, rv automatically detects your current branch and organizes remote files by project and branch:

.../rv-workspaces/myproject/main/code/          ← branch "main"
.../rv-workspaces/myproject/feature--foo/code/   ← branch "feature/foo"

switching branches and running rv run won't overwrite code from a different branch.

snapshots

each rv run job gets its own immutable snapshot of the code directory. snapshots use hardlinks, so creating one is near-instant and takes no extra disk space until files change. this prevents a subsequent run or sync from corrupting a running job's files. snapshots older than 7 days are automatically pruned.
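a hardlink snapshot can be sketched in a few lines. this is an illustration of the technique, not rv's code; the function name is hypothetical and both directories must live on the same filesystem.

```python
import os
from pathlib import Path


def snapshot(code_dir: Path, snapshot_dir: Path) -> None:
    """Mirror code_dir into snapshot_dir using hardlinks instead of copies.

    Directories are recreated; each file becomes a second name for the same
    data on disk, so no file contents are duplicated.
    """
    for dirpath, _dirnames, filenames in os.walk(code_dir):
        rel = Path(dirpath).relative_to(code_dir)
        (snapshot_dir / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            os.link(Path(dirpath) / name, snapshot_dir / rel / name)
```

a hardlinked snapshot stays intact only if later syncs replace files (write a new file, then rename it over the old one) rather than modifying them in place. rsync does this by default unless --inplace is used.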

sync behavior

when syncing from a git repo, rv only transfers files that git does not ignore: committed, staged, and untracked files that are not ignored. this is faster than rsync filtering and ensures only relevant files are synced. if git is not available, rv falls back to .gitignore and .rvignore filtering.

non-git projects

projects without a .git directory use _default as the branch name. snapshots and venvs still work the same way.

branch name sanitization

branch names are sanitized for filesystem safety: / becomes --, unsafe characters are stripped, and names are truncated to 80 characters. detached HEAD states use detached-{commitHash}.
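the sanitization rules above can be sketched as follows. the set of "safe" characters (letters, digits, dot, underscore, hyphen) is an assumption, since the exact set rv keeps is not listed; the function names are hypothetical.

```python
import re

MAX_BRANCH_LENGTH = 80


def sanitize_branch(branch: str) -> str:
    """Turn a git branch name into a filesystem-safe directory name."""
    name = branch.replace("/", "--")
    # Assumed safe set: anything outside it is stripped.
    name = re.sub(r"[^A-Za-z0-9._-]", "", name)
    return name[:MAX_BRANCH_LENGTH]


def detached_name(commit_hash: str) -> str:
    """Directory name used for a detached HEAD."""
    return f"detached-{commit_hash}"
```

with these rules, feature/foo maps to feature--foo, matching the workspace layout shown earlier.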

multi-node logging

multi-node jobs (4+ GPUs across multiple nodes) produce per-node log files. each node's stdout and stderr are written to separate files:

rv-job-12345.node0.out   # node 0 stdout
rv-job-12345.node0.err   # node 0 stderr
rv-job-12345.node1.out   # node 1 stdout
rv-job-12345.node1.err   # node 1 stderr

single-node jobs are unchanged — they use a single .out/.err pair as before.

rv logs automatically detects per-node files and shows merged output with [node0], [node1] prefixes. use --node <index> to filter to a specific node:

rv logs 12345              # merged view (all nodes)
rv logs 12345 --node 0     # node 0 only
rv logs 12345 --node 1     # node 1 only
rv logs 12345 --err        # stderr from all nodes

rv logs --pull downloads all per-node files. rv logs -f streams per-node output in real time.
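if you post-process pulled log files yourself, the naming scheme is easy to parse. the sketch below is not part of rv: the regular expression is inferred from the {jobName}-{jobId}.node{N}.{out,err} pattern, and the exact spacing of the merged-view prefix is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Matches both "name-12345.out" and "name-12345.node0.out".
LOG_NAME = re.compile(
    r"^(?P<job_name>.+)-(?P<job_id>\d+)(?:\.node(?P<node>\d+))?\.(?P<stream>out|err)$"
)


class LogFile(NamedTuple):
    job_name: str
    job_id: str
    node: Optional[int]  # None for single-node logs
    stream: str          # "out" or "err"


def parse_log_name(filename: str) -> Optional[LogFile]:
    """Parse an rv log file name, returning None if it does not match."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    node = match.group("node")
    return LogFile(
        match.group("job_name"),
        match.group("job_id"),
        int(node) if node is not None else None,
        match.group("stream"),
    )


def prefix_line(node: int, line: str) -> str:
    """Tag a line the way the merged view labels nodes (spacing assumed)."""
    return f"[node{node}] {line}"
```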