configuration
rv stores its configuration in ~/.rv/config.toml. the file is created automatically by rv init.
config file
full example of ~/.rv/config.toml:
[connection]
host = "rivanna"
user = "abc1de"
hostname = "rivanna.hpc.virginia.edu"
[defaults]
account = "mygroup"
gpu_type = "a6000"
time = "2:59:00"
partition = "gpu-a6000"
ai_naming = true
ai_api_key = "sk-..."
[paths]
scratch = "/scratch/abc1de"
home = "/home/abc1de"
[notifications]
enabled = true
email = "abc1de@virginia.edu"
token = "..." # auto-generated by rv init
[scratch_keepalive]
enabled = true # default: true
[shared]
hf_cache = "/standard/mygroup/.cache/huggingface" # optional
defaults
default values used when flags are not specified on the command line:
| key | description | example |
|---|---|---|
| account | Slurm account (allocation group) | mygroup |
| gpu_type | default GPU type when --type is omitted | a6000 |
| time | default walltime | 2:59:00 |
| partition | default Slurm partition | gpu-a6000 |
notifications
rv can email you when jobs complete or fail. notifications are sent to your @virginia.edu address via HMAC-signed webhooks.
enable during rv init or set manually in config.toml:
[notifications]
enabled = true
email = "abc1de@virginia.edu"
notification events: COMPLETED, FAILED, TIMEOUT, RESUBMITTED (checkpoint-restart).
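to show what "HMAC-signed" means in practice, here is a minimal sketch using Python's hmac module. the hash algorithm (SHA-256), the JSON payload shape, and the function names are assumptions; the docs above only state that webhooks are HMAC-signed with the token from config.toml.

```python
import hashlib
import hmac
import json


def sign_payload(token: str, payload: dict) -> tuple[bytes, str]:
    """Serialize a payload and return (body, hex HMAC-SHA256 signature)."""
    # Assumed wire format: compact JSON with sorted keys, so both sides hash identical bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    signature = hmac.new(token.encode(), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_payload(token: str, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(token.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

the receiver accepts a notification only if verify_payload returns True, which shows the sender holds the same token.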
ai naming
rv can auto-generate descriptive job names using an AI model. set your API key in config.toml — rv auto-detects the provider from the key prefix:
| key prefix | provider |
|---|---|
| sk-ant- | Anthropic |
| sk- | OpenAI |
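the prefix table implies an ordering detail: every Anthropic key also starts with sk-, so the longer prefix has to be checked first. a minimal sketch, where detect_provider is a hypothetical name and not rv's API:

```python
from typing import Optional


def detect_provider(api_key: str) -> Optional[str]:
    """Map an API key to a provider name using the prefix table above."""
    # Check the longer prefix first: "sk-ant-" also matches the plain "sk-" prefix.
    if api_key.startswith("sk-ant-"):
        return "Anthropic"
    if api_key.startswith("sk-"):
        return "OpenAI"
    return None  # unknown prefix
```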
[defaults]
ai_naming = true
ai_api_key = "sk-ant-..."
environment variables
use rv env to manage variables that are injected into every job. useful for API keys and other secrets. env vars are global — use config files (Hydra, argparse) for experiment-specific settings.
rv env import .env
rv env set HF_TOKEN hf_abc123...
rv env list
rv env rm HF_TOKEN
rv also auto-sets these variables in every job: OMP_NUM_THREADS, TOKENIZERS_PARALLELISM, HF_HOME, VLLM_CACHE_DIR, WANDB_DIR, WANDB_DATA_DIR, TRITON_CACHE_DIR, TORCH_HOME, RV_CHECKPOINT_DIR, RV_OUTPUT_DIR.
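a job script can read the auto-set variables directly. the sketch below picks up RV_CHECKPOINT_DIR and RV_OUTPUT_DIR; the local fallback directories and the job_dirs helper are assumptions added so the same script also runs outside rv.

```python
import os
from pathlib import Path


def job_dirs(env=os.environ) -> tuple[Path, Path]:
    """Return (checkpoint_dir, output_dir) for the current job."""
    # The "./checkpoints" and "./outputs" fallbacks are assumptions for local runs;
    # inside an rv job both variables are set automatically.
    checkpoint_dir = Path(env.get("RV_CHECKPOINT_DIR", "./checkpoints"))
    output_dir = Path(env.get("RV_OUTPUT_DIR", "./outputs"))
    return checkpoint_dir, output_dir


if __name__ == "__main__":
    checkpoint_dir, output_dir = job_dirs()
    print(f"checkpoints -> {checkpoint_dir}, outputs -> {output_dir}")
```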
dependencies
rv auto-detects requirements.txt or pyproject.toml in your project and manages dependencies automatically. you don't need to install packages manually or manage virtual environments.
for a comprehensive guide covering the venv lifecycle, relative vs absolute paths, two-phase install, torchrun integration, and troubleshooting, see dependencies & environment.
how it works
- rv creates a persistent venv at /scratch/user/.rv/envs/{project}/{branch}/
- installs dependencies using uv pip install (not uv sync; these are different uv workflows)
- the venv is activated automatically in every job via source .../bin/activate
your script runs with the venv already active — use bare python, not uv run python.
two-phase install
most packages install on the login node (fast). packages needing CUDA compilation (flash-attn, auto-gptq) are automatically deferred to the compute node where GPU + gcc are available. this happens transparently — no configuration needed.
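the split can be pictured as partitioning the requirements list into two groups. this is only a sketch of the idea: how rv actually decides which packages need CUDA compilation is not documented here, so the fixed set below (seeded with the two packages named above) and the helper names are assumptions.

```python
import re

# Assumed lookup set; only these two names come from the docs above.
CUDA_BUILD_PACKAGES = {"flash-attn", "auto-gptq"}


def requirement_name(line: str) -> str:
    """Extract a normalized package name from one requirements.txt line."""
    name = re.split(r"[<>=!~\[;@ ]", line.strip(), maxsplit=1)[0]
    # Treat "-", "_" and "." as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", name).lower()


def split_requirements(lines):
    """Return (login_node_reqs, compute_node_reqs), skipping blanks and comments."""
    login, compute = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        is_cuda_build = requirement_name(stripped) in CUDA_BUILD_PACKAGES
        (compute if is_cuda_build else login).append(stripped)
    return login, compute
```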
additional packages
if you need packages beyond your deps file (e.g., optional extras), add pip install to your script. it installs into rv's active venv and persists across runs:
# in your job script
pip install sae-lens bitsandbytes # installs into rv's venv, cached for next run
python train.py
what NOT to do
- don't use uv sync or uv run: these create a separate .venv and conflict with rv's venv
- don't unset VIRTUAL_ENV: rv's activation is correct
- don't create your own venv: rv manages this per-branch
shell-string commands (e.g., rv run "make train") skip dependency management entirely: rv treats them as opaque and you own your environment. however, commands starting with python or python3 (e.g., rv run python -c "...") still get deps installed and the venv activated.
cache symlinks
rv init automatically creates symlinks from ~/.cache/{name} to /scratch/user/.cache/{name} for all managed cache directories (huggingface, uv, pip, wandb, triton, torch). this ensures that any tool — not just rv-managed jobs — writes caches to scratch instead of filling your home directory quota.
if ~/.cache/{name} already exists as a real directory, rv migrates the contents to scratch and replaces it with a symlink. re-running rv init updates symlinks without data loss.
this prevents "Disk quota exceeded" errors from ~/.cache/ bloat. the scratch keepalive protects these directories from the 90-day purge.
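the migrate-then-symlink step described above can be sketched as follows. link_cache_to_scratch is a hypothetical helper, not rv's implementation, and it ignores name collisions between files already on scratch and files being migrated.

```python
import shutil
from pathlib import Path


def link_cache_to_scratch(home_cache: Path, scratch_cache: Path) -> None:
    """Make home_cache a symlink to scratch_cache, migrating existing contents."""
    scratch_cache.mkdir(parents=True, exist_ok=True)
    if home_cache.is_symlink():
        # Re-run: drop the old link only; the data on scratch is untouched.
        home_cache.unlink()
    elif home_cache.is_dir():
        # Real directory: move every entry to scratch, then remove the empty directory.
        for entry in home_cache.iterdir():
            shutil.move(str(entry), str(scratch_cache / entry.name))
        home_cache.rmdir()
    home_cache.parent.mkdir(parents=True, exist_ok=True)
    home_cache.symlink_to(scratch_cache, target_is_directory=True)
```

calling it a second time on the same pair only recreates the link, which mirrors the "re-running rv init updates symlinks without data loss" behavior.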
scratch keepalive
Rivanna's scratch filesystem has a 90-day purge policy: files not accessed in 90 days are automatically deleted. left alone, this would delete your venvs, environment variable files, and caches.
rv automatically prevents this by touching all files under /scratch/user/.rv/ and cache directories once per day (on the first rv command of the day). this runs in the background and adds no latency to your commands.
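a minimal sketch of the idea, assuming a stamp file records the last run and a rolling 24-hour window stands in for "once per day". keepalive, the stamp file, and the now parameter are hypothetical; rv's real mechanism may differ.

```python
import os
import time
from pathlib import Path

ONE_DAY = 24 * 60 * 60  # seconds


def keepalive(root: Path, stamp: Path, now=None) -> int:
    """Refresh timestamps under root at most once per day; return files touched."""
    now = time.time() if now is None else now
    if stamp.exists() and now - stamp.stat().st_mtime < ONE_DAY:
        return 0  # already refreshed within the last 24 hours
    touched = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # Set both access and modification time to "now".
                os.utime(os.path.join(dirpath, name), (now, now))
                touched += 1
            except OSError:
                pass  # file vanished or is not writable; skip it
    stamp.touch()
    os.utime(stamp, (now, now))  # record when this pass ran
    return touched
```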
enabled by default. to disable:
[scratch_keepalive]
enabled = false
paths
rv organizes remote files under your scratch directory:
| path | purpose |
|---|---|
| /scratch/user/.rv/ | rv home directory |
| /scratch/user/.rv/logs/ | job output logs |
| .../{jobName}-{jobId}.{out,err} | single-node log files |
| .../{jobName}-{jobId}.node{N}.{out,err} | per-node log files (multi-node jobs) |
| /scratch/user/.rv/outputs/ | persistent job output files (RV_OUTPUT_DIR) |
| /scratch/user/.rv/checkpoints/ | persistent checkpoint files (RV_CHECKPOINT_DIR) |
| .../{jobName}/ | per-name checkpoint directory (shared across runs) |
| /scratch/user/.rv/env/ | environment variable files |
| /scratch/user/.rv/envs/{project}/{branch}/ | per-project, per-branch Python venv |
| /scratch/user/rv-workspaces/{project}/{branch}/ | per-project, per-branch workspace root |
| .../code/ | mutable workspace (sync target) |
| .../snapshots/ | per-job immutable snapshots |
| /scratch/user/.cache/huggingface/ | HuggingFace model cache (HF_HOME) |
| /scratch/user/.cache/uv/ | uv package cache (UV_CACHE_DIR) |
| /scratch/user/.cache/pip/ | pip package cache (PIP_CACHE_DIR) |
| /scratch/user/.cache/wandb/ | Weights & Biases cache (WANDB_DIR, WANDB_DATA_DIR) |
| /scratch/user/.cache/triton/ | Triton kernel cache (TRITON_CACHE_DIR) |
| /scratch/user/.cache/torch/ | PyTorch hub cache (TORCH_HOME) |
scratch storage is high-performance (Weka filesystem, ~1.5 GB/s write). files are not backed up and subject to a 90-day purge policy. ~/.cache/{huggingface,uv,pip,wandb,triton,torch} are symlinked to their scratch equivalents by rv init, so all tools share one cache location on scratch.
workspace isolation
rv workspaces are git-aware. when you run rv run or rv sync push from a git repository, rv automatically detects your current branch and organizes remote files by project and branch:
.../rv-workspaces/myproject/main/code/ ← branch "main"
.../rv-workspaces/myproject/feature--foo/code/ ← branch "feature/foo"
switching branches and running rv run won't overwrite code from a different branch.
snapshots
each rv run job gets its own immutable snapshot of the code directory. snapshots use hardlinks, making them instant and zero-cost until files change. this prevents a subsequent run or sync from corrupting a running job's files. snapshots older than 7 days are automatically pruned.
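to make the hardlink idea concrete, here is a small sketch that mirrors a directory tree with os.link instead of copying. the snapshot function is hypothetical and not rv's code; it requires source and snapshot to live on the same filesystem.

```python
import os
from pathlib import Path


def snapshot(code_dir: Path, snapshot_dir: Path) -> None:
    """Mirror code_dir into snapshot_dir using hardlinks instead of copies."""
    for dirpath, _dirnames, filenames in os.walk(code_dir):
        rel = Path(dirpath).relative_to(code_dir)
        (snapshot_dir / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            # A hardlink is a second name for the same data: no bytes are copied.
            os.link(Path(dirpath) / name, snapshot_dir / rel / name)
```

a hardlinked snapshot stays intact only when later updates replace files (write a new file, then rename it over the old one) rather than modify them in place. this sketch assumes the sync step behaves that way.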
sync behavior
when syncing from a git repo, rv transfers only the files git knows about or would track: staged files, committed files, and untracked files that are not ignored. this is faster than rsync filtering and ensures only relevant files are synced. if git is not available, rv falls back to .gitignore and .rvignore filtering.
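one way to obtain such a file list is git ls-files with --cached (files in the index), --others (untracked files), and --exclude-standard (apply the usual ignore rules). whether rv uses this exact command is an assumption; files_to_sync and parse_ls_files are hypothetical helpers.

```python
import subprocess

# -z separates paths with NUL bytes so names with spaces or newlines stay intact.
GIT_LS_FILES = ["git", "ls-files", "--cached", "--others", "--exclude-standard", "-z"]


def parse_ls_files(raw: bytes) -> list[str]:
    """Turn NUL-separated git output into a sorted, de-duplicated path list."""
    return sorted({path.decode() for path in raw.split(b"\0") if path})


def files_to_sync(repo_dir: str) -> list[str]:
    """List tracked files plus untracked files that are not ignored."""
    result = subprocess.run(GIT_LS_FILES, cwd=repo_dir, check=True, capture_output=True)
    return parse_ls_files(result.stdout)
```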
non-git projects
projects without a .git directory use _default as the branch name. snapshots and venvs still work the same way.
branch name sanitization
branch names are sanitized for filesystem safety: / becomes --, unsafe characters are stripped, and names are truncated to 80 characters. detached HEAD states use detached-{commitHash}.
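the rules above can be sketched in a few lines. the exact set of "unsafe" characters is not specified in these docs, so the allow-list below (letters, digits, dot, underscore, hyphen) is an assumption, as is the function name.

```python
import re

MAX_LEN = 80


def sanitize_branch(name: str) -> str:
    """Apply the documented rules: '/' -> '--', strip unsafe chars, truncate to 80."""
    name = name.replace("/", "--")
    # Assumed safe set: ASCII letters, digits, dot, underscore, hyphen.
    name = re.sub(r"[^A-Za-z0-9._-]", "", name)
    return name[:MAX_LEN]
```

for example, sanitize_branch("feature/foo") gives "feature--foo", matching the workspace path shown earlier.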
multi-node logging
multi-node jobs (4+ GPUs across multiple nodes) produce per-node log files. each node's stdout and stderr are written to separate files:
rv-job-12345.node0.out # node 0 stdout
rv-job-12345.node0.err # node 0 stderr
rv-job-12345.node1.out # node 1 stdout
rv-job-12345.node1.err # node 1 stderr
single-node jobs are unchanged: they use a single .out/.err pair as before.
rv logs automatically detects per-node files and shows merged output with [node0], [node1] prefixes. use --node <index> to filter to a specific node:
rv logs 12345 # merged view (all nodes)
rv logs 12345 --node 0 # node 0 only
rv logs 12345 --node 1 # node 1 only
rv logs 12345 --err # stderr from all nodes
rv logs --pull downloads all per-node files. rv logs -f streams per-node output in real time.
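for pulled log files, a merged view with node prefixes can be approximated locally. this sketch simply concatenates files in node order; how rv itself interleaves lines is not documented here, and merged_view is a hypothetical helper.

```python
import re
from pathlib import Path

NODE_LOG = re.compile(r"\.node(\d+)\.(out|err)$")


def merged_view(log_dir: Path, job_prefix: str, stream: str = "out", node=None) -> list[str]:
    """Collect per-node log lines, each prefixed with [nodeN]."""
    files = []
    for path in log_dir.glob(f"{job_prefix}.node*.{stream}"):
        match = NODE_LOG.search(path.name)
        # Optionally keep a single node, like the --node filter.
        if match and (node is None or int(match.group(1)) == node):
            files.append((int(match.group(1)), path))
    lines = []
    for index, path in sorted(files):
        for line in path.read_text().splitlines():
            lines.append(f"[node{index}] {line}")
    return lines
```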