dependencies & environment
rv manages Python dependencies and virtual environments automatically. this page explains how it works end-to-end and how to avoid common pitfalls.
how the venv works
when you run rv run python train.py, rv:
- syncs your local project to the cluster via rsync (git-tracked files only)
- creates a persistent venv at /scratch/{user}/.rv/envs/{project}/{branch}/ if one doesn't exist
- installs dependencies from requirements.txt or pyproject.toml using uv pip install
- creates an immutable snapshot of your code (hardlink copy)
- generates a SLURM script that activates the venv, cds into the snapshot, and runs your command
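The hardlink snapshot step can be pictured with a short sketch. This is not rv's implementation; it only illustrates one common way to make a hardlink copy in Python, by passing os.link to shutil.copytree. The helper name make_snapshot and the directory names are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_snapshot(project_dir: Path, snapshot_dir: Path) -> Path:
    """Recreate the directory tree, hardlinking each file instead of copying bytes.

    Hypothetical helper: both paths must live on the same filesystem,
    and snapshot_dir must not exist yet.
    """
    shutil.copytree(project_dir, snapshot_dir, copy_function=os.link)
    return snapshot_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        (project / "configs").mkdir(parents=True)
        (project / "train.py").write_text("print('hi')\n")
        (project / "configs" / "eval.yaml").write_text("batch_size: 8\n")

        snap = make_snapshot(project, Path(tmp) / "snapshots" / "job-0001")
        # samefile is True when both names point at the same inode
        print("hardlinked:", os.path.samefile(project / "train.py", snap / "train.py"))
```

Hardlinked files share their data with the source, so a snapshot like this only stays unchanged if later syncs replace files rather than edit them in place, which is what rsync does by default.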
the generated job script looks roughly like this:
#!/bin/bash
#SBATCH ...
# load Python 3.12 + CUDA
module load cuda/12.8.0 miniforge/24.11.3-py3.12
# activate rv's venv
source /scratch/{user}/.rv/envs/{project}/{branch}/bin/activate
# set environment variables (caches, output dirs, etc.)
export OMP_NUM_THREADS=...
export HF_HOME=...
export RV_OUTPUT_DIR=...
export RV_CHECKPOINT_DIR=...
# cd into the snapshot
cd /scratch/{user}/rv-workspaces/{project}/{branch}/snapshots/{jobName}-{timestamp}/
# run your command
python train.py
python resolves to the venv's Python 3.12 (not the system's Python 3.6); torchrun, accelerate, deepspeed, and other entry points installed by your dependencies are on PATH; and relative paths resolve against the snapshot directory containing your synced project files.
use relative paths for scripts
your command runs from within the workspace snapshot, which is a copy of your project. use relative paths for scripts, config files, and anything that's part of your project:
# correct — relative paths resolve within the snapshot
rv run -t a100 -- torchrun --nproc_per_node=4 train.py
rv run python eval.py --config configs/eval.yaml
rv run python -m mypackage.train
# WRONG: absolute paths bypass the workspace and may bypass the venv
rv run torchrun /scratch/user/sft/train_sft.py
rv run python /scratch/user/some_script.py
why absolute script paths are dangerous:
- the script may import packages that only exist in the venv. if torchrun or python resolves to a system binary, those imports fail
- the script's working directory expectations break: relative paths inside the script won't find your config files
- rv's file sync and snapshot system is designed so your command runs from a complete copy of your project. using absolute paths to scripts elsewhere defeats this
absolute paths are fine for data — reading datasets, model weights, or other files on /scratch/ or /standard/ from within your script is expected:
# this is fine — data paths can be absolute
model = AutoModel.from_pretrained("/scratch/user/.cache/huggingface/models/...")
dataset = load_dataset("json", data_files="/scratch/user/data/train.jsonl")
how torchrun works with rv
rv doesn't do anything magic with torchrun. it just ensures the venv is activated before your command runs, which puts the venv's torchrun on PATH.
# what PATH looks like inside the job:
/scratch/{user}/.rv/envs/{project}/{branch}/bin ← venv (torchrun, python, pip, etc.)
/apps/software/.../miniforge/24.11.3-py3.12/bin ← module-loaded Python
/usr/bin ← system (Python 3.6, DO NOT USE)
since torch is installed in the venv (from your requirements.txt or pyproject.toml), torchrun is in the venv's bin/ directory and resolves first.
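The "resolves first" behaviour is ordinary PATH lookup: directories are searched in order and the first executable match wins. A minimal sketch using shutil.which, with temporary directories standing in for the venv's bin/ and /usr/bin (make_fake_tool and resolve are hypothetical helper names; assumes a POSIX system):

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path
from typing import List, Optional


def make_fake_tool(directory: Path, name: str) -> Path:
    """Create a small executable file so shutil.which can find it."""
    directory.mkdir(parents=True, exist_ok=True)
    tool = directory / name
    tool.write_text("#!/bin/sh\n")
    tool.chmod(tool.stat().st_mode | stat.S_IXUSR)
    return tool


def resolve(name: str, path_dirs: List[Path]) -> Optional[str]:
    """Return the first match for name, searching path_dirs in order."""
    search_path = os.pathsep.join(str(d) for d in path_dirs)
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        venv_bin = Path(tmp) / "venv" / "bin"    # stand-in for the rv venv's bin/
        system_bin = Path(tmp) / "usr" / "bin"   # stand-in for /usr/bin
        make_fake_tool(venv_bin, "torchrun")
        make_fake_tool(system_bin, "torchrun")
        # the venv directory is listed first, so its torchrun is returned
        print(resolve("torchrun", [venv_bin, system_bin]))
```

Inside a real job, shutil.which("torchrun") with no path argument reports which binary the shell would pick up.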
rv also auto-injects --master-port=$MASTER_PORT (a per-job unique port) to prevent collisions when multiple jobs land on the same node. for multi-node jobs, rv additionally sets --nnodes, --node-rank, and --master-addr.
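How rv picks the per-job port is not described here. As an illustration only, one common approach is to map the SLURM job ID into a fixed port range; the function name, base port, and range size below are assumptions, not rv's values.

```python
def port_for_job(job_id: int, base: int = 20000, span: int = 20000) -> int:
    """Map a job ID to a port in [base, base + span).

    Hypothetical scheme: two jobs only collide if their IDs differ by a
    multiple of span, which is unlikely for jobs running at the same time.
    """
    return base + (job_id % span)


if __name__ == "__main__":
    for job_id in (123456, 123457):
        print(job_id, port_for_job(job_id))
```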
# single-node multi-GPU
rv run -g 2 -t a6000 -- torchrun --nproc_per_node=2 train.py
# multi-node (rv handles srun + torchrun coordination)
rv run -g 8 -t a100 -- torchrun --nproc_per_node=4 train.py
deepspeed configs can use relative paths since the cwd is the workspace snapshot:
rv run -g 4 -t a100 -- deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
two-phase dependency install
rivanna's login node has no GPU and an older compiler toolchain. most Python packages install fine there, but some (flash-attn, auto-gptq, mamba-ssm, triton kernels) need CUDA or a modern GCC to compile. rv handles this with a two-phase strategy:
phase 1 (login node, runs during rv run before job submission)
- creates the venv using uv venv with module-loaded Python 3.12
- runs uv pip install -r requirements.txt (or -e . for pyproject.toml)
- if any package fails (typically CUDA-dependent ones), falls back to per-package install, skipping failures
- writes a .needs-phase2 marker if any packages were skipped
phase 2 (compute node, runs at job start before your command)
- only runs if .needs-phase2 exists
- loads gcc/11.4.0 for compilation
- retries the full install with --no-build-isolation (so packages can find CUDA, torch, etc.)
- removes the marker on success
this means CUDA-dependent packages install the first time a job runs on a GPU node. the venv is persistent, so subsequent jobs skip this step (unless your deps file changes).
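The control flow of the two phases can be sketched as below. This is a simplified model, not rv's code: the install callable is a hypothetical stand-in for whatever actually runs uv pip install, and the marker is written as an empty file because its real contents are not described here.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

MARKER_NAME = ".needs-phase2"  # marker file name taken from the text above


def phase1_install(
    packages: List[str],
    venv_dir: Path,
    install: Callable[[List[str]], bool],
) -> List[str]:
    """Try one bulk install; if it fails, install one package at a time.

    Returns the packages that were skipped and leaves a marker if any were.
    """
    if install(packages):
        return []
    skipped = [pkg for pkg in packages if not install([pkg])]
    if skipped:
        (venv_dir / MARKER_NAME).touch()
    return skipped


def phase2_install(
    packages: List[str],
    venv_dir: Path,
    install: Callable[[List[str]], bool],
) -> bool:
    """Retry the full install only if phase 1 left a marker; clear it on success."""
    marker = venv_dir / MARKER_NAME
    if not marker.exists():
        return True  # nothing was skipped, so there is nothing to retry
    if install(packages):
        marker.unlink()
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        venv = Path(tmp)
        deps = ["torch", "numpy", "flash-attn"]
        # fake login node: anything involving flash-attn fails to build
        login_node = lambda pkgs: "flash-attn" not in pkgs
        # fake compute node: everything builds
        compute_node = lambda pkgs: True
        print("skipped in phase 1:", phase1_install(deps, venv, login_node))
        print("phase 2 ok:", phase2_install(deps, venv, compute_node))
```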
what triggers a reinstall
- rv hashes your deps file (SHA-256). if the hash changes, it reinstalls
- deleting the venv manually: rv exec "rm -rf /scratch/$USER/.rv/envs/{project}/{branch}"
- first run on a new branch (each branch gets its own venv)
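The hash check in the first trigger can be illustrated with hashlib. Where rv stores the previous hash is not stated here, so the stamp file below is a hypothetical stand-in, as are the function names.

```python
import hashlib
import tempfile
from pathlib import Path


def deps_hash(deps_file: Path) -> str:
    """SHA-256 of the deps file's exact bytes, as a hex string."""
    return hashlib.sha256(deps_file.read_bytes()).hexdigest()


def needs_reinstall(deps_file: Path, stamp_file: Path) -> bool:
    """True when no hash has been recorded yet or the recorded hash differs."""
    if not stamp_file.exists():
        return True
    return stamp_file.read_text().strip() != deps_hash(deps_file)


def record_install(deps_file: Path, stamp_file: Path) -> None:
    """Remember the hash of the deps file that was just installed."""
    stamp_file.write_text(deps_hash(deps_file) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        deps = Path(tmp) / "requirements.txt"
        stamp = Path(tmp) / ".deps-hash"  # hypothetical location
        deps.write_text("torch\n")
        print(needs_reinstall(deps, stamp))   # no stamp yet
        record_install(deps, stamp)
        print(needs_reinstall(deps, stamp))   # unchanged file
        deps.write_text("torch\nsae-lens\n")
        print(needs_reinstall(deps, stamp))   # file edited
```

Because the hash covers raw bytes, even a whitespace or comment change in the deps file counts as a change under this scheme.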
adding extra dependencies
for packages beyond what's in your deps file:
# option 1: add to your requirements.txt or pyproject.toml (recommended)
# rv reinstalls automatically when the file changes
# option 2: inline pip install in a wrapper script
rv run python -c "
import subprocess, sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sae-lens'])
from my_module import train
train()
"
# option 3: pip install in a shell-wrapped command
rv run "pip install sae-lens && python train.py"
option 1 is strongly preferred because the packages persist across runs and are part of your git-tracked project.
shell commands and dependency management
rv only auto-manages dependencies when the command is a direct Python invocation: python, python3, or a launcher such as torchrun, as the table below shows. commands wrapped in a shell string are treated as opaque:
| command | deps managed? | venv active? |
|---|---|---|
| rv run python train.py | yes | yes |
| rv run python -m torch.distributed.run train.py | yes | yes |
| rv run torchrun train.py | yes | yes |
| rv run "bash train.sh" | no | yes |
| rv run "make train" | no | yes |
the venv is always activated regardless of command type. the difference is whether rv runs uv pip install before submitting. if you use shell commands, your deps file is ignored but the venv's already-installed packages are still available.
the system Python is ancient
rivanna's system Python is 3.6 (/usr/bin/python3). it cannot run modern ML code. rv loads Python 3.12 via module load miniforge/24.11.3-py3.12 and creates the venv from that.
if you see errors like:
SyntaxError: f-string expression part cannot include a backslash
ModuleNotFoundError: No module named 'dataclasses'
you're running on the system Python. this usually means the venv wasn't activated; see troubleshooting below.
quick verification pattern
before burning a real GPU allocation, verify your environment on a free MIG slice:
# test_env.py
import sys
print(f"Python: {sys.executable}")
print(f"Version: {sys.version}")
# verify key imports
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
# add your project's critical imports
from transformers import AutoModel
print("All imports OK")
rv run --mig --name test-deps python test_env.py
rv logs -f test-deps
the output should show:
- Python executable under /scratch/{user}/.rv/envs/... (not /usr/bin/python3)
- Python 3.12.x
- CUDA available: True
troubleshooting
ModuleNotFoundError
check which Python is running by adding this to your script temporarily:
import sys
print(f"executable: {sys.executable}", flush=True)
print(f"version: {sys.version}", flush=True)
| sys.executable shows | cause | fix |
|---|---|---|
| /usr/bin/python3 | system Python, venv not activated | use rv run, not rv exec. use relative paths |
| ~/.local/bin/python | user-installed Python | same as above |
| /scratch/.../.rv/envs/.../bin/python | correct venv, package missing | add package to requirements.txt |
the most common cause: using an absolute path to a script outside the workspace, or running directly on the cluster instead of through rv run.
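The table can also be turned into a small self-check that a script prints at startup. The path patterns below are taken from the table; the function name diagnose and the exact wording of the messages are illustrative.

```python
import sys


def diagnose(executable: str) -> str:
    """Classify an interpreter path using the patterns from the table above."""
    if "/.rv/envs/" in executable:
        return "rv venv: interpreter is correct, so the package is missing from your deps file"
    if executable.startswith("/usr/bin/"):
        return "system Python: venv not activated"
    if "/.local/bin/" in executable:
        return "user-installed Python: venv not activated"
    return "unrecognized interpreter"


if __name__ == "__main__":
    print(f"{sys.executable}: {diagnose(sys.executable)}", flush=True)
```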
GLIBCXX_3.4.29 not found
the system's libstdc++ is too old for packages compiled against newer GCC. set LD_LIBRARY_PATH to include a newer GCC's libraries:
rv env set LD_LIBRARY_PATH /apps/software/standard/core/gcc/14.2.0/lib64
this persists across all future jobs. verify the path exists first: rv exec "ls /apps/software/standard/core/gcc/14.2.0/lib64/libstdc++.so*"
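Listing the file confirms that it exists, but not that it provides the symbol version named in the error. A rough sketch of that second check, similar in spirit to grepping the output of strings: it scans the library's bytes for the version tag. The function name is hypothetical, and the script would need to run on the cluster against the real shared object.

```python
import tempfile
from pathlib import Path


def provides_glibcxx(lib_path: Path, version: str = "3.4.29") -> bool:
    """Return True if the library file contains the GLIBCXX_<version> tag.

    Version tags are stored as NUL-terminated strings, so matching the
    trailing NUL keeps "3.4.2" from matching "3.4.29".
    """
    tag = b"GLIBCXX_" + version.encode("ascii") + b"\x00"
    return tag in Path(lib_path).read_bytes()


if __name__ == "__main__":
    # demo on a fake file; on the cluster, pass the real libstdc++ path instead
    with tempfile.TemporaryDirectory() as tmp:
        fake_lib = Path(tmp) / "libstdc++.so.6"
        fake_lib.write_bytes(b"\x00GLIBCXX_3.4.28\x00GLIBCXX_3.4.29\x00")
        print(provides_glibcxx(fake_lib, "3.4.29"))
        print(provides_glibcxx(fake_lib, "3.4.30"))
```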
phase 2 install fails
CUDA-dependent packages fail on the compute node. check stderr:
rv logs <id> --err
- pin a compatible version: flash-attn==2.5.8 instead of flash-attn
- ensure torch is listed before CUDA-dependent packages in requirements.txt (phase 2 needs torch importable)
- try a pre-built wheel: add --find-links to a pip install in your script
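The ordering advice above can be checked mechanically before submitting a job. A minimal sketch, assuming a plain requirements.txt with one package per line; the CUDA_DEPENDENT set only contains the examples named earlier on this page, and the parsing is deliberately simple (it does not handle URLs or every requirement syntax).

```python
import re
from typing import List

# examples named earlier on this page; extend for your own project
CUDA_DEPENDENT = {"flash-attn", "auto-gptq", "mamba-ssm"}


def package_names(requirements_text: str) -> List[str]:
    """Extract normalized package names, in file order."""
    names = []
    for raw in requirements_text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(("#", "-")):
            continue  # skip blank lines, comment lines, and pip options
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # treat flash_attn, flash.attn, and Flash-Attn as the same name
            names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def misplaced_packages(requirements_text: str) -> List[str]:
    """Return CUDA-dependent packages that appear before torch (or without it)."""
    names = package_names(requirements_text)
    torch_index = names.index("torch") if "torch" in names else len(names)
    return [n for i, n in enumerate(names) if n in CUDA_DEPENDENT and i < torch_index]


if __name__ == "__main__":
    bad = "flash-attn==2.5.8\ntorch>=2.3\n"
    good = "torch>=2.3\nflash-attn==2.5.8\n"
    print(misplaced_packages(bad))
    print(misplaced_packages(good))
```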
stale venv
if you've made significant changes to deps and the venv seems broken:
# delete the venv — next rv run recreates it
rv exec "rm -rf /scratch/$USER/.rv/envs/{project}/{branch}"
# check what exists
rv exec "ls /scratch/$USER/.rv/envs/"
don't manually pip install into the system Python
# WRONG — installs into system Python 3.6, requires sudo, breaks things
rv exec "pip install torch"
rv exec "pip3 install transformers"
rv exec "/usr/bin/pip install ..."
instead: add dependencies to your requirements.txt or pyproject.toml and let rv handle installation. if you need to install something interactively, get a shell on a compute node first:
rv up --mig
# now you're in a shell with the venv activated
pip install some-package # installs into rv's venv