Zero to MNIST
Who this is for
New Betty users (especially faculty and their groups) who want a minimal, working GPU training example using MNIST with Slurm.
0) Prerequisites (Faculty signup + access)
- Faculty (PI) must complete ColdFront Training
- User access – your PI will have to add you to their ColdFront project
Tip: If you can’t submit Slurm jobs later, you’re probably missing an association to a Slurm account. Ask your PI to add you to their ColdFront Project allocation.
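You can check your Slurm account associations yourself with sacctmgr. A quick sketch, assuming the standard Slurm client tools are available on the login node:
# List the Slurm accounts your user is associated with
sacctmgr show associations user=$USER format=Account,User,Partition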
1) SSH into Betty
# Replace <PennKey> with your PennKey username
kinit <PennKey>@UPENN.EDU
ssh <PennKey>@slurm_login.parcc.upenn.edu
Follow up: Check out Logging-In to learn more about the login options available to you.
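To save typing on future logins, you can add a host alias to ~/.ssh/config. This is a minimal sketch, assuming the cluster accepts GSSAPI (Kerberos) authentication as the kinit step above suggests; the alias name betty is arbitrary:
cat >> ~/.ssh/config <<'EOF'
Host betty
    HostName slurm_login.parcc.upenn.edu
    # Replace <PennKey> with your PennKey username
    User <PennKey>
    # Assumption: your Kerberos ticket from kinit is used for authentication
    GSSAPIAuthentication yes
EOF
# afterwards you can connect with:
ssh betty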
2) Make a workspace in your home folder
We’ll put everything under ~/betty-mnist to keep it tidy.
mkdir -p ~/betty-mnist
cd ~/betty-mnist
3) Load Conda from the module & create your env in home
We’ll place the env at $HOME/envs/betty-mnist so it’s clearly user-owned. For future projects, you can leave this environment here or keep one in a lab project folder to be shared. Take careful note of the CUDA version being requested: the cluster currently only supports CUDA 12.8.
# every new session
module load anaconda3
# enable `conda activate` in non-interactive shells
source "$(conda info --base)/etc/profile.d/conda.sh"
mkdir -p "$HOME/envs"
# --- GPU (when you have GPU allocation) ---
# (Make sure your job requests GPUs; see the sbatch below.)
conda create -y -p "$HOME/envs/betty-mnist" python=3.11 uv -c conda-forge
conda activate "$HOME/envs/betty-mnist"
uv pip install torch torchvision torchmetrics --index-url https://download.pytorch.org/whl/cu128
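Before moving on, it’s worth a quick sanity check that the CUDA build of PyTorch landed in the env. On the login node, torch.cuda.is_available() will usually print False because login nodes typically have no GPUs; the version strings are what matter here:
# Should print the torch version and its CUDA build (12.8 for the cu128 wheels)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"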
Follow up: Check out Mamba or other methods for managing your software in our Software Tutorials.
4) MNIST training script
This is a simple MNIST training script that moves data to the GPU and trains a basic network. The official PyTorch tutorials are the best place to explore how an example like this is constructed. Copy and paste the command below into your terminal to create the training file mnist.py. The script does the following:
- Import system and PyTorch libraries
- Find CUDA devices
- Setup the dataset, model, loss, and optimizer that will be used for training
- Run a single epoch as a reusable function shared by training and testing
- Iterate over multiple epochs
cat > ~/betty-mnist/mnist.py <<'PY'
#!/usr/bin/env python3
# Import python system libraries
import os
# Import pytorch libraries
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchmetrics.functional import accuracy
# Setup CUDA as the device used - fallback to CPU just in case
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"[INFO] torch {torch.__version__} | device: {device} | cuda_available={torch.cuda.is_available()}")
# Setup MNIST for downloading and training
data_dir = os.path.expanduser("~/betty-mnist/data")
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_ds = datasets.MNIST(root=data_dir, train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root=data_dir, train=False, download=True, transform=transform)
# Setup training and testing data loaders. Pinning memory is helpful to optimize GPU transfers
pin = torch.cuda.is_available()
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4, pin_memory=pin)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False, num_workers=4, pin_memory=pin)
# Create the model, loss, and optimizer that will be used for training
model = nn.Sequential(nn.Flatten(), nn.Linear(28*28,256), nn.ReLU(), nn.Linear(256,10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Single epoch pass
def run_epoch(loader, train=True):
    model.train(train)
    total_loss = total_acc = total_count = 0
    for x, y in loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        if train:
            optimizer.zero_grad(set_to_none=True)
        logits = model(x)
        loss = criterion(logits, y)
        if train:
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            pred = torch.argmax(logits, dim=1)
            total_acc += accuracy(pred, y, task="multiclass", num_classes=10).item() * x.size(0)
        total_loss += loss.item() * x.size(0)
        total_count += x.size(0)
    return total_loss / total_count, total_acc / total_count
# Run for 3 epochs - testing after each training pass
for epoch in range(1, 4):
    tr_loss, tr_acc = run_epoch(train_loader, True)
    te_loss, te_acc = run_epoch(test_loader, False)
    print(f"[E{epoch}] train: loss={tr_loss:.4f} acc={tr_acc:.4f} | test: loss={te_loss:.4f} acc={te_acc:.4f}")
# Save the output
save_path = os.path.expanduser("~/betty-mnist/mnist_linear.pt")
torch.save(model.state_dict(), save_path)
print(f"[INFO] saved model to {save_path}")
PY
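Before submitting, you can optionally confirm the file was written correctly and parses as valid Python:
# Byte-compiles the script without running the training loop
python -m py_compile ~/betty-mnist/mnist.py && echo "mnist.py OK"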
Follow up: Look into libraries like PyTorch Lightning so that you can focus on the models instead of the minutiae. They also make it easy to scale to multiple nodes by following the Multi Node Training Tutorial.
5) Slurm batch script
A training job should always be submitted to the cluster with an sbatch script; this gives you the most flexibility for running multiple experiments. Like before, copy and paste the following command to create the sbatch script mnist_gpu.sbatch. This script does the following:
- Configures your Slurm request with #SBATCH lines (1 GPU, 14 CPUs, and 256 GB RAM)
- Loads the conda environment that you just made
- Prints out hostname and nvidia-smi to provide some quick debug info if things go wrong
- Runs mnist.py
cat > ~/betty-mnist/mnist_gpu.sbatch <<'SB'
#!/bin/bash
#SBATCH --job-name=mnist-gpu
#SBATCH --output=slurm-%j.out
#SBATCH --time=00:10:00
#SBATCH --partition=dgx-b200 # example GPU partition on Betty
#SBATCH --gpus=1
#SBATCH --cpus-per-task=14
#SBATCH --mem=256G
module load anaconda3
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$HOME/envs/betty-mnist"
hostname
nvidia-smi || true
python ~/betty-mnist/mnist.py
SB
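If you want to validate the request before actually queueing it, sbatch has a --test-only flag that checks the script and prints an estimated start time without submitting anything:
# Validate the batch script; nothing is submitted to the queue
sbatch --test-only ~/betty-mnist/mnist_gpu.sbatch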
Follow up: More Slurm details can be found in the SLURM Training.
6) Submit & watch
Now you submit the sbatch script and wait for your job to be scheduled!
cd ~/betty-mnist
sbatch mnist_gpu.sbatch
squeue -u $USER
# after you see the JobID:
tail -f slurm-<JobID>.out
You should see 3 quick epochs and a saved model at ~/betty-mnist/mnist_linear.pt. Press Ctrl+C to exit tail -f once the job is done, then list the directory to confirm that your checkpoint is there.
ls ~/betty-mnist
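Once the job has finished, sacct can report its final state and peak memory usage, which is handy when sizing future requests:
# Replace <JobID> with the ID printed by sbatch
sacct -j <JobID> --format=JobID,JobName,State,Elapsed,MaxRSS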
7) Quick re-run later
With everything set up, you can reuse the sbatch file at any time.
ssh <PennKey>@slurm_login.parcc.upenn.edu
module load anaconda3
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$HOME/envs/betty-mnist"
cd ~/betty-mnist
sbatch mnist_gpu.sbatch
Troubleshooting
- conda: command not found → You forgot module load anaconda3.
- conda activate says “not a conda command” → Add the source "$(conda info --base)/etc/profile.d/conda.sh" line first.
- GPU run falls back to CPU → Ensure --gpus=1 and a GPU partition in your sbatch script, and confirm the env has the CUDA build of PyTorch: python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
- No MNIST download (egress restricted) → Run a short interactive job to prefetch:
salloc -t 5 -p genoa-std-mem --mem=4G --cpus-per-task=2
module load anaconda3; source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$HOME/envs/betty-mnist"
python -c "from torchvision import datasets; datasets.MNIST('~/betty-mnist/data', train=True, download=True)"
exit
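If you suspect a GPU problem rather than a download problem, a short interactive GPU job is the quickest check. A sketch assuming the dgx-b200 partition from the sbatch script above:
# Run nvidia-smi on a GPU node for quick debugging
srun -t 5 -p dgx-b200 --gpus=1 nvidia-smi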