Zero to MNIST
Who this is for
New Betty users (especially faculty and their groups) who want a minimal, working GPU training example using MNIST with Slurm.
0) Prerequisites (Faculty signup + access)
- Faculty (PI) must complete ColdFront Training
- User access – your PI will have to add you to their ColdFront project
Tip: If you can’t submit Slurm jobs later, you’re probably missing an association to a Slurm account. Ask your PI to add you to their ColdFront Project allocation.
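You can check your Slurm account associations yourself with sacctmgr. A quick sketch, assuming the standard Slurm client tools are available on the login node:
# List the Slurm accounts your user is associated with
sacctmgr show associations user=$USER format=Account,User,Partition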
1) SSH into Betty
# Replace <PennKey> with your PennKey username
kinit <PennKey>@UPENN.EDU
ssh <PennKey>@slurm_login.parcc.upenn.edu
Follow up: Check out Logging-In to learn more about the login options available to you.
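To save typing on future logins, you can add a host alias to ~/.ssh/config. This is a minimal sketch, assuming the cluster accepts GSSAPI (Kerberos) authentication as the kinit step above suggests; the alias name betty is arbitrary:
cat >> ~/.ssh/config <<'EOF'
Host betty
    HostName slurm_login.parcc.upenn.edu
    # Replace <PennKey> with your PennKey username
    User <PennKey>
    # Assumption: your Kerberos ticket from kinit is used for authentication
    GSSAPIAuthentication yes
EOF
# afterwards you can connect with:
ssh betty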
2) Make a workspace in your home folder
We’ll put everything under ~/betty-mnist to keep it tidy.
mkdir -p ~/betty-mnist
cd ~/betty-mnist
3) Load Conda from the module & create your env in home
We’ll place the env at $HOME/envs/betty-mnist so it’s clearly user-owned. For future projects, you can leave this environment here or keep one in a lab project folder to be shared. Take careful note of the CUDA version being requested: the cluster currently only supports CUDA 12.8.
# every new session
module load anaconda3
# enable `conda activate` in non-interactive shells
source "$(conda info --base)/etc/profile.d/conda.sh"
mkdir -p "$HOME/envs"
# --- GPU (when you have GPU allocation) ---
# (Make sure your job requests GPUs; see the sbatch below.)
conda create -y -p "$HOME/envs/betty-mnist" python=3.11 uv -c conda-forge
conda activate "$HOME/envs/betty-mnist"
uv pip install torch torchvision torchmetrics --index-url https://download.pytorch.org/whl/cu128
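Before moving on, it’s worth a quick sanity check that the CUDA build of PyTorch landed in the env. On the login node, torch.cuda.is_available() will usually print False because login nodes typically have no GPUs; the version strings are what matter here:
# Should print the torch version and its CUDA build (12.8 for the cu128 wheels)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"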
Follow up: Check out Mamba or other methods for managing your software in our Software Tutorials.
4) MNIST training script
This is a simple MNIST training script that moves data to the GPU and trains a basic network. The official PyTorch tutorials are the best place to explore how an example like this is constructed. Copy and paste the command below into your terminal to create the training file mnist.py. The script does the following:
- Import system and PyTorch libraries
- Find CUDA devices
- Setup the dataset, model, loss, and optimizer that will be used for training
- Run a single epoch as a reusable function shared by training and testing
- Iterate over multiple epochs
cat > ~/betty-mnist/mnist.py <<'PY'
#!/usr/bin/env python3
# Import python system libraries
import os
# Import pytorch libraries
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchmetrics.functional import accuracy
# Setup CUDA as the device used - fallback to CPU just in case
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"[INFO] torch {torch.__version__} | device: {device} | cuda_available={torch.cuda.is_available()}")
# Setup MNIST for downloading and training
data_dir = os.path.expanduser("~/betty-mnist/data")
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_ds = datasets.MNIST(root=data_dir, train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root=data_dir, train=False, download=True, transform=transform)
# Setup training and testing data loaders. Pinning memory is helpful to optimize GPU transfers
pin = torch.cuda.is_available()
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=4, pin_memory=pin)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False, num_workers=4, pin_memory=pin)
# Create the model, loss, and optimizer that will be used for training
model = nn.Sequential(nn.Flatten(), nn.Linear(28*28,256), nn.ReLU(), nn.Linear(256,10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Single epoch pass
def run_epoch(loader, train=True):
    model.train(train)
    total_loss = total_acc = total_count = 0
    for x, y in loader:
        x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
        if train:
            optimizer.zero_grad(set_to_none=True)
        logits = model(x)
        loss = criterion(logits, y)
        if train:
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            pred = torch.argmax(logits, dim=1)
            total_acc += accuracy(pred, y, task="multiclass", num_classes=10).item() * x.size(0)
        total_loss += loss.item() * x.size(0)
        total_count += x.size(0)
    return total_loss / total_count, total_acc / total_count
# Run for 3 epochs - testing after each training pass
for epoch in range(1, 4):
    tr_loss, tr_acc = run_epoch(train_loader, True)
    te_loss, te_acc = run_epoch(test_loader, False)
    print(f"[E{epoch}] train: loss={tr_loss:.4f} acc={tr_acc:.4f} | test: loss={te_loss:.4f} acc={te_acc:.4f}")
# Save the output
save_path = os.path.expanduser("~/betty-mnist/mnist_linear.pt")
torch.save(model.state_dict(), save_path)
print(f"[INFO] saved model to {save_path}")
PY
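Before submitting, you can optionally confirm the file was written correctly and parses as valid Python:
# Byte-compiles the script without running the training loop
python -m py_compile ~/betty-mnist/mnist.py && echo "mnist.py OK"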
Follow up: Look into libraries like PyTorch Lightning so that you can focus on the models instead of the minutiae. They also make it easy to scale to multiple nodes by following the Multi Node Training Tutorial.
5) Slurm batch script
A training job should always be submitted to the cluster with an sbatch script; this gives you the most flexibility for running multiple experiments. Like before, copy and paste the following command to create the sbatch script mnist_gpu.sbatch. This script does the following:
- Configures your Slurm request with #SBATCH lines (1 GPU, 14 CPUs, and 256 GB RAM)
- Loads the conda environment that you just made
- Prints out hostname and nvidia-smi to provide some quick debug info if things go wrong
- Runs mnist.py
cat > ~/betty-mnist/mnist_gpu.sbatch <<'SB'
#!/bin/bash
#SBATCH --job-name=mnist-gpu
#SBATCH --output=slurm-%j.out
#SBATCH --time=00:10:00
#SBATCH --partition=dgx-b200 # example GPU partition on Betty
#SBATCH --gpus=1
#SBATCH --cpus-per-task=14
#SBATCH --mem=256G
module load anaconda3
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$HOME/envs/betty-mnist"
hostname
nvidia-smi || true
python ~/betty-mnist/mnist.py
SB
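If you want to validate the request before actually queueing it, sbatch has a --test-only flag that checks the script and prints an estimated start time without submitting anything:
# Validate the batch script; nothing is submitted to the queue
sbatch --test-only ~/betty-mnist/mnist_gpu.sbatch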
Follow up: More Slurm details can be found in the SLURM Training.
6) Submit & watch
Now you submit the sbatch script and wait for your job to be scheduled!
cd ~/betty-mnist
sbatch mnist_gpu.sbatch
squeue -u $USER
# after you see the JobID:
tail -f slurm-<JobID>.out
You should see 3 quick epochs and a saved model at ~/betty-mnist/mnist_linear.pt. Press Ctrl+C to exit tail -f once the job is done, then list the directory to confirm that your checkpoint is there.
ls ~/betty-mnist
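Once the job has finished, sacct can report its final state and peak memory usage, which is handy when sizing future requests:
# Replace <JobID> with the ID printed by sbatch
sacct -j <JobID> --format=JobID,JobName,State,Elapsed,MaxRSS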
7) Quick re-run later
With everything set up, you can reuse the sbatch file at any time.
ssh <PennKey>@slurm_login.parcc.upenn.edu
module load anaconda3
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$HOME/envs/betty-mnist"
cd ~/betty-mnist
sbatch mnist_gpu.sbatch
Troubleshooting
- conda: command not found → You forgot module load anaconda3.
- conda activate says “not a conda command” → Add the source "$(conda info --base)/etc/profile.d/conda.sh" line first.
- GPU run falls back to CPU → Ensure --gpus=1 and a GPU partition in your sbatch script, and confirm the env has the CUDA build of PyTorch: python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
- No MNIST download (egress restricted) → Run a short interactive job to prefetch:
salloc -t 5 -p genoa-std-mem --mem=4G --cpus-per-task=2
module load anaconda3; source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$HOME/envs/betty-mnist"
python -c "from torchvision import datasets; datasets.MNIST('~/betty-mnist/data', train=True, download=True)"
exit
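If you suspect a GPU problem rather than a download problem, a short interactive GPU job is the quickest check. A sketch assuming the dgx-b200 partition from the sbatch script above:
# Run nvidia-smi on a GPU node for quick debugging
srun -t 5 -p dgx-b200 --gpus=1 nvidia-smi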