
SLURM

SLURM (Simple Linux Utility for Resource Management) is one of the workload managers used to allocate computing resources and schedule jobs on the Betty system. Whether you’re running a GPU-based deep learning workload or a memory-intensive genomics pipeline, SLURM ensures that your job is executed on appropriate hardware while managing resource sharing across all users.

At its core, SLURM lets users:

  • Request compute resources (CPUs, GPUs, memory, etc.)
  • Schedule and run jobs (batch or interactive)
  • Monitor and manage job progress

You interact with SLURM using commands such as:

  • srun: run a command with SLURM-managed resources (can be interactive)
  • sbatch: submit a job script for batch execution
  • squeue: view job queue
  • scancel: cancel jobs
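
Each of these commands is covered in more detail below. A typical session strings them together like this (123456 stands in for whatever job ID sbatch reports):

sbatch my_job.sh       # submit a batch job script; prints the new job ID
squeue -u $USER        # check the status of your jobs in the queue
scancel 123456         # cancel a job by ID if you no longer need it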

Betty Queues (Partitions)

A partition in SLURM is a logical grouping of compute nodes. Each partition corresponds to a different type of hardware available on the PARCC infrastructure. Users select the appropriate partition using --partition=<name> in their job scripts.

See the Betty overview for information about the hardware itself.

Partition Name   | Description                                                                      | Resource Limit per SLURM Account
-----------------|----------------------------------------------------------------------------------|---------------------------------
dgx-b200         | Access to NVIDIA DGX B200 nodes with 8 B200 GPUs each (for AI/ML workloads)      | 32 GPUs
dgx-b200-mig90   | Access to a MIG instance with 90 GB of VRAM – useful for medium-sized GPU tasks  | 8 MIGs
dgx-b200-mig45   | Access to a MIG instance with 45 GB of VRAM – useful for small GPU tasks         | 8 MIGs
genoa-std-mem    | AMD EPYC Genoa CPU nodes with standard memory (general CPU jobs)                 | 640 CPUs
genoa-large-mem  | Genoa nodes with increased RAM capacity (for high-memory applications)           | 128 CPUs

When to Use Each Partition

  • Use dgx-b200 when you require GPU acceleration, especially for training large AI models.
  • Use genoa-std-mem for general-purpose CPU-only workloads that don’t need excessive memory.
  • Use genoa-large-mem when your job exceeds 256 GB of RAM, such as large data preprocessing, genome assembly, or large graph workloads.
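
For example, a job that needs more memory than the standard-memory nodes provide would select the large-memory partition. The 300 GB request below is only an illustration of a job that exceeds 256 GB; size it to your actual workload:

#SBATCH --partition=genoa-large-mem
#SBATCH --mem=300G

or, interactively:

srun --partition=genoa-large-mem --ntasks=1 --mem=300G --pty bash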

Hello World SLURM

This is the first command you should run to make sure SLURM is working for your account:

srun --ntasks=1 --cpus-per-task=1 --mem=1G hostname

You should see the hostname of the allocated compute node printed to your terminal.

If this ever stops working, please reach out to PARCC.
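
If your account has access to the dgx-b200 partition (an assumption, since GPU access is granted per account), a similar sanity check for GPUs is to print the visible devices with nvidia-smi:

srun --partition=dgx-b200 --gpus=1 --ntasks=1 --cpus-per-task=1 --mem=4G nvidia-smi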

squeue

squeue is the primary SLURM command to check the status of jobs in the queue — whether they are pending, running, or completing.

Basic usage

squeue -u $USER

This shows all jobs submitted by you.

Output example

JOBID   PARTITION        NAME     USER    ST  TIME  NODES  NODELIST(REASON)
123456  genoa-std-mem    myjob    chaneyk PD   0:00      1  (Resources)
123457  dgx-b200         train    chaneyk  R   5:22      1  dgx003

Understanding the Columns

Column           | Meaning
-----------------|------------------------------------------------
JOBID            | Unique ID for the job
PARTITION        | Queue the job is assigned to
NAME             | Job name from #SBATCH --job-name
USER             | Username of the job submitter
ST (State)       | Current state of the job
TIME             | Time since the job started (or 0:00 if pending)
NODES            | Number of nodes requested
NODELIST(REASON) | Node running the job, or reason for pending

Common Job States (ST)

Code | Description
-----|--------------------------------
PD   | Pending (waiting for resources)
R    | Running
CG   | Completing (cleaning up)
F    | Failed
CD   | Completed
TO   | Timed out

To understand why a job is pending, look at the NODELIST(REASON) column. Common reasons:

  • (Resources): Waiting for hardware (CPUs, GPUs, memory)
  • (Priority): Waiting due to job prioritization
  • (AssocGrpCpuLimit): You’ve hit a user/account CPU limit
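
To dig further into pending jobs, squeue can list only pending jobs or show the scheduler's estimated start time (the estimate comes from backfill scheduling and may be blank):

squeue -u $USER -t PENDING    # list only your pending jobs
squeue -u $USER --start       # show estimated start times, where available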

srun

The srun command is useful when you want an interactive shell with access to cluster resources (CPUs, memory, GPUs) — for debugging, testing code, or running lightweight interactive jobs like Jupyter or Python REPLs.

Example: Interactive Job on a Genoa CPU Node

# Request an interactive Bash shell on a genoa-std-mem node with a single task,
# 4 CPU cores, 16 GB of RAM, and a 1-hour time limit
srun \
  --partition=genoa-std-mem \
  --ntasks=1 \
  --cpus-per-task=4 \
  --mem=16G \
  --time=01:00:00 \
  --pty bash

Once this command runs and a node is allocated, you’ll be placed into a shell on the assigned compute node with the requested resources.

What You Can Modify

Option          | Purpose                                   | Example
----------------|-------------------------------------------|---------------------------------
--partition     | Select queue based on hardware            | dgx-b200, genoa-large-mem
--cpus-per-task | Increase if your job uses multithreading  | --cpus-per-task=8
--mem           | Adjust based on your data needs           | --mem=64G
--time          | Be conservative but realistic             | --time=00:15:00 for short tests
--gpus          | Use only if running on dgx-b200           | --gpus=2
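
Putting those options together, an interactive GPU session on dgx-b200 might look like the sketch below; the GPU count, CPU count, and memory are illustrative values, not recommendations:

# Interactive shell on a DGX B200 node with 1 GPU, 8 CPU cores, and 64 GB of RAM for 1 hour
srun \
  --partition=dgx-b200 \
  --ntasks=1 \
  --cpus-per-task=8 \
  --gpus=1 \
  --mem=64G \
  --time=01:00:00 \
  --pty bash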

sbatch

The sbatch command is used to submit a job script to the SLURM scheduler. Unlike srun, which runs interactively, sbatch runs your job in the background on a compute node once resources become available.

Example: Submitting a CPU-Only Job Script

Save the following as my_job.sh:

#!/bin/bash
#SBATCH --job-name=my_analysis          # Name of the job
#SBATCH --output=logs/%x_%j.out         # Stdout goes to logs/jobname_jobid.out
#SBATCH --error=logs/%x_%j.err          # Stderr goes to logs/jobname_jobid.err
#SBATCH --partition=genoa-std-mem       # Queue to submit to
#SBATCH --ntasks=1                      # Number of tasks (usually one per process)
#SBATCH --cpus-per-task=4               # Number of CPU cores per task
#SBATCH --mem=16G                       # Memory allocation
#SBATCH --time=01:00:00                 # Maximum runtime (hh:mm:ss)

module load python/3.10
python run_analysis.py --input data.csv

Create the logs/ directory first if it does not already exist (SLURM will not create it for you), for example with mkdir -p logs. Then submit with:

sbatch my_job.sh
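
A GPU batch job follows the same pattern, adding a --gpus request and the dgx-b200 partition. The sketch below is illustrative only: the module name (pytorch) and script (train.py) are placeholders, not software guaranteed to exist on Betty, so substitute whatever your workflow actually uses.

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --partition=dgx-b200
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --mem=64G
#SBATCH --time=04:00:00

module load pytorch            # placeholder module name; check module avail on Betty
python train.py                # placeholder for your own training script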

Monitoring sbatch

Because batch jobs run in the background, you cannot watch them the way you can an interactive srun session. Here are some of the tools you can use to monitor their progress.

Job ID Confirmation

When you submit a job:

$ sbatch my_job.sh
Submitted batch job 123456

This returns a Job ID (e.g., 123456). Keep this number — you’ll use it to track your job.

Check Job Status

Use squeue to view the job’s current state:

squeue -u $USER

Look at the ST (state) column to see where the job currently is in the SLURM system.

To get info for a specific job:

scontrol show job <job_id>
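
The full scontrol output is verbose. To pull out the fields you usually care about, you can filter it with grep; JobState, Reason, RunTime, and TimeLimit are standard fields in scontrol show job output:

scontrol show job <job_id> | grep -E 'JobState|Reason|RunTime|TimeLimit'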

Check Output and Error Logs

If your script includes:

#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

Check the logs/ directory for files like:

  • my_analysis_123456.out – standard output
  • my_analysis_123456.err – error messages

These will tell you:

  • Whether your script started successfully
  • Any runtime errors or tracebacks
  • Output of print() or echo statements
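
To follow a running job's output as it is written, you can tail the log file directly (using the example job name and ID from above):

tail -f logs/my_analysis_123456.out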

Resource Usage (after completion)

Once your job finishes, check its resource usage:

sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS,MaxVMSize,AllocCPUs,ReqMem

This tells you:

  • How long your job ran
  • How much memory was used (MaxRSS)
  • Whether you used your requested CPUs effectively

Tip: If MaxRSS is close to or exceeds ReqMem, your job may have been killed for OOM (out-of-memory).
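
Many SLURM sites also install the seff utility from SLURM's contrib tools, which summarizes CPU and memory efficiency for a finished job on one screen. Whether it is available on Betty is an assumption you can check with which seff:

seff <job_id>    # prints an efficiency summary, if seff is installed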

Common Follow-Ups

Job stuck in PD (Pending)?
Check with:

scontrol show job <job_id>

Look for Reason= — it may say Resources, QOSMaxCpuPerUser, etc.

Job failed quickly?
Check the .err log for Python errors, module issues, or file path problems.
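
One quick way to surface the failure is to search the error log for common signals; the pattern below is only a starting point (using the example log file from above):

grep -iE 'error|traceback|no such file' logs/my_analysis_123456.err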

scancel

Sometimes you may need to stop a job — maybe you submitted it with the wrong parameters, found a bug, or no longer need the computation. SLURM provides the scancel command to terminate jobs manually.

Basic Usage

To cancel a specific job:

scancel <job_id>

For example:

scancel 123456

This will immediately remove the job from the queue (if pending) or terminate it (if running).

Cancel All Your Jobs

To cancel all jobs you’ve submitted:

scancel -u $USER

This is especially useful if you’ve submitted multiple incorrect jobs or want to clear your queue.

Cancel by Job Name or State

  • Cancel all pending jobs:

    scancel -u $USER --state=PENDING

  • Cancel jobs by name:

    scancel --name=my_analysis

This cancels all jobs with the given name — useful if you’re batch-submitting test jobs.

Cancel a Job Array

If you submitted a job array:

sbatch --array=1-10 my_array_job.sh

You can cancel the entire array:

scancel <array_job_id>

Or cancel just one subtask:

scancel <array_job_id>_<task_id>

Example:

scancel 123456_7

Tips and Warnings

Always double-check the job ID before cancelling, especially when using scancel -u $USER.

If a job is stuck in CG (completing) or TO (timed out), scancel may not help — contact support if it persists.

Use squeue to confirm whether the job has been removed from the queue.