
SLURM

SLURM (Simple Linux Utility for Resource Management) is one of the workload managers used to allocate computing resources and schedule jobs on the Betty system. Whether you’re running a GPU-based deep learning workload or a memory-intensive genomics pipeline, SLURM ensures that your job is executed on appropriate hardware while managing resource sharing across all users.

At its core, SLURM lets users:

  • Request compute resources (CPUs, GPUs, memory, etc.)
  • Schedule and run jobs (batch or interactive)
  • Monitor and manage job progress

You interact with SLURM using commands such as:

  • srun: run a command with SLURM-managed resources (can be interactive)
  • sbatch: submit a job script for batch execution
  • squeue: view job queue
  • scancel: cancel jobs
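
Each of these commands is covered in more detail below. A typical session strings them together like this (123456 stands in for whatever job ID sbatch reports):

sbatch my_job.sh       # submit a batch job script; prints the new job ID
squeue -u $USER        # check the status of your jobs in the queue
scancel 123456         # cancel a job by ID if you no longer need it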

Betty Queues (Partitions)

A partition in SLURM is a logical grouping of compute nodes. Each partition corresponds to a different type of hardware available on the PARCC infrastructure. Users select the appropriate partition using --partition=<name> in their job scripts.

See the Betty overview for information about the hardware itself.

Partition Name   | Description                                                                      | Resource Limit per SLURM Account
-----------------|----------------------------------------------------------------------------------|---------------------------------
dgx-b200         | Access to NVIDIA DGX B200 nodes with 8 B200 GPUs each (for AI/ML workloads)      | 32 GPUs
dgx-b200-mig90   | Access to a MIG instance with 90 GB of VRAM – useful for medium-sized GPU tasks  | 8 MIGs
dgx-b200-mig45   | Access to a MIG instance with 45 GB of VRAM – useful for small GPU tasks         | 8 MIGs
genoa-std-mem    | AMD EPYC Genoa CPU nodes with standard memory (general CPU jobs)                 | 640 CPUs
genoa-large-mem  | Genoa nodes with increased RAM capacity (for high-memory applications)           | 128 CPUs

When to Use Each Partition

  • Use dgx-b200 when you require GPU acceleration, especially for training large AI models.
  • Use genoa-std-mem for general-purpose CPU-only workloads that don’t need excessive memory.
  • Use genoa-large-mem when your job exceeds 256 GB of RAM, such as large data preprocessing, genome assembly, or large graph workloads.
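
For example, a job that needs more memory than the standard-memory nodes provide would select the large-memory partition. The 300 GB request below is only an illustration of a job that exceeds 256 GB; size it to your actual workload:

#SBATCH --partition=genoa-large-mem
#SBATCH --mem=300G

or, interactively:

srun --partition=genoa-large-mem --ntasks=1 --mem=300G --pty bash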

Hello World SLURM

This is the first command you should run to make sure SLURM is working for your account:

srun --ntasks=1 --cpus-per-task=1 --mem=1G hostname

You should see the hostname of the allocated compute node printed to your terminal.

If this ever stops working, please reach out to PARCC.
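
If your account has access to the dgx-b200 partition (an assumption, since GPU access is granted per account), a similar sanity check for GPUs is to print the visible devices with nvidia-smi:

srun --partition=dgx-b200 --gpus=1 --ntasks=1 --cpus-per-task=1 --mem=4G nvidia-smi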

squeue

squeue is the primary SLURM command to check the status of jobs in the queue — whether they are pending, running, or completing.

Basic usage

squeue -u $USER

This shows all jobs submitted by you.

Output example

JOBID   PARTITION        NAME     USER    ST  TIME  NODES  NODELIST(REASON)
123456  genoa-std-mem    myjob    chaneyk PD   0:00      1  (Resources)
123457  dgx-b200         train    chaneyk  R   5:22      1  dgx003

Understanding the Columns

Column           | Meaning
-----------------|------------------------------------------------
JOBID            | Unique ID for the job
PARTITION        | Queue the job is assigned to
NAME             | Job name from #SBATCH --job-name
USER             | Username of the job submitter
ST (State)       | Current state of the job
TIME             | Time since the job started (or 0:00 if pending)
NODES            | Number of nodes requested
NODELIST(REASON) | Node running the job, or reason for pending

Common Job States (ST)

Code | Description
-----|--------------------------------
PD   | Pending (waiting for resources)
R    | Running
CG   | Completing (cleaning up)
F    | Failed
CD   | Completed
TO   | Timed out

To understand why a job is pending, look at the NODELIST(REASON) column. Common reasons:

  • (Resources): Waiting for hardware (CPUs, GPUs, memory)
  • (Priority): Waiting due to job prioritization
  • (AssocGrpCpuLimit): You’ve hit a user/account CPU limit
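
To dig further into pending jobs, squeue can list only pending jobs or show the scheduler's estimated start time (the estimate comes from backfill scheduling and may be blank):

squeue -u $USER -t PENDING    # list only your pending jobs
squeue -u $USER --start       # show estimated start times, where available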

srun

The srun command is useful when you want an interactive shell with access to cluster resources (CPUs, memory, GPUs) — for debugging, testing code, or running lightweight interactive jobs like Jupyter or Python REPLs.

Example: Interactive Job on a Genoa CPU Node

# Request an interactive Bash shell on a genoa-std-mem node with a single task,
# 4 CPU cores, 16 GB of RAM, and a 1-hour time limit
srun \
  --partition=genoa-std-mem \
  --ntasks=1 \
  --cpus-per-task=4 \
  --mem=16G \
  --time=01:00:00 \
  --pty bash

Once this command runs and a node is allocated, you’ll be placed into a shell on the assigned compute node with the requested resources.

What You Can Modify

Option          | Purpose                                   | Example
----------------|-------------------------------------------|---------------------------------
--partition     | Select queue based on hardware            | dgx-b200, genoa-large-mem
--cpus-per-task | Increase if your job uses multithreading  | --cpus-per-task=8
--mem           | Adjust based on your data needs           | --mem=64G
--time          | Be conservative but realistic             | --time=00:15:00 for short tests
--gpus          | Use only if running on dgx-b200           | --gpus=2
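
Putting those options together, an interactive GPU session on dgx-b200 might look like the sketch below; the GPU count, CPU count, and memory are illustrative values, not recommendations:

# Interactive shell on a DGX B200 node with 1 GPU, 8 CPU cores, and 64 GB of RAM for 1 hour
srun \
  --partition=dgx-b200 \
  --ntasks=1 \
  --cpus-per-task=8 \
  --gpus=1 \
  --mem=64G \
  --time=01:00:00 \
  --pty bash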

sbatch

The sbatch command is used to submit a job script to the SLURM scheduler. Unlike srun, which runs interactively, sbatch runs your job in the background on a compute node once resources become available.

Example: Submitting a CPU-Only Job Script

Save the following as my_job.sh:

#!/bin/bash
#SBATCH --job-name=my_analysis          # Name of the job
#SBATCH --output=logs/%x_%j.out         # Stdout goes to logs/jobname_jobid.out
#SBATCH --error=logs/%x_%j.err          # Stderr goes to logs/jobname_jobid.err
#SBATCH --partition=genoa-std-mem       # Queue to submit to
#SBATCH --ntasks=1                      # Number of tasks (usually one per process)
#SBATCH --cpus-per-task=4               # Number of CPU cores per task
#SBATCH --mem=16G                       # Memory allocation
#SBATCH --time=01:00:00                 # Maximum runtime (hh:mm:ss)

module load python/3.10
python run_analysis.py --input data.csv

Create the logs/ directory first if it does not already exist (SLURM will not create it for you), for example with mkdir -p logs. Then submit with:

sbatch my_job.sh
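
A GPU batch job follows the same pattern, adding a --gpus request and the dgx-b200 partition. The sketch below is illustrative only: the module name (pytorch) and script (train.py) are placeholders, not software guaranteed to exist on Betty, so substitute whatever your workflow actually uses.

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --partition=dgx-b200
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus=1
#SBATCH --mem=64G
#SBATCH --time=04:00:00

module load pytorch            # placeholder module name; check module avail on Betty
python train.py                # placeholder for your own training script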

Monitoring sbatch

Because batch jobs run in the background, you cannot watch them the way you can an interactive srun session. Here are some of the tools you can use to monitor their progress.

Job ID Confirmation

When you submit a job:

$ sbatch my_job.sh
Submitted batch job 123456

This returns a Job ID (e.g., 123456). Keep this number — you’ll use it to track your job.

Check Job Status

Use squeue to view the job’s current state:

squeue -u $USER

Look at the ST (state) column to see where the job currently is in the SLURM system.

To get info for a specific job:

scontrol show job <job_id>
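
The full scontrol output is verbose. To pull out the fields you usually care about, you can filter it with grep; JobState, Reason, RunTime, and TimeLimit are standard fields in scontrol show job output:

scontrol show job <job_id> | grep -E 'JobState|Reason|RunTime|TimeLimit'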

Check Output and Error Logs

If your script includes:

#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

Check the logs/ directory for files like:

  • my_analysis_123456.out – standard output
  • my_analysis_123456.err – error messages

These will tell you:

  • Whether your script started successfully
  • Any runtime errors or tracebacks
  • Output of print() or echo statements
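
To follow a running job's output as it is written, you can tail the log file directly (using the example job name and ID from above):

tail -f logs/my_analysis_123456.out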

Resource Usage (after completion)

Once your job finishes, check its resource usage:

sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS,MaxVMSize,AllocCPUs,ReqMem

This tells you:

  • How long your job ran
  • How much memory was used (MaxRSS)
  • Whether you used your requested CPUs effectively

Tip: If MaxRSS is close to or exceeds ReqMem, your job may have been killed for OOM (out-of-memory).
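
Many SLURM sites also install the seff utility from SLURM's contrib tools, which summarizes CPU and memory efficiency for a finished job on one screen. Whether it is available on Betty is an assumption you can check with which seff:

seff <job_id>    # prints an efficiency summary, if seff is installed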

Common Follow-Ups

Job stuck in PD (Pending)?
Check with:

scontrol show job <job_id>

Look for Reason= — it may say Resources, QOSMaxCpuPerUser, etc.

Job failed quickly?
Check the .err log for Python errors, module issues, or file path problems.
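
One quick way to surface the failure is to search the error log for common signals; the pattern below is only a starting point (using the example log file from above):

grep -iE 'error|traceback|no such file' logs/my_analysis_123456.err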

scancel

Sometimes you may need to stop a job — maybe you submitted it with the wrong parameters, found a bug, or no longer need the computation. SLURM provides the scancel command to terminate jobs manually.

Basic Usage

To cancel a specific job:

scancel <job_id>

For example:

scancel 123456

This will immediately remove the job from the queue (if pending) or terminate it (if running).

Cancel All Your Jobs

To cancel all jobs you’ve submitted:

scancel -u $USER

This is especially useful if you’ve submitted multiple incorrect jobs or want to clear your queue.

Cancel by Job Name or State

  • Cancel all pending jobs:

    scancel -u $USER --state=PENDING

  • Cancel jobs by name:

    scancel --name=my_analysis

This cancels all jobs with the given name — useful if you’re batch-submitting test jobs.

Cancel a Job Array

If you submitted a job array:

sbatch --array=1-10 my_array_job.sh

You can cancel the entire array:

scancel <array_job_id>

Or cancel just one subtask:

scancel <array_job_id>_<task_id>

Example:

scancel 123456_7

Tips and Warnings

Always double-check the job ID before cancelling, especially when using scancel -u $USER.

If a job is stuck in CG (completing) or TO (timed out), scancel may not help — contact support if it persists.

Use squeue to confirm whether the job has been removed from the queue.