SLURM
SLURM (Simple Linux Utility for Resource Management) is one of the workload managers used to allocate computing resources and schedule jobs on the Betty system. Whether you’re running a GPU-based deep learning workload or a memory-intensive genomics pipeline, SLURM ensures that your job is executed on appropriate hardware while managing resource sharing across all users.
At its core, SLURM lets users:
- Request compute resources (CPUs, GPUs, memory, etc.)
- Schedule and run jobs (batch or interactive)
- Monitor and manage job progress
You interact with SLURM using commands such as:
- srun: run a command with SLURM-managed resources (can be interactive)
- sbatch: submit a job script for batch execution
- squeue: view the job queue
- scancel: cancel jobs
Betty Queues (Partitions)
A partition in SLURM is a logical grouping of compute nodes. Each partition corresponds to a different type of hardware available on the PARCC infrastructure. Users select the appropriate partition using --partition=<name> in their job scripts.
See the Betty overview for information about the hardware itself.
| Partition Name | Description | Resource Limit per SLURM Account |
| --- | --- | --- |
| dgx-b200 | Access to NVIDIA DGX B200 nodes with 8 B200 GPUs each (for AI/ML workloads) | 32 GPUs |
| dgx-b200-mig90 | Access to a MIG instance with 90 GB of VRAM – useful for medium-sized GPU tasks | 8 MIGs |
| dgx-b200-mig45 | Access to a MIG instance with 45 GB of VRAM – useful for small GPU tasks | 8 MIGs |
| genoa-std-mem | AMD EPYC Genoa CPU nodes with standard memory (general CPU jobs) | 640 CPUs |
| genoa-large-mem | Genoa nodes with increased RAM capacity (for high-memory applications) | 128 CPUs |
When to Use Each Partition
- Use dgx-b200 when you require GPU acceleration, especially for training large AI models.
- Use genoa-std-mem for general-purpose CPU-only workloads that don’t need excessive memory.
- Use genoa-large-mem when your job exceeds 256 GB of RAM, such as large data preprocessing, genome assembly, or large graph workloads (see the example below).
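For example, a job that needs more than 256 GB of RAM would target the large-memory partition. The following sketch shows only the partition-related directives; the memory and time values are illustrative, not recommendations:
#!/bin/bash
#SBATCH --partition=genoa-large-mem   # queue with high-memory Genoa nodes
#SBATCH --mem=300G                    # illustrative request above the 256 GB threshold
#SBATCH --time=04:00:00               # illustrative runtime limit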
Hello World SLURM
This is the first command you should run to make sure SLURM is working for your account:
srun --ntasks=1 --cpus-per-task=1 --mem=1G hostname
You will see the hostname printed out onto the command line.
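The output is simply the name of the compute node that ran the task, for example (the node name shown here is hypothetical and will differ on your run):
genoa001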
If this ever stops working, please reach out to PARCC.
squeue
squeue is the primary SLURM command to check the status of jobs in the queue — whether they are pending, running, or recently completed.
Basic usage
squeue -u $USER
This shows all jobs submitted by you.
Output example
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 genoa-std-mem myjob chaneyk PD 0:00 1 (Resources)
123457 dgx-b200 train chaneyk R 5:22 1 dgx003
Understanding the Columns
| Column | Meaning |
| --- | --- |
| JOBID | Unique ID for the job |
| PARTITION | Queue the job is assigned to |
| NAME | Job name from #SBATCH --job-name |
| USER | Username of the job submitter |
| ST (State) | Current state of the job |
| TIME | Time since the job started (or 0:00 if pending) |
| NODES | Number of nodes requested |
| NODELIST(REASON) | Node running the job, or reason for pending |
Common Job States (ST)
| Code | Description |
| --- | --- |
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing (cleaning up) |
| F | Failed |
| CD | Completed |
| TO | Timed out |
To understand why a job is pending, look at the NODELIST(REASON) column. Common reasons:
- (Resources): Waiting for hardware (CPUs, GPUs, memory)
- (Priority): Waiting due to job prioritization
- (AssocGrpCpuLimit): You’ve hit a user/account CPU limit
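If the reason string is truncated in the default view, you can ask squeue for a wider, customized layout. The format string below is just one possible arrangement of the standard squeue fields (job ID, partition, name, state, reason/nodelist):
squeue -u $USER -t PENDING -o "%.10i %.15P %.20j %.2t %R"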
srun
The srun command is useful when you want an interactive shell with access to cluster resources (CPUs, memory, GPUs) — for debugging, testing code, or running lightweight interactive jobs like Jupyter or Python REPLs.
Example: Interactive Job on a Genoa CPU Node
# Note: Bash does not allow a comment after a line-continuation backslash,
# so the options are described here instead:
#   --partition=genoa-std-mem   choose the partition (queue) to run in
#   --ntasks=1                  run a single task (one process)
#   --cpus-per-task=4           request 4 CPU cores for that task
#   --mem=16G                   request 16 GB of RAM
#   --time=01:00:00             set a time limit of 1 hour
#   --pty bash                  start an interactive Bash shell
srun --partition=genoa-std-mem --ntasks=1 --cpus-per-task=4 \
  --mem=16G --time=01:00:00 --pty bash
Once this command runs and a node is allocated, you’ll be placed into a shell on the assigned compute node with the requested resources.
What You Can Modify
| Option | Purpose | Example |
| --- | --- | --- |
| --partition | Select queue based on hardware | dgx-b200, genoa-large-mem |
| --cpus-per-task | Increase if your job uses multithreading | --cpus-per-task=8 |
| --mem | Adjust based on your data needs | --mem=64G |
| --time | Be conservative but realistic | --time=00:15:00 for short tests |
| --gpus | Use only if running on dgx-b200 | --gpus=2 |
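Putting these options together, an interactive GPU debugging session on the dgx-b200 partition might look like the following sketch; the GPU count, CPU count, memory, and time here are illustrative values, not recommendations:
srun --partition=dgx-b200 --ntasks=1 --cpus-per-task=8 \
  --gpus=1 --mem=64G --time=00:30:00 --pty bash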
sbatch
The sbatch command is used to submit a job script to the SLURM scheduler. Unlike srun, which runs interactively, sbatch runs your job in the background on a compute node once resources become available.
Example: Submitting a CPU-Only Job Script
Save the following as my_job.sh:
#!/bin/bash
#SBATCH --job-name=my_analysis # Name of the job
#SBATCH --output=logs/%x_%j.out # Stdout goes to logs/jobname_jobid.out
#SBATCH --error=logs/%x_%j.err # Stderr goes to logs/jobname_jobid.err
#SBATCH --partition=genoa-std-mem # Queue to submit to
#SBATCH --ntasks=1 # Number of tasks (usually one per process)
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --mem=16G # Memory allocation
#SBATCH --time=01:00:00 # Maximum runtime (hh:mm:ss)
module load python/3.10
python run_analysis.py --input data.csv
Then submit with:
sbatch my_job.sh
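A GPU variant of the same pattern is sketched below. The job name, GPU count, and training script are placeholders for whatever your workflow actually uses; only the general shape of the directives is the point here:
#!/bin/bash
#SBATCH --job-name=train_model       # Name of the job (placeholder)
#SBATCH --output=logs/%x_%j.out      # Stdout goes to logs/jobname_jobid.out
#SBATCH --error=logs/%x_%j.err       # Stderr goes to logs/jobname_jobid.err
#SBATCH --partition=dgx-b200         # GPU queue
#SBATCH --gpus=2                     # Illustrative: two B200 GPUs
#SBATCH --cpus-per-task=16           # CPU cores for data loading
#SBATCH --mem=128G                   # Memory allocation
#SBATCH --time=08:00:00              # Maximum runtime (hh:mm:ss)

module load python/3.10
python train_model.py --config config.yaml   # Placeholder training script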
Monitoring sbatch
Because batch jobs run in the background, you cannot watch their status directly. Here are some of the tools you can use to monitor their progress.
Job ID Confirmation
When you submit a job:
$ sbatch my_job.sh
Submitted batch job 123456
This returns a Job ID (e.g., 123456). Keep this number — you’ll use it to track your job.
Check Job Status
Use squeue to view the job’s current state:
squeue -u $USER
Look at the ST (state) column to see where the job currently sits in the SLURM system.
To get info for a specific job:
scontrol show job <job_id>
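If the watch utility is available on the login node, you can refresh the queue view automatically instead of rerunning squeue by hand, for example every 30 seconds:
watch -n 30 squeue -u $USER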
Check Output and Error Logs
If your script includes:
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
Check the logs/ directory for files like:
- my_analysis_123456.out – standard output
- my_analysis_123456.err – error messages
These will tell you:
- Whether your script started successfully
- Any runtime errors or tracebacks
- Output of print() or echo statements
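To follow a log while the job is still running, you can tail it; the filename below is taken from the example above:
tail -f logs/my_analysis_123456.out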
Resource Usage (after completion)
Once your job finishes, check its resource usage:
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS,MaxVMSize,AllocCPUs,ReqMem
This tells you:
- How long your job ran
- How much memory was used (MaxRSS)
- Whether you used your requested CPUs effectively
Tip: If MaxRSS is close to or exceeds ReqMem, your job may have been killed for OOM (out of memory).
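One way to check is to look at the job state recorded in accounting; an out-of-memory kill typically shows up in the State column, although the exact field values can vary by SLURM version:
sacct -j <job_id> --format=JobID,State,ExitCode,MaxRSS,ReqMem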
Common Follow-Ups
Job stuck in PD (Pending)?
Check with:
scontrol show job <job_id>
Look for Reason= — it may say Resources, QOSMaxCpuPerUser, etc.
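The full scontrol output is long; filtering it down to the state and reason fields is often enough:
scontrol show job <job_id> | grep -E "JobState|Reason"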
Job failed quickly?
Check the .err log for Python errors, module issues, or file path problems.
scancel
Sometimes you may need to stop a job — maybe you submitted it with the wrong parameters, found a bug, or no longer need the computation. SLURM provides the scancel command to terminate jobs manually.
Basic Usage
To cancel a specific job:
scancel <job_id>
For example:
scancel 123456
This will immediately remove the job from the queue (if pending) or terminate it (if running).
Cancel All Your Jobs
To cancel all jobs you’ve submitted:
scancel -u $USER
This is especially useful if you’ve submitted multiple incorrect jobs or want to clear your queue.
Cancel by Job Name or State
- Cancel all pending jobs:
scancel -u $USER --state=PENDING
- Cancel jobs by name:
scancel --name=my_analysis
This cancels all jobs with the given name — useful if you’re batch-submitting test jobs.
Cancel a Job Array
If you submitted a job array:
sbatch --array=1-10 my_array_job.sh
You can cancel the entire array:
scancel <array_job_id>
Or cancel just one subtask:
scancel <array_job_id>_<task_id>
Example:
scancel 123456_7
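Depending on your SLURM version, you may also be able to cancel a range of array tasks in one command, for example:
scancel 123456_[2-5]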
Tips and Warnings
Always double-check the job ID before cancelling, especially when using scancel -u $USER.
If a job is stuck in CG (completing) or TO (timed out), scancel may not help — contact support if it persists.
Use squeue to confirm whether the job has been removed from the queue.