Zero to MPI
From Zero to Simple MPI on Betty
0) Prereqs (faculty signup)
- PI creates a ColdFront project, requests compute and storage allocations, and adds members.
- After approval, you can SSH in and submit jobs under the project’s Slurm account (see the example below).
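If you need to charge a specific project, Slurm’s --account flag selects it. A minimal sketch, where <project-account> is a placeholder for the account name attached to your ColdFront allocation:
# submit a job against a specific Slurm account (placeholder name)
sbatch --account=<project-account> my_job.sbatch
# or set it inside the batch script itself:
#SBATCH --account=<project-account>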
1) SSH to Betty
# Replace <PennKey> with your PennKey
ssh <PennKey>@slurm_login.parcc.upenn.edu
2) Make a workspace in your home
mkdir -p ~/betty-mpi
cd ~/betty-mpi
3) Load modules & (optional) create a Python MPI env
We’ll use the system OpenMPI module.
# every new session:
module load openmpi
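The optional Python MPI environment is only needed if you want to drive MPI from Python (e.g. mpi4py). A minimal sketch, assuming a python module is available on Betty; check module avail python for the exact name:
# optional, once per workspace: create a venv and install mpi4py
module load python
python -m venv ~/betty-mpi/venv
source ~/betty-mpi/venv/bin/activate
# keep the openmpi module loaded so mpi4py builds against the mpicc on your PATH
pip install mpi4py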
4) Simple MPI programs
hello_mpi.c A simple hello world in which each rank (process) performs a few common tasks.
- MPI_Init initializes the MPI environment for the rank
- MPI_Comm_size and MPI_Comm_rank query the communicator for the total number of ranks and this rank’s ID
- printf writes a line to the job log identifying the rank, the world size, and the host it is running on
- MPI_Finalize shuts down the MPI environment before the process exits
cat > ~/betty-mpi/hello_mpi.c <<'C'
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                      /* initialize the MPI environment */

    int world_size, world_rank;
    char hostname[256];
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* this process's rank ID */
    gethostname(hostname, sizeof(hostname));     /* node this rank landed on */

    printf("Hello from rank %d of %d on %s\n", world_rank, world_size, hostname);

    MPI_Finalize();                              /* shut down MPI before exiting */
    return 0;
}
C
pingpong.c A simple test utility that measures the round-trip latency between rank 0 and rank 1.
cat > ~/betty-mpi/pingpong.c <<'C'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int r, s;
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &s);
    if (s < 2) {                       /* need at least two ranks to ping-pong */
        if (!r) fprintf(stderr, "Run with >=2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    char b = 'x';
    const int N = 10000;               /* number of round trips to average over */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < N; i++) {
        if (r == 0) {                  /* rank 0 sends, then waits for the echo */
            MPI_Send(&b, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&b, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (r == 1) {           /* rank 1 receives, then echoes back */
            MPI_Recv(&b, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&b, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double us = (MPI_Wtime() - t0) * 1e6 / N;   /* average round trip in microseconds */
    if (r == 0) printf("Ping-pong round-trip ~ %.2f us over %d iters\n", us, N);
    MPI_Finalize();
    return 0;
}
C
5) Slurm scripts
These Slurm batch scripts load openmpi (CPU nodes) or nvhpc (DGX nodes), set the relevant environment variables, and then run hello_mpi and pingpong as examples. Because the two architectures differ, each job compiles its binaries just in time and tags them by node type.
A) Genoa CPU nodes (2 nodes × 2 ranks each = 4 MPI ranks)
cat > ~/betty-mpi/mpi_genoa.sbatch <<'SB'
#!/bin/bash
#SBATCH --job-name=mpi-genoa
#SBATCH --output=slurm-%j.out
#SBATCH --time=00:05:00
#SBATCH --partition=genoa-std-mem
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
module load openmpi/4.1.6
echo "Hosts in allocation:"; scontrol show hostnames $SLURM_JOB_NODELIST
echo "Version of mpicc being used"; which mpicc
echo ""
echo "Hello world"
echo ""
# C example
mpicc -O2 -march=x86-64-v2 -mtune=generic -o hello_mpi_genoa hello_mpi.c
srun --mpi=pmix_v3 ./hello_mpi_genoa
echo ""
echo "Ping pong"
echo ""
mpicc -O2 -march=x86-64-v2 -mtune=generic -o pingpong_genoa pingpong.c
srun --mpi=pmix_v3 ./pingpong_genoa
SB
B) DGX GPU nodes (2 nodes × 1 rank each = 2 MPI ranks)
cat > ~/betty-mpi/mpi_dgx.sbatch <<'SB'
#!/bin/bash
#SBATCH --job-name=mpi-dgx
#SBATCH --output=slurm-%j.out
#SBATCH --time=00:05:00
#SBATCH --partition=dgx-b200
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4G
module load nvhpc/25.5
echo "Current HPCX env:"; env | grep HPCX
# export UCX_LOG_LEVEL=info
export UCX_NET_DEVICES=mlx5_15:1,mlx5_10:1,mlx5_14:1,mlx5_13:1,mlx5_8:1,mlx5_7:1,mlx5_9:1,mlx5_4:1
export UCX_TLS=rc_x,sm,self
export OMPI_MCA_pml=ucx
ulimit -l unlimited
# Never use mlx5_0 - mlx5_3: these are internal devices
echo "Hosts in allocation:"; scontrol show hostnames $SLURM_JOB_NODELIST
echo ""
echo "Hello world"
echo ""
mpicc -O2 -o hello_mpi_dgx hello_mpi.c
srun --mpi=pmix ./hello_mpi_dgx
echo ""
echo "Ping pong"
echo ""
mpicc -O2 -o pingpong_dgx pingpong.c
srun --mpi=pmix ./pingpong_dgx
SB
6) Submit & watch
cd ~/betty-mpi
sbatch mpi_genoa.sbatch # CPU on Genoa
sbatch mpi_dgx.sbatch
squeue -u $USER
# after seeing your JobID:
tail -f slurm-<JobID>.out
You should see lines like:
Hello from rank 0 of 4 on genoa-5-21
Hello from rank 1 of 4 on genoa-5-21
...
Ping pong
Ping-pong round-trip ~ 0.25 us over 10000 iters
Confirm that the ping-pong round-trip time reported is under 10 microseconds.
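A quick way to pull that number back out of the job log (same <JobID> placeholder as above):
grep "Ping-pong" slurm-<JobID>.out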
7) Fabric interfaces
On the DGX nodes, explicitly select the compute fabric interfaces and the UCX transports to use. Never use mlx5_0 - mlx5_3; these are internal devices.
export UCX_NET_DEVICES=mlx5_15:1,mlx5_10:1,mlx5_14:1,mlx5_13:1,mlx5_8:1,mlx5_7:1,mlx5_9:1,mlx5_4:1
export UCX_TLS=rc_x,sm,self
export OMPI_MCA_pml=ucx
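If you are unsure which devices and transports are actually present on a node, the rdma-core and UCX utilities can list them. A generic sketch (run inside an interactive job on a DGX node; it assumes ibv_devinfo and ucx_info are on your PATH, e.g. via the nvhpc module):
# list the InfiniBand/RoCE devices visible on this node
ibv_devinfo -l
# show the transports and devices UCX can use
ucx_info -d | grep -E 'Transport|Device'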
8) Quick re-run later
ssh <PennKey>@slurm_login.parcc.upenn.edu
cd ~/betty-mpi
sbatch mpi_genoa.sbatch
sbatch mpi_dgx.sbatch
Troubleshooting
- orterun/mpirun not found or mpicc: command not found
  You forgot module load openmpi.
- Partition/account errors
  Confirm your Slurm account (and see the partition check below):
  sacctmgr show user $USER withassoc | awk 'NR>2 && $2!=""{print $2}' | sort -u
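To check which partitions you can submit to (for example genoa-std-mem or dgx-b200), a generic Slurm sketch:
# summary of partitions and node states
sinfo -s
# details (limits, allowed accounts) for one partition
scontrol show partition genoa-std-mem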