
From Zero to Simple MPI on Betty

0) Prereqs (faculty signup)

  • The PI creates a ColdFront project, requests compute and storage allocations, and adds members.
  • After approval, you can SSH in and submit jobs using the project’s Slurm account (see the example below).
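If your PennKey is attached to more than one project, you may need to tell Slurm which account to charge. A minimal sketch; <project_account> is a placeholder for your ColdFront project’s account name (the Troubleshooting section shows how to list yours):

# On the command line:
sbatch --account=<project_account> mpi_genoa.sbatch

# Or inside the batch script itself:
#SBATCH --account=<project_account>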

1) SSH to Betty

# Replace <PennKey> with your PennKey
ssh <PennKey>@slurm_login.parcc.upenn.edu

2) Make a workspace in your home

mkdir -p ~/betty-mpi
cd ~/betty-mpi

3) Load modules & (optional) create a Python MPI env

We’ll use the system OpenMPI module.

# every new session:
module load openmpi
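The Python MPI environment is optional and not used by the rest of this guide. If you want one for mpi4py work, a minimal sketch, assuming openmpi is already loaded as above; the python module name is an assumption, so check module avail for what Betty actually provides:

module load python                      # assumption: adjust to the Python module available on Betty
python -m venv ~/betty-mpi/venv
source ~/betty-mpi/venv/bin/activate
pip install mpi4py                      # builds against the mpicc from the loaded OpenMPI module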

4) Simple MPI programs

hello_mpi.c A simple hello world in which each rank (process) performs a few common tasks:

  • MPI_Init initializes the MPI environment
  • MPI_Comm_size and MPI_Comm_rank query the communicator for the total number of ranks and this rank’s ID
  • printf writes a line to the log identifying the rank and the host it is running on
  • MPI_Finalize shuts down the MPI environment before the process exits
cat > ~/betty-mpi/hello_mpi.c <<'C'
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_size, world_rank, name_len;
    char hostname[256];
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    gethostname(hostname, sizeof(hostname));
    printf("Hello from rank %d of %d on %s\n", world_rank, world_size, hostname);
    MPI_Finalize();
    return 0;
}
C

pingpong.c A simple test utility that measures the round-trip latency between two ranks.

cat > ~/betty-mpi/pingpong.c <<'C'
#include <mpi.h>
#include <stdio.h>
int main(int argc,char**argv){
  MPI_Init(&argc,&argv);
  int r,s; MPI_Comm_rank(MPI_COMM_WORLD,&r); MPI_Comm_size(MPI_COMM_WORLD,&s);
  if(s<2){ if(!r) fprintf(stderr,"Run with >=2 ranks\n"); MPI_Abort(MPI_COMM_WORLD,1); }
  char b='x'; const int N=10000; MPI_Barrier(MPI_COMM_WORLD);
  double t0=MPI_Wtime();
  /* Ranks 0 and 1 bounce a 1-byte message back and forth N times. */
  for(int i=0;i<N;i++){
    if(r==0){ MPI_Send(&b,1,MPI_CHAR,1,0,MPI_COMM_WORLD); MPI_Recv(&b,1,MPI_CHAR,1,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE); }
    else if(r==1){ MPI_Recv(&b,1,MPI_CHAR,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE); MPI_Send(&b,1,MPI_CHAR,0,0,MPI_COMM_WORLD); }
  }
  /* Average round-trip time per iteration, in microseconds. */
  double us=(MPI_Wtime()-t0)*1e6/N;
  if(r==0) printf("Ping-pong round-trip ~ %.2f us over %d iters\n", us, N);
  MPI_Finalize(); return 0;
}
C

5) Slurm scripts

In these Slurm batch scripts, we load openmpi (CPU nodes) or nvhpc (DGX nodes), configure environment variables, and then run hello_mpi and pingpong as examples. It is important to keep in mind the differences between the architectures being used; for this reason, each job compiles its binaries just in time on the allocated nodes and tags them by node type.

A) Genoa CPU nodes (2 nodes × 2 ranks each = 4 MPI ranks)

cat > ~/betty-mpi/mpi_genoa.sbatch <<'SB'
#!/bin/bash
#SBATCH --job-name=mpi-genoa
#SBATCH --output=slurm-%j.out
#SBATCH --time=00:05:00
#SBATCH --partition=genoa-std-mem
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

module load openmpi/4.1.6

echo "Hosts in allocation:"; scontrol show hostnames $SLURM_JOB_NODELIST
echo "mpicc in use:"; which mpicc

echo ""
echo "Hello world"
echo ""
# C example
mpicc -O2 -march=x86-64-v2 -mtune=generic -o hello_mpi_genoa hello_mpi.c
srun --mpi=pmix_v3 ./hello_mpi_genoa

echo ""
echo "Ping pong"
echo ""
mpicc -O2 -march=x86-64-v2 -mtune=generic -o pingpong_genoa pingpong.c
srun --mpi=pmix_v3 ./pingpong_genoa

SB

B) DGX GPU nodes (2 nodes × 1 rank each = 2 MPI ranks)

cat > ~/betty-mpi/mpi_dgx.sbatch <<'SB'
#!/bin/bash
#SBATCH --job-name=mpi-dgx
#SBATCH --output=slurm-%j.out
#SBATCH --time=00:05:00
#SBATCH --partition=dgx-b200
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4G

module load nvhpc/25.5

echo "Current HPCX env:"; env | grep HPCX

# export UCX_LOG_LEVEL=info
export UCX_NET_DEVICES=mlx5_15:1,mlx5_10:1,mlx5_14:1,mlx5_13:1,mlx5_8:1,mlx5_7:1,mlx5_9:1,mlx5_4:1
export UCX_TLS=rc_x,sm,self
export OMPI_MCA_pml=ucx
ulimit -l unlimited

# Warning: never use mlx5_0 - mlx5_3 - these are internal devices

echo "Hosts in allocation:"; scontrol show hostnames $SLURM_JOB_NODELIST

echo ""
echo "Hello world"
echo ""
mpicc -O2 -o hello_mpi_dgx hello_mpi.c
srun --mpi=pmix ./hello_mpi_dgx

echo ""
echo "Ping pong"
echo ""
mpicc -O2 -o pingpong_dgx pingpong.c
srun --mpi=pmix ./pingpong_dgx

SB

6) Submit & watch

cd ~/betty-mpi
sbatch mpi_genoa.sbatch   # CPU on Genoa
sbatch mpi_dgx.sbatch     # GPU on DGX

squeue -u $USER
# after seeing your JobID:
tail -f slurm-<JobID>.out

You should see lines like:

Hello from rank 0 of 4 on genoa-5-21
Hello from rank 1 of 4 on genoa-5-21
...

Ping pong

Ping-pong round-trip ~ 0.25 us over 10000 iters

Confirm that the pingpong round-trip times are under 10 microseconds.
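To pull those numbers out of every run’s log at once (the scripts write slurm-<JobID>.out files in this directory):

grep "Ping-pong" slurm-*.out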


7) Fabric interfaces

On the DGX nodes, explicitly select the compute fabric interfaces and the UCX transports to use. Never include mlx5_0 through mlx5_3; these are internal devices.

export UCX_NET_DEVICES=mlx5_15:1,mlx5_10:1,mlx5_14:1,mlx5_13:1,mlx5_8:1,mlx5_7:1,mlx5_9:1,mlx5_4:1
export UCX_TLS=rc_x,sm,self
export OMPI_MCA_pml=ucx
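If you are unsure which mlx5 devices exist on a given node, ucx_info (shipped with UCX / HPC-X, available after module load nvhpc) can list them; a minimal check you could drop into the DGX batch script:

# List the network devices and transports UCX can see on this node
ucx_info -d | grep -E "Transport|Device"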

8) Quick re-run later

ssh <PennKey>@slurm_login.parcc.upenn.edu
cd ~/betty-mpi
sbatch mpi_genoa.sbatch
sbatch mpi_dgx.sbatch

Troubleshooting

  • orterun/mpirun not found or mpicc: command not found
    You forgot module load openmpi (see the check below).
  • Partition/account errors
    Confirm your Slurm account:
    sacctmgr show user $USER withassoc | awk 'NR>2 && $2!=""{print $2}' | sort -u
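A quick way to check what is loaded in the current session (standard Environment Modules / Lmod commands):

module list              # what is loaded right now
module avail openmpi     # OpenMPI versions available on Betty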