Running

The set of MPI tasks running the RRTMG schemes on CPUs and the set of MPI tasks running the rest of the model on GPUs each use their own MPI intracommunicator, and except for the periodic exchange of model state or tendencies (as described in the lagged radiation section), these sets of MPI tasks work independently. Accordingly, two mesh partition files are required to run the model, and two log files are written.

Environment setup

The number of MPI ranks assigned to the GPU and CPU intracommunicators is determined at run time by environment variables. Before running the atmosphere_model executable, the following two environment variables must be set:

  • MPAS_DYNAMICS_RANKS_PER_NODE – the number of MPI ranks per node that will run the dynamics and all non-radiation physics on GPUs.

  • MPAS_RADIATION_RANKS_PER_NODE – the number of MPI ranks per node that will run the RRTMG radiation schemes on CPUs.

At present, both of these environment variables must be set to even integers, since the code assumes that each node has two sockets, across which CPU and GPU tasks should be evenly distributed. The sum of MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE should equal the total number of MPI ranks running on each node as determined by your job script.
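
For example, the Summit job script shown later in this section splits each node's 40 MPI ranks between the two roles as follows:

export MPAS_DYNAMICS_RANKS_PER_NODE=24    # 24 GPU ranks per node (12 per socket)
export MPAS_RADIATION_RANKS_PER_NODE=16   # 16 CPU ranks per node (8 per socket)

With these settings, the job script must place 24 + 16 = 40 MPI ranks on every node.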

Mesh partition files

In general, if the model is to be run across n nodes, then two mesh partition files will be required by the model: one file with (n × MPAS_DYNAMICS_RANKS_PER_NODE) partitions and another file with (n × MPAS_RADIATION_RANKS_PER_NODE) partitions.
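
For example, a two-node run with the per-node values above requires one file with 2 × 24 = 48 partitions and another with 2 × 16 = 32 partitions. Assuming the mesh's graph file is named x1.40962.graph.info (a hypothetical name used only for illustration), both files can be generated with the gpmetis tool from METIS:

gpmetis x1.40962.graph.info 48   # writes x1.40962.graph.info.part.48
gpmetis x1.40962.graph.info 32   # writes x1.40962.graph.info.part.32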

Log files

In the standard release of MPAS-Atmosphere, all MPI ranks participate in all physics and dynamics computation, and by default only MPI rank 0 writes a log file. When different partitions of the global MPI communicator perform different roles – either radiation computation on CPUs, or non-radiation physics and dynamics on GPUs – MPI rank 0 of each role's intracommunicator creates its own log file.

The log.atmosphere.role01.0000.out file contains log messages from MPI rank 0 of the intracommunicator running on GPUs, while the log.atmosphere.role02.0000.out file contains log messages from MPI rank 0 of the intracommunicator running on CPUs.
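
Both log files can be followed during a run with standard tools, for example:

tail -f log.atmosphere.role01.0000.out log.atmosphere.role02.0000.out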

ORNL Summit

On the ORNL Summit machine, the following job script may be used as a starting point:

#!/bin/bash
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -nnodes 2
#BSUB -alloc_flags "gpumps smt1"
#BSUB -P XXXXX
#BSUB -J MPAS_GPU
#BSUB -q batch
#BSUB -W 5

export OMP_NUM_THREADS=1
export OMP_STACKSIZE=64M

export PAMI_ENABLE_STRIPING=0
export PAMI_IBV_ENABLE_OOO_AR=1
export PAMI_IBV_QP_SERVICE_LEVEL=8
export PAMI_IBV_ADAPTER_AFFINITY=1
export PAMI_IBV_DEVICE_NAME="mlx5_0:1"
export PAMI_IBV_DEVICE_NAME_1="mlx5_3:1"

export MPAS_DYNAMICS_RANKS_PER_NODE=24
export MPAS_RADIATION_RANKS_PER_NODE=16

jsrun --smpiargs="-gpu" --stdio_mode=prepended --nrs 2 --cpu_per_rs 42 --gpu_per_rs 6 -d plane:40 --bind packed:1 --np 80 ./unset.sh ./atmosphere_model
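
In this command, each of the two resource sets (--nrs 2) spans a full node with 42 CPU cores and six GPUs, -d plane:40 places 40 consecutive ranks on each node, and --np 80 launches 80 ranks in total, matching the per-node sum of MPAS_DYNAMICS_RANKS_PER_NODE and MPAS_RADIATION_RANKS_PER_NODE across the two nodes.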

The unset.sh wrapper script referenced in the jsrun command clears any CUDA_VISIBLE_DEVICES settings, selects the InfiniBand adapter nearest each rank's socket, and binds each rank to its own range of hardware threads before launching the model. It contains the following:

#!/bin/bash

# Clear CUDA_VISIBLE_DEVICES, and any numbered per-rank variants, so that
# the environment does not constrain which GPUs each rank can see
unset CUDA_VISIBLE_DEVICES

for i in $(seq 0 39); do
  unset "CUDA_VISIBLE_DEVICES${i}"
done

export PAMI_ENABLE_STRIPING=0
export PAMI_IBV_ENABLE_OOO_AR=1
export PAMI_IBV_QP_SERVICE_LEVEL=8
export PAMI_IBV_ADAPTER_AFFINITY=1

# Summit nodes provide 42 usable cores (SMT4 gives 168 hardware threads);
# the second socket's hardware threads are numbered starting at 88
let available_cpus=168
let socket1_start=88

lrank=${OMPI_COMM_WORLD_LOCAL_RANK}   # this rank's index on its node
lsize=${OMPI_COMM_WORLD_LOCAL_SIZE}   # number of ranks on this node
let x2rank=2*$lrank
let socket=$x2rank/$lsize             # 0 for the first half of ranks, 1 for the second
let cpus_per_rank=$available_cpus/$lsize

let num_threads=$OMP_NUM_THREADS
let stride=$cpus_per_rank/$num_threads   # spacing of OpenMP threads within the rank's CPU range

if [ "$socket" = 0 ]; then
  # Ranks in the first half of the node use socket 0 and its InfiniBand adapter
  export PAMI_IBV_DEVICE_NAME=mlx5_0:1
  let start_cpu=$lrank*$cpus_per_rank
  let stop_cpu=$start_cpu+$cpus_per_rank-1
else
  # Ranks in the second half of the node use socket 1 and its InfiniBand adapter
  export PAMI_IBV_DEVICE_NAME=mlx5_3:1
  let half_size=$lsize/2
  let srank=$lrank-$half_size   # rank index within the second socket
  let start_cpu=$socket1_start+$srank*$cpus_per_rank
  let stop_cpu=$start_cpu+$cpus_per_rank-1
fi

# Pin OpenMP threads to evenly spaced hardware threads in this rank's range
export GOMP_CPU_AFFINITY=$start_cpu-$stop_cpu:$stride

# Bind the rank to its hardware-thread range and launch the model
exec taskset -c $start_cpu-$stop_cpu "$@"
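
For example, with the 40 ranks per node launched by the job script above, each rank receives 168/40 = 4 hardware threads: local rank 0 is bound to threads 0-3, local rank 19 to threads 76-79, local rank 20 (the first rank assigned to the second socket) to threads 88-91, and local rank 39 to threads 164-167.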