.. role:: zero

Running
=======

The set of MPI tasks running the RRTMG schemes on CPUs and the set of MPI
tasks running the rest of the model on GPUs each use their own MPI
intracommunicator, and except for the periodic exchange of model state or
tendencies (as described in the :ref:`lagged radiation` section), these sets
of MPI tasks work independently. Accordingly, two mesh partition files are
required to run the model, and two log files are written.

Environment setup
-----------------

The number of MPI ranks to assign to the GPU and CPU intracommunicators is
determined at run time by environment variables. Before running the
``atmosphere_model`` executable, the following two environment variables must
first be set:

* ``MPAS_DYNAMICS_RANKS_PER_NODE`` -- the number of MPI ranks per node that
  will execute the model on GPUs.

* ``MPAS_RADIATION_RANKS_PER_NODE`` -- the number of MPI ranks per node that
  will run the RRTMG radiation schemes on CPUs.

At present, both of these environment variables must be set to even integers,
since the code assumes that each node has two sockets, across which CPU and
GPU tasks should be evenly distributed. The sum of
*MPAS_DYNAMICS_RANKS_PER_NODE* and *MPAS_RADIATION_RANKS_PER_NODE* should
equal the total number of MPI ranks running on each node as determined by
your job script.

Mesh partition files
--------------------

In general, if the model is to be run across *n* nodes, then two mesh
partition files will be required by the model: one file with
(*n* |times| *MPAS_DYNAMICS_RANKS_PER_NODE*) partitions and another file with
(*n* |times| *MPAS_RADIATION_RANKS_PER_NODE*) partitions.

.. |times| unicode:: U+00D7
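For example (a sketch only; the mesh file name ``x1.40962.graph.info`` is an
assumption, while the rank counts match the Summit job script shown below), a
two-node run with 24 dynamics ranks and 16 radiation ranks per node requires
one 48-way and one 32-way partition file, which METIS's ``gpmetis`` tool can
produce from the mesh's graph file:

.. code-block:: bash

    export MPAS_DYNAMICS_RANKS_PER_NODE=24     # ranks per node running on GPUs
    export MPAS_RADIATION_RANKS_PER_NODE=16    # ranks per node running RRTMG on CPUs

    # Two nodes: 2 x 24 = 48 dynamics partitions, 2 x 16 = 32 radiation partitions
    gpmetis x1.40962.graph.info 48    # writes x1.40962.graph.info.part.48
    gpmetis x1.40962.graph.info 32    # writes x1.40962.graph.info.part.32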
Log files
---------

In the standard release of MPAS-Atmosphere, all MPI ranks participate in all
physics and dynamics computation, and only MPI rank :zero:`0` writes out a
log file by default. When different partitions of the global MPI communicator
perform different roles -- either radiation computation on CPUs, or
non-radiation physics and dynamics on GPUs -- MPI rank :zero:`0` from the
intracommunicator used by each of the two roles will create a log file. The
``log.atmosphere.role01.0000.out`` file contains log messages from MPI rank
:zero:`0` of the intracommunicator running on GPUs, while the
``log.atmosphere.role02.0000.out`` file contains log messages from MPI rank
:zero:`0` of the intracommunicator running on CPUs.

ORNL Summit
-----------

On the ORNL Summit_ machine, the following job script may be used as a
starting point:

.. code-block:: bash

    #!/bin/bash
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -nnodes 2
    #BSUB -alloc_flags "gpumps smt1"
    #BSUB -P XXXXX
    #BSUB -J MPAS_GPU
    #BSUB -q batch
    #BSUB -W 5

    export OMP_NUM_THREADS=1
    export OMP_STACKSIZE=64M

    export PAMI_ENABLE_STRIPING=0
    export PAMI_IBV_ENABLE_OOO_AR=1
    export PAMI_IBV_QP_SERVICE_LEVEL=8
    export PAMI_IBV_ADAPTER_AFFINITY=1
    export PAMI_IBV_DEVICE_NAME="mlx5_0:1"
    export PAMI_IBV_DEVICE_NAME_1="mlx5_3:1"

    export MPAS_DYNAMICS_RANKS_PER_NODE=24
    export MPAS_RADIATION_RANKS_PER_NODE=16

    jsrun --smpiargs="-gpu" --stdio_mode=prepended --nrs 2 --cpu_per_rs 42 \
          --gpu_per_rs 6 -d plane:40 --bind packed:1 --np 80 \
          ./unset.sh ./atmosphere_model

Note that the 80 total ranks requested by ``jsrun`` correspond to the two
allocated nodes times the 40 ranks per node implied by the sum of
*MPAS_DYNAMICS_RANKS_PER_NODE* and *MPAS_RADIATION_RANKS_PER_NODE*.

The ``unset.sh`` script referenced in the ``jsrun`` command contains the
following:

.. code-block:: bash

    #!/bin/bash

    # Clear CUDA_VISIBLE_DEVICES and any per-rank variants of it set by the
    # job launcher
    unset CUDA_VISIBLE_DEVICES
    for i in $(seq 0 39); do
        unset CUDA_VISIBLE_DEVICES${i}
    done

    export PAMI_ENABLE_STRIPING=0
    export PAMI_IBV_ENABLE_OOO_AR=1
    export PAMI_IBV_QP_SERVICE_LEVEL=8
    export PAMI_IBV_ADAPTER_AFFINITY=1

    # Logical CPUs usable per node, and the first logical CPU on the
    # second socket
    let available_cpus=168
    let socket1_start=88

    lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
    lsize=${OMPI_COMM_WORLD_LOCAL_SIZE}

    # Assign the first half of the node-local ranks to socket 0 and the
    # second half to socket 1
    let x2rank=2*$lrank
    let socket=$x2rank/$lsize
    let cpus_per_rank=$available_cpus/$lsize
    let num_threads=$OMP_NUM_THREADS
    let stride=$cpus_per_rank/$num_threads

    # Select the InfiniBand adapter nearest this rank's socket and compute
    # the rank's range of logical CPUs
    if [ "$socket" = 0 ]; then
        export PAMI_IBV_DEVICE_NAME=mlx5_0:1
        let start_cpu=$lrank*$cpus_per_rank
        let stop_cpu=$start_cpu+$cpus_per_rank-1
    else
        export PAMI_IBV_DEVICE_NAME=mlx5_3:1
        let half_size=$lsize/2
        let srank=$lrank-$half_size
        let start_cpu=$socket1_start+$srank*$cpus_per_rank
        let stop_cpu=$start_cpu+$cpus_per_rank-1
    fi

    export GOMP_CPU_AFFINITY=$start_cpu-$stop_cpu:$stride

    # Launch the command given as arguments (here, atmosphere_model), pinned
    # to the computed range of logical CPUs
    exec taskset -c $start_cpu-$stop_cpu "$@"

.. _Summit: https://www.olcf.ornl.gov/summit/
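As a brief usage note (the batch script file name below is illustrative),
``unset.sh`` must be executable, since ``jsrun`` runs it directly as a
wrapper around the model executable:

.. code-block:: bash

    chmod +x unset.sh
    bsub < job_mpas.sh    # submit the LSF job script shown above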