Run the WRF Weather Forecasting Model with Fluid Numerics' Slurm-GCP

1. Introduction

[Image: Continental US]

Last Updated: 2021-05-05

What you will build

In this codelab, you are going to deploy an auto-scaling High Performance Computing (HPC) cluster on Google Cloud with the Slurm job scheduler. You will use an example Terraform deployment that deploys this cluster with WRF® installed via Spack. Then, you will use this infrastructure to run the CONUS 2.5km benchmark or the CONUS 12km benchmark.

What you will learn

  • How to configure Identity and Access Management (IAM) policies for operating an HPC cluster on Google Cloud Platform
  • How to deploy a cloud-native HPC cluster with the Slurm job scheduler
  • How to run WRF® in parallel on Google Cloud using a Slurm batch job

What you will need

  • A Google Cloud Platform project with billing enabled
  • A Google Workspace, Cloud Identity, or Gmail account

2. Configuration

Enable Google Cloud APIs

To create and use Google Cloud resources, APIs must be enabled.

gcloud services enable compute.googleapis.com 
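To confirm the API is enabled before moving on, you can list the project's enabled services from Cloud Shell:

gcloud services list --enabled | grep compute.googleapis.com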

Set IAM Policies

In HPC, there are clear distinctions between system administrators and system users. System administrators generally have "root access" enabling them to manage and operate compute resources. System users are generally researchers, scientists, and application engineers who only need to use the resources to execute jobs.

On Google Cloud, the OS Login API provisions POSIX user information from Google Workspace, Cloud Identity, and Gmail accounts. Additionally, OS Login integrates with GCP's Identity and Access Management (IAM) system to determine if users should be allowed to escalate privileges on Linux systems.
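If OS Login is not already active on your project (the Slurm-GCP deployment may also configure this for you), one way to enable it project-wide is through instance metadata:

gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE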

In this tutorial, we assume you are filling the system administrator and compute engine administrator roles. We will configure IAM policies to give you sufficient permissions to accomplish the following tasks:

  • Create/Delete Google Compute Engine (GCE) VM instances
  • SSH into GCE VM instances


To give yourself the necessary IAM roles to complete this tutorial, in the Google Cloud Console:

  1. Navigate to IAM & Admin > IAM in the Products and Services menu.
  2. Click "+Add" near the top of the page.
  3. Type your Google Workspace account, Cloud Identity account, or Gmail account under "New members".
  4. Add the following roles: Compute Admin, Compute OS Login, and Service Account User.
  5. Click Save.

Your login now has the permissions required to initiate the creation of the HPC cluster.
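If you prefer the command line to the console steps above, the same roles can be granted with gcloud, again replacing YOUR_PROJECT and EMAIL_ADDRESS with your project and email address:

gcloud projects add-iam-policy-binding YOUR_PROJECT --member="user:EMAIL_ADDRESS" --role="roles/compute.admin"
gcloud projects add-iam-policy-binding YOUR_PROJECT --member="user:EMAIL_ADDRESS" --role="roles/compute.osLogin"
gcloud projects add-iam-policy-binding YOUR_PROJECT --member="user:EMAIL_ADDRESS" --role="roles/iam.serviceAccountUser"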

To verify you have assigned the correct roles, open your Cloud Shell, and run the following command, replacing YOUR_PROJECT and EMAIL_ADDRESS with your project and email address.

$ gcloud projects get-iam-policy YOUR_PROJECT --flatten="bindings[].members" --format='table(bindings.role)' --filter="bindings.members=user:EMAIL_ADDRESS"

This command should yield output like:

ROLE
roles/compute.osLogin
roles/iam.serviceAccountUser
roles/compute.admin

3. Low Quota: Deploy an auto-scaling HPC cluster with Terraform

In this section, you will deploy an auto-scaling HPC cluster including the Slurm job scheduler. This deployment is identical to the High Quota option, except that it uses a smaller machine type and fewer vCPUs.

  1. Open your Cloud Shell on GCP.
  2. Clone the FluidNumerics/slurm-gcp repository
cd ~
git clone https://github.com/FluidNumerics/slurm-gcp.git
  3. Change to the WRF directory:
cd  ~/slurm-gcp/tf/examples/wrf
  4. Create and review a terraform plan. Set the environment variables WRF_NAME, WRF_PROJECT, and WRF_ZONE to specify the name of your cluster, your GCP project, and the zone you want to deploy to.
export WRF_PROJECT=<PROJECT ID>
export WRF_ZONE=<ZONE>
export WRF_NAME="wrf-small"
  5. The first time you run terraform, you must run the init command:
terraform init
  6. Create the plan with the make command, which will run terraform:
make plan
  7. Deploy the cluster. The installation and setup process can take up to 2 hours. During the deployment, WRF and all of its dependencies will be installed.
make apply
  8. SSH to the login node created in the previous step (probably named wrf-small-login0). You can do this by clicking the SSH button next to the instance in the Cloud Console under Compute Engine -> VM instances.

Option: This pair of gcloud commands will figure out the login node name and SSH into it:

export CLUSTER_LOGIN_NODE=$(gcloud compute instances list --zones ${WRF_ZONE} --filter="name ~ .*login" --format="value(name)" | head -n1)

gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone ${WRF_ZONE}

  9. Once you are connected to the login node, verify your cluster setup by checking that the wrf module is available.
$ module load gcc && module load openmpi && module avail
-------------------------------------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/openmpi/4.0.5-eagetxh/gcc/9.2.0 --------------------------------------
   hdf5/1.10.7    netcdf-c/4.7.4    netcdf-fortran/4.5.3    parallel-netcdf/1.12.1    wrf/4.2

------------------------------------------------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/gcc/9.2.0 -------------------------------------------------
   hwloc/2.2.0      libiconv/1.16          libpng/1.6.37     nasm/2.15.05            openmpi/4.0.5 (L,D)    time/1.9              zlib/1.2.11
   jasper/2.0.16    libjpeg-turbo/2.0.4    libtirpc/1.2.6    ncurses/5.9.20130511    perl/5.16.3            util-macros/1.19.1
   krb5/1.15.1      libpciaccess/0.16      libxml2/2.9.10    numactl/2.0.14          tcsh/6.22.02           xz/5.2.2

--------------------------------------------------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/Core ----------------------------------------------------
   gcc/9.2.0 (L)

---------------------------------------------------------------------- /apps/modulefiles ----------------------------------------------------------------------
   openmpi/v4.1.x
  10. Verify that /apps/share/conus-12km has the contents listed below.
$  ls -1 /apps/share/conus-12km/
FILE:2018-06-17_00
FILE:2018-06-17_03
FILE:2018-06-17_06
FILE:2018-06-17_09
FILE:2018-06-17_12
geo_em.d01.nc
geogrid.log
met_em.d01.2018-06-17_00:00:00.nc
met_em.d01.2018-06-17_03:00:00.nc
met_em.d01.2018-06-17_06:00:00.nc
met_em.d01.2018-06-17_09:00:00.nc
met_em.d01.2018-06-17_12:00:00.nc
metgrid.log
namelist.input
namelist.wps
ungrib.log
wrfbdy_d01
wrfinput_d01
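As an optional sanity check of the Slurm configuration, you can also list the cluster's partitions and node states from the login node. The wrf partition used by the batch script in the next section should appear in the output:

$ sinfo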

4. Run the CONUS 12km Benchmark

To run the CONUS 12km benchmark, you will submit a Slurm batch job. The input decks for this benchmark are included in the wrf-gcp VM image under /apps/share/conus-12km.

For this section, you must be connected via SSH to the cluster's login node.

  1. Copy the example wrf-conus12.sh batch script from /apps/share:
cp /apps/share/wrf-conus12.sh ~/
  2. Open wrf-conus12.sh in a text editor to verify that --partition and --ntasks are set correctly. The number of tasks should be set to the number of MPI ranks you want to use to launch the job. For this demonstration, the number of tasks is equivalent to the number of vCPUs used for the job, and should not exceed your available quota.
#!/bin/bash
#SBATCH --partition=wrf
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=2g
#SBATCH --cpus-per-task=1
#SBATCH --account=default
#
# /////////////////////////////////////////////// #

# Run WRF from a scratch directory under your home directory
WORK_PATH=${HOME}/wrf-benchmark/
SRUN_FLAGS="-n $SLURM_NTASKS --cpu-bind=threads"

# Load the Spack environment and the modules for WRF and its dependencies
. /apps/share/spack.sh
module load gcc/9.2.0
module load openmpi
module load hdf5 netcdf-c netcdf-fortran wrf

# Link the CONUS 12km input decks and WRF's run-time files into the work path
# (INSTALL_ROOT is expected to be set by spack.sh above)
mkdir -p ${WORK_PATH}
cd ${WORK_PATH}
ln -s ${INSTALL_ROOT}/share/conus-12km/* .
ln -s $(spack location -i wrf)/run/* .

srun $SRUN_FLAGS ./wrf.exe
  3. Submit the batch job using sbatch.
sbatch wrf-conus12.sh
  4. Wait for the job to complete. This benchmark is configured to run a 6-hour forecast, which takes about 3 hours to complete with 24 ranks. You can monitor the status of your job with squeue (see the monitoring example at the end of this section).
  5. When the job completes, check the contents of rsl.out.0000 to verify that you see the statement "wrf: SUCCESS COMPLETE WRF". WRF writes one rsl.out file per MPI rank; the 0000 suffix corresponds to rank 0.
$ tail -n1 ${HOME}/wrf-benchmark/rsl.out.0000
d01 2018-06-17_06:00:00 wrf: SUCCESS COMPLETE WRF
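While the job runs, you can keep an eye on it with standard Slurm and shell commands; for example:

# Show your queued and running jobs
squeue -u $USER
# Follow WRF's rank 0 log as the forecast advances
tail -f ${HOME}/wrf-benchmark/rsl.out.0000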

5. High Quota: Deploy an auto-scaling HPC cluster with Terraform

In this section, you will deploy an auto-scaling HPC cluster including the Slurm job scheduler in GCP.

  1. Open your Cloud Shell on GCP.
  2. Clone the FluidNumerics/slurm-gcp repository
cd ~
git clone https://github.com/FluidNumerics/slurm-gcp.git
  3. Change to the WRF directory:
cd  ~/slurm-gcp/tf/examples/wrf
  4. Create and review a terraform plan. Set the environment variables WRF_NAME, WRF_PROJECT, WRF_ZONE, WRF_MAX_NODE, and WRF_MACHINE_TYPE to specify the name of your cluster, your GCP project, the zone you want to deploy to, the maximum number of nodes, and the machine type. For the CONUS 2.5km benchmark, we recommend using c2-standard-60 instances with at least 8 nodes available to run jobs with 480 MPI ranks (8 nodes × 60 vCPUs per node = 480 ranks).
export WRF_PROJECT=<PROJECT ID>
export WRF_ZONE=<ZONE>
export WRF_NAME=wrf-large
export WRF_MAX_NODE=8
export WRF_MACHINE_TYPE="c2-standard-60"
  5. If you have not already done so, initialize terraform by running the init command:
terraform init
  6. Create the plan with the make command.
make plan
  7. Deploy the cluster. The installation and setup process can take up to 2 hours. During the deployment, WRF and all of its dependencies will be installed.
make apply
  8. SSH to the login node created in the previous step (probably named wrf-large-login0). You can do this by clicking the SSH button next to the instance in the Cloud Console under Compute Engine -> VM instances.

Option: This pair of gcloud commands will figure out the login node name and SSH into it:

export CLUSTER_LOGIN_NODE=$(gcloud compute instances list --zones ${WRF_ZONE} --filter="name ~ .*login" --format="value(name)" | head -n1)

gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone ${WRF_ZONE}

The second command should connect you to the Slurm login node.

  9. Once you are connected to the login node, verify your cluster setup by checking that the wrf module is available.
$ module load gcc && module load openmpi && module avail
-------------------------------------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/openmpi/4.0.5-eagetxh/gcc/9.2.0 --------------------------------------
   hdf5/1.10.7    netcdf-c/4.7.4    netcdf-fortran/4.5.3    parallel-netcdf/1.12.1    wrf/4.2

------------------------------------------------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/gcc/9.2.0 -------------------------------------------------
   hwloc/2.2.0      libiconv/1.16          libpng/1.6.37     nasm/2.15.05            openmpi/4.0.5 (L,D)    time/1.9              zlib/1.2.11
   jasper/2.0.16    libjpeg-turbo/2.0.4    libtirpc/1.2.6    ncurses/5.9.20130511    perl/5.16.3            util-macros/1.19.1
   krb5/1.15.1      libpciaccess/0.16      libxml2/2.9.10    numactl/2.0.14          tcsh/6.22.02           xz/5.2.2

--------------------------------------------------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/Core ----------------------------------------------------
   gcc/9.2.0 (L)

---------------------------------------------------------------------- /apps/modulefiles ----------------------------------------------------------------------
   openmpi/v4.1.x
  10. Verify that /apps/share/conus-2.5km has the contents listed below.
$ ls -1 /apps/share/conus-2.5km
FILE:2018-06-17_00
FILE:2018-06-17_03
FILE:2018-06-17_06
FILE:2018-06-17_09
FILE:2018-06-17_12
geo_em.d01.nc
geogrid.log
gfs.0p25.2018061700.f000.grib2
gfs.0p25.2018061700.f003.grib2
gfs.0p25.2018061700.f006.grib2
gfs.0p25.2018061700.f009.grib2
gfs.0p25.2018061700.f012.grib2
met_em.d01.2018-06-17_00:00:00.nc
met_em.d01.2018-06-17_03:00:00.nc
met_em.d01.2018-06-17_06:00:00.nc
met_em.d01.2018-06-17_09:00:00.nc
met_em.d01.2018-06-17_12:00:00.nc
metgrid.log
namelist.input
namelist.wps
ungrib.log
wrfbdy_d01
wrfinput_d01
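Before submitting a 480-rank job in the next section, it is worth confirming that your C2 vCPU quota in the deployment region covers 8 c2-standard-60 instances (480 vCPUs). One way to check from Cloud Shell, using us-central1 as an example region:

# Print the limit, metric name, and current usage for the C2_CPUS quota
gcloud compute regions describe us-central1 | grep -B1 -A1 "metric: C2_CPUS"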

6. Run the CONUS 2.5km Benchmark

To run the CONUS 2.5km benchmark, you will submit a Slurm batch job. The input decks for this benchmark are included in the wrf-gcp VM image under /apps/share/conus-2.5km.

For this section, you must be connected via SSH to the cluster's login node.

  1. Copy the example wrf-conus2p5.sh batch script from /apps/share:
cp /apps/share/wrf-conus2p5.sh ~/
  2. Open wrf-conus2p5.sh in a text editor to verify that --partition and --ntasks are set correctly. The partition should be set to c2-60. The number of tasks should be set to the number of MPI ranks you want to use to launch the job. For this demonstration, the number of tasks is equivalent to the number of vCPUs used for the job, and should not exceed your available quota.
#!/bin/bash
#SBATCH --partition=c2-60
#SBATCH --ntasks=480
#SBATCH --ntasks-per-node=60
#SBATCH --mem-per-cpu=2g
#SBATCH --cpus-per-task=1
#SBATCH --account=default
#
# /////////////////////////////////////////////// #

# Run WRF from a scratch directory under your home directory
WORK_PATH=${HOME}/wrf-benchmark/
SRUN_FLAGS="-n $SLURM_NTASKS --cpu-bind=threads"

# Load the Spack environment and the modules for WRF and its dependencies
. /apps/share/spack.sh
module load gcc/9.2.0
module load openmpi
module load hdf5 netcdf-c netcdf-fortran wrf

# Link the CONUS 2.5km input decks and WRF's run-time files into the work path
# (INSTALL_ROOT is expected to be set by spack.sh above)
mkdir -p ${WORK_PATH}
cd ${WORK_PATH}
ln -s ${INSTALL_ROOT}/share/conus-2.5km/* .
ln -s $(spack location -i wrf)/run/* .

srun $SRUN_FLAGS ./wrf.exe
  3. Submit the batch job using sbatch.
sbatch wrf-conus2p5.sh
  4. Wait for the job to complete. This benchmark is configured to run a 6-hour forecast, which takes about 1 hour to complete with 480 ranks. You can monitor the status of your job with squeue (see the timing example at the end of this section).
  5. When the job completes, check the contents of rsl.out.0000 to verify that you see the statement "wrf: SUCCESS COMPLETE WRF". WRF writes one rsl.out file per MPI rank; the 0000 suffix corresponds to rank 0.
$ tail -n1 ${HOME}/wrf-benchmark/rsl.out.0000
d01 2018-06-17_06:00:00 wrf: SUCCESS COMPLETE WRF
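After the job finishes, Slurm's accounting tools can report the elapsed wall time, which is handy for comparing benchmark runs. Replace <JOBID> with the job ID printed by sbatch:

sacct -j <JOBID> --format=JobID,JobName,Elapsed,State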

7. Congratulations

In this codelab, you created an auto-scaling, cloud-native HPC cluster and ran a parallel WRF® simulation on Google Cloud Platform!

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab:

Delete the project

The easiest way to eliminate billing is to delete the project you created for the codelab.

Caution: Deleting a project has the following effects:

  • Everything in the project is deleted. If you used an existing project for this codelab, when you delete it, you also delete any other work you've done in the project.
  • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple codelabs and quickstarts, reusing projects can help you avoid exceeding project quota limits.

  1. In the Cloud Console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.
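Alternatively, you can delete the project from Cloud Shell with gcloud. This is irreversible, so double-check the project ID first:

gcloud projects delete YOUR_PROJECT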

Delete the individual resources

  1. Open your Cloud Shell and navigate to the wrf example directory:
cd  ~/slurm-gcp/tf/examples/wrf
  2. Run make destroy to delete all of the resources.
make destroy
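Once the destroy completes, you can confirm that no cluster VMs remain by listing instances whose names match your cluster name:

gcloud compute instances list --filter="name ~ ${WRF_NAME}"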