1. Introduction
Last Updated: 2021-05-05
What you will build
In this codelab, you are going to deploy an auto-scaling High Performance Computing (HPC) cluster on Google Cloud with the Slurm job scheduler. You will use an example Terraform deployment that deploys this cluster with WRF® installed via Spack. Then, you will use this infrastructure to run the CONUS 2.5km benchmark or the CONUS 12km benchmark.
What you will learn
- How to configure Identity and Access Management (IAM) policies for operating an HPC cluster on Google Cloud Platform
- How to deploy a cloud-native HPC cluster with the Slurm job scheduler
- How to run WRF® in parallel on Google Cloud using a Slurm batch job
What you will need
- A Gmail account with an SSH key attached, or a Google Workspace or Cloud Identity account
- Google Cloud Platform Project with Billing enabled
- Project owner role on your GCP Project
- Sufficient Compute Engine Quota (480 c2 vCPUs and 500 GB PD-Standard Disk)
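The 480-vCPU figure can be sanity-checked: the high-quota benchmark later in this codelab runs 480 MPI ranks on c2-standard-60 instances (60 vCPUs each) across 8 nodes. A quick sketch of that arithmetic:

```shell
# Estimate the C2 vCPU quota needed for the CONUS 2.5km benchmark:
# 8 compute nodes of c2-standard-60, 60 vCPUs per node.
NODES=8
VCPUS_PER_NODE=60
REQUIRED_VCPUS=$((NODES * VCPUS_PER_NODE))
echo "Required C2 vCPUs: ${REQUIRED_VCPUS}"   # 480
```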
2. Configuration
Enable Google Cloud APIs
To create and use Google Cloud resources, APIs must be enabled.
gcloud services enable compute.googleapis.com
Set IAM Policies
In HPC, there are clear distinctions between system administrators and system users. System administrators generally have "root access" enabling them to manage and operate compute resources. System users are generally researchers, scientists, and application engineers who only need to use the resources to execute jobs.
On Google Cloud, the OS Login API provisions POSIX user information from Google Workspace, Cloud Identity, and Gmail accounts. Additionally, OS Login integrates with GCP's Identity and Access Management (IAM) system to determine if users should be allowed to escalate privileges on Linux systems.
In this tutorial, we assume you are filling both the system administrator and compute engine administrator roles. We will configure IAM policies to give you sufficient permissions to accomplish the following tasks:
- Create/Delete Google Compute Engine (GCE) VM instances
- SSH into GCE VM instances
To give yourself the necessary IAM roles to complete this tutorial, in the Google Cloud Console:
- Navigate to IAM & Admin > IAM in the Products and Services menu.
- Click "+Add" near the top of the page.
- Type your Google Workspace account, Cloud Identity account, or Gmail account under "New members".
- Add the following roles: Compute Admin, Compute OS Login, and Service Account User.
- Click Save.
Your login now has the permissions required to initiate the creation of the HPC cluster.
To verify you have assigned the correct roles, open your Cloud Shell and run the following command, replacing YOUR_PROJECT and EMAIL_ADDRESS with your project and email address.
$ gcloud projects get-iam-policy YOUR_PROJECT --flatten="bindings[].members" --format='table(bindings.role)' --filter="bindings.members=user:EMAIL_ADDRESS"
This command will yield the output:
ROLE
roles/compute.osLogin
roles/iam.serviceAccountUser
roles/compute.admin
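If you want to check the roles in a script rather than by eye, the verification can be sketched like this. POLICY is a stand-in for the output of the gcloud command above (here it is set to the sample output shown):

```shell
# Check that each required role appears in the IAM policy listing.
# In practice, capture the listing with the gcloud command shown above.
POLICY="roles/compute.osLogin roles/iam.serviceAccountUser roles/compute.admin"
MISSING=0
for role in roles/compute.admin roles/compute.osLogin roles/iam.serviceAccountUser; do
  if ! echo "${POLICY}" | grep -q "${role}"; then
    echo "Missing role: ${role}"
    MISSING=1
  fi
done
[ "${MISSING}" -eq 0 ] && echo "All required roles assigned."
```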
3. Low Quota: Deploy an auto-scaling HPC cluster with Terraform
In this section, you will deploy an auto-scaling HPC cluster including the Slurm job scheduler. This is identical to the High Quota option, except that the machine type is smaller and fewer vCPUs are used.
- Open your Cloud Shell on GCP.
- Clone the FluidNumerics/slurm-gcp repository
cd ~
git clone https://github.com/FluidNumerics/slurm-gcp.git
- Change to the WRF directory:
cd ~/slurm-gcp/tf/examples/wrf
- Create and review a terraform plan. Set the environment variables WRF_NAME, WRF_PROJECT, and WRF_ZONE to specify the name of your cluster, your GCP project, and the zone you want to deploy to.
export WRF_PROJECT=<PROJECT ID>
export WRF_ZONE=<ZONE>
export WRF_NAME="wrf-small"
- The first time you run terraform, you must run the init command:
terraform init
- Create the plan with the make command, which will run terraform:
make plan
- Deploy the cluster. The installation and setup process can take up to 2 hours. During the deployment, WRF and all of its dependencies will be installed.
make apply
- SSH to the login node created in the previous step (probably called wrf-small-login0). You can do this by clicking the SSH button next to that instance in the console under Compute Engine > VM instances.
Option: This pair of gcloud commands will figure out the login node name and SSH into it:
export CLUSTER_LOGIN_NODE=$(gcloud compute instances list --zones ${WRF_ZONE} --filter="name ~ .*login" --format="value(name)" | head -n1)
gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone ${WRF_ZONE}
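The first command simply keeps the first instance whose name matches the login-node pattern. The selection logic can be sketched with plain shell (the instance names below are hypothetical examples of what gcloud might list):

```shell
# Hypothetical instance listing, one name per line, as gcloud would print it.
INSTANCES="wrf-small-compute-0-0
wrf-small-controller
wrf-small-login0"
# Keep only names matching the login-node pattern and take the first match.
CLUSTER_LOGIN_NODE=$(echo "${INSTANCES}" | grep 'login' | head -n1)
echo "${CLUSTER_LOGIN_NODE}"   # wrf-small-login0
```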
- Once you are connected to the login node, verify your cluster setup by checking that the wrf module is available.
$ module load gcc && module load openmpi && module avail
-------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/openmpi/4.0.5-eagetxh/gcc/9.2.0 --------
hdf5/1.10.7    netcdf-c/4.7.4    netcdf-fortran/4.5.3    parallel-netcdf/1.12.1    wrf/4.2
-------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/gcc/9.2.0 --------
hwloc/2.2.0     libiconv/1.16         libpng/1.6.37     nasm/2.15.05           openmpi/4.0.5 (L,D)   time/1.9             zlib/1.2.11
jasper/2.0.16   libjpeg-turbo/2.0.4   libtirpc/1.2.6    ncurses/5.9.20130511   perl/5.16.3           util-macros/1.19.1
krb5/1.15.1     libpciaccess/0.16     libxml2/2.9.10    numactl/2.0.14         tcsh/6.22.02          xz/5.2.2
-------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/Core --------
gcc/9.2.0 (L)
-------- /apps/modulefiles --------
openmpi/v4.1.x
- Verify that /apps/share/conus-12km has the contents listed below.
$ ls -1 /apps/share/conus-12km/
FILE:2018-06-17_00
FILE:2018-06-17_03
FILE:2018-06-17_06
FILE:2018-06-17_09
FILE:2018-06-17_12
geo_em.d01.nc
geogrid.log
met_em.d01.2018-06-17_00:00:00.nc
met_em.d01.2018-06-17_03:00:00.nc
met_em.d01.2018-06-17_06:00:00.nc
met_em.d01.2018-06-17_09:00:00.nc
met_em.d01.2018-06-17_12:00:00.nc
metgrid.log
namelist.input
namelist.wps
ungrib.log
wrfbdy_d01
wrfinput_d01
4. Run the CONUS 12km Benchmark
To run the CONUS 12km benchmark, you will submit a Slurm batch job. The input decks for this benchmark are included in the wrf-gcp VM image under /apps/share/conus-12km.
For this section, you must be connected via SSH to the login node of the cluster.
- Copy the example wrf-conus12.sh batch file from /apps/share:
cp /apps/share/wrf-conus12.sh ~/
- Open wrf-conus12.sh in a text editor to verify that --partition and --ntasks are set correctly. The number of tasks should be set to the number of MPI ranks you want to use to launch the job. For this demonstration, the number of tasks is equivalent to the number of vCPUs used for the job and should not exceed your available quota.
#!/bin/bash
#SBATCH --partition=wrf
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=2g
#SBATCH --cpus-per-task=1
#SBATCH --account=default
#
# ///////////////////////////////////////////////

WORK_PATH=${HOME}/wrf-benchmark/
SRUN_FLAGS="-n $SLURM_NTASKS --cpu-bind=threads"

. /apps/share/spack.sh
module load gcc/9.2.0
module load openmpi
module load hdf5 netcdf-c netcdf-fortran wrf

mkdir -p ${WORK_PATH}
cd ${WORK_PATH}
ln -s ${INSTALL_ROOT}/share/conus-12km/* .
ln -s $(spack location -i wrf)/run/* .

srun $SRUN_FLAGS ./wrf.exe
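With --ntasks=24 and --ntasks-per-node=8, Slurm will need three compute nodes for this job. A quick sketch of that arithmetic, using ceiling division in case the values do not divide evenly:

```shell
# Number of compute nodes Slurm will allocate for the batch job above.
NTASKS=24
NTASKS_PER_NODE=8
NODES=$(( (NTASKS + NTASKS_PER_NODE - 1) / NTASKS_PER_NODE ))  # ceiling division
echo "Nodes required: ${NODES}"   # 3
```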
- Submit the batch job using sbatch.
sbatch wrf-conus12.sh
- Wait for the job to complete. This benchmark is configured to run a 6-hour forecast, which takes about 3 hours to complete with 24 ranks. You can monitor the status of your job with squeue.
- When the job completes, check the contents of rsl.out.0000 to verify that you see the statement "wrf: SUCCESS COMPLETE WRF". WRF writes one rsl.out.* file per MPI rank; rsl.out.0000 is the output from rank 0.
$ tail -n1 ${HOME}/wrf-benchmark/rsl.out.0000 d01 2018-06-17_06:00:00 wrf: SUCCESS COMPLETE WRF
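The same check can be scripted, for example as the last step of a pipeline. This is a sketch: the temporary file below stands in for ${HOME}/wrf-benchmark/rsl.out.0000 and is seeded with the sample log line shown above:

```shell
# Fail-safe check for the WRF success marker in the rank-0 log.
RSL_FILE=$(mktemp)
echo "d01 2018-06-17_06:00:00 wrf: SUCCESS COMPLETE WRF" > "${RSL_FILE}"
if grep -q "SUCCESS COMPLETE WRF" "${RSL_FILE}"; then
  RESULT="success"
else
  RESULT="failure"
fi
echo "WRF run status: ${RESULT}"
rm -f "${RSL_FILE}"
```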
5. High Quota: Deploy an auto-scaling HPC cluster with Terraform
In this section, you will deploy an auto-scaling HPC cluster including the Slurm job scheduler in GCP.
- Open your Cloud Shell on GCP.
- Clone the FluidNumerics/slurm-gcp repository
cd ~
git clone https://github.com/FluidNumerics/slurm-gcp.git
- Change to the WRF directory:
cd ~/slurm-gcp/tf/examples/wrf
- Create and review a terraform plan. Set the environment variables WRF_NAME, WRF_PROJECT, WRF_ZONE, WRF_MAX_NODE, and WRF_MACHINE_TYPE to specify the name of your cluster, your GCP project, the zone you want to deploy to, the maximum number of nodes, and the machine type. For the CONUS 2.5km benchmark, we recommend using c2-standard-60 instances with at least 8 nodes available to run jobs with 480 MPI ranks.
export WRF_PROJECT=<PROJECT ID>
export WRF_ZONE=<ZONE>
export WRF_NAME=wrf-large
export WRF_MAX_NODE=8
export WRF_MACHINE_TYPE="c2-standard-60"
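As noted above, 480 MPI ranks at 60 ranks per c2-standard-60 node means WRF_MAX_NODE must allow at least 8 compute nodes. A sketch of the arithmetic:

```shell
# Minimum node count for the CONUS 2.5km job: 480 ranks, 60 per node.
TOTAL_RANKS=480
RANKS_PER_NODE=60
MIN_NODES=$(( (TOTAL_RANKS + RANKS_PER_NODE - 1) / RANKS_PER_NODE ))
echo "WRF_MAX_NODE must be at least ${MIN_NODES}"   # 8
```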
- If you did not do it above, you must run the init command to initialize terraform:
terraform init
- Create the plan with the make command.
make plan
- Deploy the cluster. The installation and setup process can take up to 2 hours. During the deployment, WRF and all of its dependencies will be installed.
make apply
- SSH to the login node created in the previous step (probably called wrf-large-login0). You can do this by clicking the SSH button next to that instance in the console under Compute Engine > VM instances.
Option: This pair of gcloud commands will figure out the login node name and SSH into it:
export CLUSTER_LOGIN_NODE=$(gcloud compute instances list --zones ${WRF_ZONE} --filter="name ~ .*login" --format="value(name)" | head -n1)
gcloud compute ssh ${CLUSTER_LOGIN_NODE} --zone ${WRF_ZONE}
The second command should result in you being connected to the Slurm Login node.
- Once you are connected to the login node, verify your cluster setup by checking that the wrf module is available.
$ module load gcc && module load openmpi && module avail
-------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/openmpi/4.0.5-eagetxh/gcc/9.2.0 --------
hdf5/1.10.7    netcdf-c/4.7.4    netcdf-fortran/4.5.3    parallel-netcdf/1.12.1    wrf/4.2
-------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/gcc/9.2.0 --------
hwloc/2.2.0     libiconv/1.16         libpng/1.6.37     nasm/2.15.05           openmpi/4.0.5 (L,D)   time/1.9             zlib/1.2.11
jasper/2.0.16   libjpeg-turbo/2.0.4   libtirpc/1.2.6    ncurses/5.9.20130511   perl/5.16.3           util-macros/1.19.1
krb5/1.15.1     libpciaccess/0.16     libxml2/2.9.10    numactl/2.0.14         tcsh/6.22.02          xz/5.2.2
-------- /apps/spack/share/spack/lmod/linux-centos7-x86_64/Core --------
gcc/9.2.0 (L)
-------- /apps/modulefiles --------
openmpi/v4.1.x
- Verify that /apps/share/conus-2.5km has the contents listed below.
$ ls -1 /apps/share/conus-2.5km
FILE:2018-06-17_00
FILE:2018-06-17_03
FILE:2018-06-17_06
FILE:2018-06-17_09
FILE:2018-06-17_12
geo_em.d01.nc
geogrid.log
gfs.0p25.2018061700.f000.grib2
gfs.0p25.2018061700.f003.grib2
gfs.0p25.2018061700.f006.grib2
gfs.0p25.2018061700.f009.grib2
gfs.0p25.2018061700.f012.grib2
met_em.d01.2018-06-17_00:00:00.nc
met_em.d01.2018-06-17_03:00:00.nc
met_em.d01.2018-06-17_06:00:00.nc
met_em.d01.2018-06-17_09:00:00.nc
met_em.d01.2018-06-17_12:00:00.nc
metgrid.log
namelist.input
namelist.wps
ungrib.log
wrfbdy_d01
wrfinput_d01
6. Run the CONUS 2.5km Benchmark
To run the CONUS 2.5km benchmark, you will submit a Slurm batch job. The input decks for this benchmark are included in the wrf-gcp VM image under /apps/share/conus-2.5km.
For this section, you must be connected via SSH to the login node of the cluster.
- Copy the example wrf-conus2p5.sh batch file from /apps/share:
cp /apps/share/wrf-conus2p5.sh ~/
- Open wrf-conus2p5.sh in a text editor to verify that --partition and --ntasks are set correctly. The partition should be set to c2-60. The number of tasks should be set to the number of MPI ranks you want to use to launch the job. For this demonstration, the number of tasks is equivalent to the number of vCPUs used for the job and should not exceed your available quota.
#!/bin/bash
#SBATCH --partition=c2-60
#SBATCH --ntasks=480
#SBATCH --ntasks-per-node=60
#SBATCH --mem-per-cpu=2g
#SBATCH --cpus-per-task=1
#SBATCH --account=default
#
# ///////////////////////////////////////////////

WORK_PATH=${HOME}/wrf-benchmark/
SRUN_FLAGS="-n $SLURM_NTASKS --cpu-bind=threads"

. /apps/share/spack.sh
module load gcc/9.2.0
module load openmpi
module load hdf5 netcdf-c netcdf-fortran wrf

mkdir -p ${WORK_PATH}
cd ${WORK_PATH}
ln -s ${INSTALL_ROOT}/share/conus-2.5km/* .
ln -s $(spack location -i wrf)/run/* .

srun $SRUN_FLAGS ./wrf.exe
- Submit the batch job using sbatch.
sbatch wrf-conus2p5.sh
- Wait for the job to complete. This benchmark is configured to run a 6-hour forecast, which takes about 1 hour to complete with 480 ranks. You can monitor the status of your job with squeue.
- When the job completes, check the contents of rsl.out.0000 to verify that you see the statement "wrf: SUCCESS COMPLETE WRF". WRF writes one rsl.out.* file per MPI rank; rsl.out.0000 is the output from rank 0.
$ tail -n1 ${HOME}/wrf-benchmark/rsl.out.0000 d01 2018-06-17_06:00:00 wrf: SUCCESS COMPLETE WRF
7. Congratulations
In this codelab, you created an auto-scaling, cloud-native HPC cluster and ran a parallel WRF® simulation on Google Cloud Platform!
Cleaning up
To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab:
Delete the project
The easiest way to eliminate billing is to delete the project you created for the codelab.
Caution: Deleting a project has the following effects:
- Everything in the project is deleted. If you used an existing project for this codelab, when you delete it, you also delete any other work you've done in the project.
- Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.
If you plan to explore multiple codelabs and quickstarts, reusing projects can help you avoid exceeding project quota limits.
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete and then click Delete.
- In the dialog, type the project ID and then click Shut down to delete the project.
Delete the individual resources
- Open your Cloud Shell and navigate to the wrf example directory:
cd ~/slurm-gcp/tf/examples/wrf
- Run make destroy to delete all of the resources.
make destroy