Welcome to the Google Codelab for running a Slurm cluster on Google Cloud Platform! By the end of this codelab you should have a solid understanding of the ease of provisioning and operating an auto-scaling Slurm cluster.

Google Cloud teamed up with SchedMD to release a set of tools that make it easier to launch the Slurm workload manager on Compute Engine, and to expand your existing cluster dynamically when you need extra resources. This integration was built by the experts at SchedMD in accordance with Slurm best practices.

If you're planning on using the Slurm on Google Cloud Platform integrations, or if you have any questions, please consider joining our Google Cloud & Slurm Community Discussion Group!

About Slurm

Basic architectural diagram of a stand-alone Slurm Cluster in Google Cloud Platform.

Slurm is one of the leading workload managers for HPC clusters around the world. Slurm provides an open-source, fault-tolerant, and highly-scalable workload management and job scheduling system for small and large Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions:

1. It allocates exclusive or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

2. It provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.

3. It arbitrates contention for resources by managing a queue of pending work.

What you'll learn

- How to configure and deploy a Slurm cluster on Google Cloud Platform
- How to run a job using Slurm
- How to query cluster information and monitor running jobs in Slurm
- How to auto-scale nodes to accommodate specific job parameters and requirements
- Where to find Slurm support

Self-paced environment setup

Create a Project

If you don't already have a Google Account (Gmail or G Suite), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and open the Manage resources page:

Click Create Project.

Enter a project name. Remember the project ID (highlighted in red in the screenshot above). The project ID must be a unique name across all Google Cloud projects. If your project name is not unique Google Cloud will generate a random project ID based on the project name.

Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "Conclusion" section at the end of this document). The Google Cloud Platform pricing calculator is available here.

New users of Google Cloud Platform are eligible for a $300 free trial.

Google Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab we will be using Google Cloud Shell, a command line environment running in the Cloud.

Launch Google Cloud Shell

From the GCP Console click the Cloud Shell icon on the top right toolbar:

Then click Start Cloud Shell:

It should only take a few moments to provision and connect to the environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on Google Cloud, greatly enhancing network performance and simplifying authentication. Much, if not all, of your work in this lab can be done simply with a web browser or a Google Chromebook.

Once connected to the cloud shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID:

$ gcloud auth list

Command output:

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
$ gcloud config list project

Command output:

project = <PROJECT_ID>

If the project ID is not set correctly you can set it with this command:

$ gcloud config set project <PROJECT_ID>

Command output:

Updated property [core/project].

Download the Slurm Deployment Configuration

In the Cloud Shell session, execute the following command to clone (download) the Git repository that contains the Slurm for Google Cloud Platform deployment-manager files:

git clone https://github.com/SchedMD/slurm-gcp.git

Switch to the Slurm deployment configuration directory by executing the following command:

cd slurm-gcp

Configure Slurm Deployment YAML

This YAML file details the configuration of the deployment, including the Slurm version to deploy and the machine instance types to use. You must modify this file to match your desired configuration.

At a minimum, you must fill in the "default_users" field with the full username(s) (email address) used to log in to Google Cloud Platform. For example, if you log into GCP using "myname@gmail.com", please fill in "myname@gmail.com" in the "default_users" field. If you intend for multiple users to be able to launch jobs on this cluster, add each of their usernames, separated by a comma.
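For example, if two (hypothetical) accounts, myname@gmail.com and teammate@example.com, should both be able to submit jobs, the field would read:

```
default_users           : myname@gmail.com, teammate@example.com
```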

In the Cloud Shell session, open the deployment configuration YAML file slurm-cluster.yaml. You can either use your preferred command line editor (vi, nano, emacs, etc.) or use the Cloud Console Code Editor to view the file contents:

The contents of the file will look like this:

# [START cluster_yaml]
imports:
- path: slurm.jinja

resources:
- name: slurm-cluster
  type: slurm.jinja
  properties:
    cluster_name            : g1
    static_node_count       : 2
    max_node_count          : 10

    zone                    : us-central1-b
    region                  : us-central1
    cidr                    :

    # Optional network configuration fields
    # READ slurm.jinja.schema for prerequisites
    #vpc_net                 : < my-vpc >
    #vpc_subnet              : < my-subnet >
    #shared_vpc_host_proj    : < my-shared-vpc-project-name >

    controller_machine_type : n1-standard-2
    compute_machine_type    : n1-standard-2
    login_machine_type      : n1-standard-2
    #login_node_count        : 0

    # Optional compute configuration fields
    #cpu_platform               : Intel Skylake
    #preemptible_bursting       : False
    #external_compute_ips       : False
    #private_google_access      : True

    #controller_disk_type       : pd-standard
    #controller_disk_size_gb    : 50
    #controller_labels          :
    #     key1 : value1
    #     key2 : value2

    #login_disk_type            : pd-standard
    #login_disk_size_gb         : 10
    #login_labels               :
    #     key1 : value1
    #     key2 : value2

    #compute_disk_type          : pd-standard
    #compute_disk_size_gb       : 10
    #compute_labels             :
    #     key1 : value1
    #     key2 : value2

    #nfs_apps_server            :
    #nfs_apps_dir               : /apps
    #nfs_home_server            :
    #nfs_home_dir               : /home
    #controller_secondary_disk          : True
    #controller_secondary_disk_type     : pd-standard
    #controller_secondary_disk_size_gb  : 300

    # Optional GPU configuration fields
    #gpu_type                   : nvidia-tesla-v100
    #gpu_count                  : 8

    # Optional timer fields
    #suspend_time               : 300

    default_users           : < GCP user email addr, comma separated >
    #slurm_version           : 18.08-latest
    #default_account         : default

#  [END cluster_yaml]

Within this YAML file there are several fields to configure, including the cluster name, the static and maximum node counts, the zone and region, the machine types for each node role, and the default_users list.

Advanced Configuration

If desired you may choose to install additional packages and software as part of the cluster deployment process. You may either do this by adding these packages to the startup-script.py or custom-install scripts, or by building and using an image with the desired software and configuration installed. Currently Slurm uses the Google CentOS 7 image by default.

To add packages to startup-script.py, append additional yum-installable packages to the Python list "packages" in the "install_packages" function. For more complex installation procedures, add new Python functions in the locations marked with the comment "# Add any additional installation functions here".

To use your own image, build one based on a CentOS image with your desired software and configuration. Next, replace the reference to the centos-7 image in the slurm.jinja and resume.py files with your own image, and test the change. In the future a YAML field will be supported for specifying your own image.

Deploy the Configuration

In the Cloud Shell session, execute the following command from the slurm-gcp folder:

gcloud deployment-manager deployments create google1 --config slurm-cluster.yaml

This command creates a deployment named google1. The operation can take a few minutes to complete, so please be patient.
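While you wait, you can check on the deployment's progress from another Cloud Shell tab. This is an optional sketch, assuming the deployment name google1 used above and active gcloud credentials:

```shell
# Show the deployment's resources and their current state:
gcloud deployment-manager deployments describe google1
```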

Once the deployment has completed you will see output similar to:

Create operation operation-1515793351850-5629b244a3810-c9541e28-80863bfb completed successfully.
NAME                           TYPE                 STATE      ERRORS  INTENT
g1-all-internal-firewall-rule  compute.v1.firewall  COMPLETED  []
g1-compute1                    compute.v1.instance  COMPLETED  []
g1-compute2                    compute.v1.instance  COMPLETED  []
g1-controller                  compute.v1.instance  COMPLETED  []
g1-login1                      compute.v1.instance  COMPLETED  []
g1-no-ip-internet-route        compute.v1.route     COMPLETED  []
g1-slurm-network               compute.v1.network   COMPLETED  []
g1-ssh-firewall-rule           compute.v1.firewall  COMPLETED  []

Verify the Deployment

Follow these steps to view the deployment in Google Cloud Platform Console:

With the deployment's configuration verified, let's confirm the cluster's instances are started. In the Cloud Platform Console, open the Products & Services menu and click Compute Engine.

Under VM instances, review the four virtual machine instances created by the deployment manager: g1-login1, g1-controller, g1-compute1, and g1-compute2.

Notice that compute1 and compute2 may or may not have external IPs, according to the "external_compute_ips" field in the deployment YAML. If they have no external IPs, the compute nodes route their traffic through a NAT gateway on the controller node.
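You can also confirm which instances have external IPs from Cloud Shell. This is a sketch assuming the default cluster_name of g1 and active gcloud credentials:

```shell
# The EXTERNAL_IP column is empty for instances without external addresses.
gcloud compute instances list --filter="name ~ ^g1-"
```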

Access the Slurm Cluster

In the Google Cloud Console, click the SSH button next to the g1-login1 instance. Alternatively, execute the following command in Cloud Shell, substituting <ZONE> for the g1-login1 node's zone:

gcloud compute ssh g1-login1 --zone=<ZONE>

This command logs into the login1 virtual machine.

Tour of the Slurm CLI Tools

You're now logged in to your cluster's Slurm login node. This is the node that's dedicated to user/admin interaction, scheduling Slurm jobs, and administrative activity.

Let's run a couple commands to introduce you to the Slurm command line.

Execute the sinfo command to view the status of our cluster's resources:

sinfo

Sample output of sinfo appears below. sinfo reports the nodes available in the cluster, the state of those nodes, and other information like the partition, availability, and any time limitation imposed on those nodes.

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      8  idle~ compute[3-10]
debug*       up   infinite      2   idle compute[1-2]

Since we have allocated two static nodes, compute1 and compute2 are listed as idle. The remaining nodes (compute3 through compute10) are marked "idle~", meaning they are defined but not currently provisioned, ready to be burst to and spun up when jobs require them.

Next, execute the squeue command to view the status of our cluster's queue:

squeue

The expected output of squeue appears below. squeue reports the status of the queue for a cluster. This includes the job ID of each job scheduled on the cluster, the partition the job is assigned to, the name of the job, the user who launched the job, the state of the job, the wall clock time the job has been running, and the nodes the job is allocated to. Since we don't have any jobs running yet, squeue prints only its header line:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

The Slurm commands "srun" and "sbatch" are used to run jobs that are put into the queue: "srun" runs parallel jobs, and can be used as a wrapper for mpirun; "sbatch" submits a batch job to Slurm, and can call srun once or many times in different configurations. "sbatch" can take batch scripts, or can be used with the --wrap option to run the entire job from the command line.
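As a quick illustration of the two styles (a sketch, not part of the codelab's required steps; assumes you are on a node with Slurm available):

```shell
# Interactive: run 'hostname' on 2 nodes and stream output to the terminal.
srun -N 2 hostname

# Batch, without a script file: --wrap turns the quoted string into a job.
# By default the output lands in slurm-<jobid>.out in the current directory.
sbatch -N 2 --wrap="srun hostname"
```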

Let's run a job so we can see Slurm in action and get a job in our queue!

Run a Slurm Job

While logged in to login1, use your preferred text editor to create a new file "hostname_batch":

vi hostname_batch

Copy the following text into the file to create a simple sbatch script:

#!/bin/bash
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=out_%j.txt
#SBATCH --nodes=2

srun hostname
sleep 60

This script defines the Slurm options first in the commented "#SBATCH" lines. First, the execution environment is defined as bash via the "#!/bin/bash" line. The job name is set to "hostname_sleep_sample". The output file is set to "out_%j.txt", where %j is substituted with the Job ID according to the Slurm filename patterns. This output file is written by each compute node to a local directory, in this case the directory the sbatch script is launched from. In our example this is the user's /home folder, an NFS-based shared file system. Finally, the number of nodes this script should run on is set to 2.

After the options are defined the executable commands are provided. This script will run the hostname command in a parallel manner through the srun command, and sleep for 60 seconds afterwards. You may also try modifying the script to execute a few other commands like date or whoami.
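As a sketch of such a modification (the script name hostname_batch2 is our own choice; any name works), you could create a variant that also reports the date and user on every allocated node:

```shell
# Write a variant of the batch script that also runs 'date' and 'whoami'
# on each allocated node. Submit it on the cluster with:
#   sbatch hostname_batch2
cat > hostname_batch2 <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname_sleep_sample2
#SBATCH --output=out_%j.txt
#SBATCH --nodes=2

srun hostname
srun date
srun whoami
sleep 60
EOF

# Check the script's shell syntax before submitting:
bash -n hostname_batch2 && echo "syntax OK"
```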

Execute the sbatch script using the sbatch command line:

sbatch hostname_batch

Running sbatch will return a Job ID for the scheduled job, for example:

Submitted batch job 2

If you see the error "Invalid account or account/partition combination specified", you did not correctly complete the "default_users" field in the YAML. You can add your username to the Slurm accounting database manually with the sacctmgr tool, for example (run with administrative privileges):

sacctmgr add user <username> account=default

We can use the Job ID returned by the sbatch command to track and manage the job execution and resources. Execute the following command to view the Slurm job queue:

squeue

You will likely see the job you executed listed like below:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3     debug hostname username  R       0:10      2 compute[1-2]

You can also execute the sinfo command to view the Slurm cluster info:

sinfo

This will show the nodes listed in squeue in the "alloc" state, marking them as allocated for a job:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      8  idle~ compute[3-10]
debug*       up   infinite      2  alloc compute[1-2]

Once the job is complete it will no longer be listed in squeue, and the "alloc" nodes in sinfo will return to the "idle" state. The output file will have been written to your NFS-shared /home folder. Open or cat the output file (typically out_2.txt) and confirm it contains the hostname of each node the job ran on.


Great work, you've run a job on your Slurm cluster!

Scale a Slurm Cluster

Now let's try to auto-scale a cluster. While logged in to login1, open the hostname_batch script we created earlier. Edit the script to change the nodes field to 4 nodes:

#SBATCH --nodes=4

Then execute the sbatch script using the sbatch command line:

sbatch hostname_batch

This will run the hostname command across 4 nodes, with one task per node, and print the output to the out_3.txt file. Because sbatch submits the job and returns immediately, we can monitor the cluster while it is scaled.

Since we had 2 nodes already statically provisioned, we should see two additional nodes automatically provisioned and added to the Slurm cluster. This takes just a few minutes. The automatic nature of this process has two benefits. First, it eliminates the work typically required in an HPC cluster of manually provisioning nodes, configuring the software, integrating the node into the cluster, and then deploying the job. Second, it saves money because idle, unused nodes are scaled back down until only the minimum number of nodes is running.

While the additional nodes are being provisioned, run sinfo to watch the nodes being allocated:

sinfo

You can also check the VM instances section in the Google Cloud Console to view the newly provisioned nodes. It takes a few minutes to spin up each node and install and configure Slurm before the job is scheduled onto it.

Monitor squeue until the job has completed and is no longer listed:

squeue

Once the job has completed, open or cat the latest output file (typically out_3.txt) and confirm it ran on compute1 through compute4.

After sitting idle for the period set by the SuspendTime field in slurm.conf (exposed as the suspend_time field in the deployment YAML), the dynamically provisioned compute nodes are deallocated to release resources. You can validate this by running sinfo periodically and observing the cluster size fall back to 2.
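One simple way to watch the scale-down from the login node (a sketch; the polling interval is arbitrary):

```shell
# Print the node summary once a minute; press Ctrl-C to stop.
while true; do
  sinfo
  sleep 60
done
```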

Congratulations, you've created a Slurm cluster on Google Cloud Platform and used its latest features to auto-scale your cluster to meet workload demand! You can use this model to run any variety of jobs, and it scales to hundreds of instances in minutes by simply requesting the nodes in Slurm.

If you would like to continue learning to use Slurm on GCP, be sure to continue with the "Building Federated HPC Clusters with Slurm" codelab. This codelab will guide you through setting up two federated Slurm clusters in the cloud, to represent how you might achieve a multi-cluster federation, whether on-premise or in the cloud.

Are you building something cool using Slurm's new GCP-native functionality? Have questions? Have a feature suggestion? Reach out to the Google Cloud team today through Google Cloud's High Performance Computing Solutions website, or chat with us in the Google Cloud & Slurm Discussion Group!

Clean Up the Deployment

Logout of the slurm node:

exit

Let any auto-scaled nodes scale down before deleting the deployment. You can also delete these nodes manually with "gcloud compute instances delete <instance-name>", or by selecting multiple nodes in the Console GUI and clicking "Delete".

With any dynamically provisioned nodes gone, you can easily clean up the deployment by executing the following command from your Google Cloud Shell:

gcloud deployment-manager deployments delete google1

When prompted, type Y to continue. This operation can take a few minutes; please be patient.

Delete the Project

To clean up everything, simply delete the project. You can do this from the Manage resources page in the Cloud Console, or from Cloud Shell with:

gcloud projects delete <PROJECT_ID>

What we've covered

- How to deploy Slurm on Google Cloud Platform
- How to run a job using Slurm
- How to query cluster information and monitor running jobs in Slurm
- How to auto-scale a Slurm cluster to accommodate workload demand

Find Slurm Support

If you need support using these integrations in testing or production environments please contact SchedMD directly using their contact page here: https://www.schedmd.com/contact.php

You may also use SchedMD's Troubleshooting guide here: https://slurm.schedmd.com/troubleshoot.html

Finally you may also post your question to the Google Cloud & Slurm Discussion Group found here: https://groups.google.com/forum/#!forum/google-cloud-slurm-discuss

Learn More


Please submit feedback about this codelab using this link. Feedback takes less than 5 minutes to complete. Thank you!