Welcome to the Google Codelab for using Google Pipelines API to run batch processing jobs. By the end of this codelab you will be able to leverage the Google Genomics Pipelines API to launch parallel batch jobs for scalable batch execution.

What you'll learn

Prerequisites

Codelab-at-a-conference setup

If you see a "Request account" button at the top of the main Codelabs window, click it to obtain a temporary account. Otherwise, ask one of the staff for a coupon with a username and password.

These temporary accounts have existing projects with billing already set up, so there are no costs to you for running this codelab.

Note that all these accounts will be disabled soon after the codelab is over.

Use these credentials to log into the machine or to open a new Google Cloud Console window: https://console.cloud.google.com/. Accept the new account's Terms of Service and any updates to the Terms of Service.

Here's what you should see once logged in:

When presented with this console landing page, please select the only project available. Alternatively, from the console home page, click "Select a Project":

Self-paced environment setup

Create a Project

If you don't already have a Google Account (Gmail or G Suite), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and open the Manage resources page:

Click Create Project.

Enter a project name. Remember the project ID (highlighted in red in the screenshot above). The project ID must be unique across all Google Cloud projects. If your project name is not unique, Google Cloud will generate a random project ID based on the project name.

Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "Conclusion" section at the end of this document). The Google Cloud Platform pricing calculator is available here.

New users of Google Cloud Platform are eligible for a $300 free trial.

Google Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab we will be using Google Cloud Shell, a command line environment running in the Cloud.

Enable API

  1. In the Products & Services menu, click APIs & Services, and then click Enable APIs and Services.
  2. Enable the Genomics API:
     a. In the search field, type "Genomics".
     b. Click Genomics API, and then click Enable.
  3. Enable the Google Container Registry API in the same way.
  4. Enable the Google Cloud Container Builder API.
  5. Enable the Google Compute Engine API (this might be enabled already).
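
If you prefer the command line, the same APIs can also be enabled later from Cloud Shell with gcloud. This is a sketch; the service names below are the usual identifiers for these APIs and are assumed here rather than taken from the codelab:

# enable the required APIs in one command (service names assumed)
$ gcloud services enable genomics.googleapis.com containerregistry.googleapis.com cloudbuild.googleapis.com compute.googleapis.com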

Launch Google Cloud Shell

From the GCP Console click the Cloud Shell icon on the top right toolbar:

Then click Start Cloud Shell:

It should only take a few moments to provision and connect to the environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, greatly enhancing network performance and simplifying authentication. Much, if not all, of your work in this lab can be done simply with a web browser or a Google Chromebook.

Once connected to the cloud shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID:

$ gcloud auth list


Command output:

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
$ gcloud config list project


Command output:

[core]
project = <PROJECT_ID>


If the project ID is not set correctly you can set it with this command:

$ gcloud config set project <PROJECT_ID>

Command output:

Updated property [core/project].
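
If you want the project ID in a shell variable for later commands (Cloud Shell normally sets DEVSHELL_PROJECT_ID for you, but this is a handy fallback), you can capture it like this; the PROJECT_ID variable name is just illustrative:

# store the active project ID for later use (variable name is illustrative)
$ export PROJECT_ID=$(gcloud config get-value project)
$ echo ${PROJECT_ID}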

In High Performance Computing (HPC) and High Throughput Computing (HTC), users often have an application that they need to run in parallel across a large number of nodes. As a demonstration, we will use a prime number generator application that takes a start and end range as inputs and calculates the number of prime numbers within that range. As the range increases, the application takes longer to run. This application is easy to parallelize by running it multiple times with different range values.

The code for this is simple but is illustrative of the process of packaging an application and running it within the Pipelines framework.

Create a Docker Container Specification

In your Cloud Shell session, create a new directory called primes and change into that directory. Open a new file called primes.c:

$ mkdir primes; cd primes
$ vim primes.c

Copy and paste the C application below:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

/* Count the primes in the range [start, max] by simple trial division. */
int main(int argc, char **argv)
{
        int failure = 0;
        if (argc != 3) {
                printf("Usage: primes <start> <max up to 2^31>\n");
                failure = 1;
        } else {
                clock_t start, end;
                double runTime;
                start = clock();
                int i, primes = 0;
                int num = atoi(argv[1]);
                int max = atoi(argv[2]);
                while (num <= max) {
                        /* Find the smallest divisor of num that is >= 2. */
                        i = 2;
                        while (i <= num) {
                                if (num % i == 0)
                                        break;
                                i++;
                        }
                        /* num is prime only if its smallest divisor is itself. */
                        if (i == num)
                                primes++;
                        num++;
                }
                end = clock();
                runTime = (end - start) / (double)CLOCKS_PER_SEC;
                printf
                    ("This machine calculated all %d prime numbers from %s to %s in %g seconds\n",
                     primes, argv[1], argv[2], runTime);
        }
        return failure;
}

Review this program to understand how it works. We will compile this C application with the gcc compiler inside a container, which we define by creating a container specification (a Dockerfile).
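
If you'd like a quick sanity check before containerizing, Cloud Shell already includes gcc, so you can compile and run the program locally. This is an optional step and is not required for the rest of the codelab:

# optional: compile and run locally to verify the program works
$ gcc -o primes primes.c
$ ./primes 0 100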

Open a new file named Dockerfile in the current directory (~/primes):

$ vim Dockerfile

Copy the following contents into our new Dockerfile, replacing <YOUR_NAME> and <YOUR_EMAIL> with your own information:

FROM debian:latest

MAINTAINER <YOUR_NAME> "<YOUR_EMAIL>"

RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y gcc make

################## BEGIN INSTALLATION ######################

WORKDIR /root
ADD primes.c primes.c
RUN gcc -o primes primes.c

##################### INSTALLATION END #####################

# Default job to execute
CMD ["/root/primes", "0", "100"]

# Set default container command
#ENTRYPOINT ["/root/primes"]
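
You can optionally build and test the image locally first; Cloud Shell includes the docker command. The local image name below is just illustrative:

# optional: build and run the image locally (image name is illustrative)
$ docker build -t primes-image .
$ docker run --rm primes-image
# override the default command to use a different range
$ docker run --rm primes-image /root/primes 0 1000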

Now we use Google Container Builder to build an image:

$ gcloud builds submit --tag gcr.io/${DEVSHELL_PROJECT_ID}/primes-image .

This operation can take a few minutes. When it is finished, you will see something like:

DONE
-------------------------------------------------------------------------------------------------------------------------

ID                                    CREATE_TIME                DURATION  SOURCE                                                                                          IMAGES                                              STATUS
239aff96-e378-46c3-916d-8f595a1dd8b7  2018-01-11T23:19:25+00:00  39S       gs://..._cloudbuild/source/1515712764.07-8df022d6ad1a4f269188ce129dfda5d8.tgz  gcr.io/.../primes-image (+1 more)  SUCCESS

Once completed, open the Google Cloud Console and go to the Container Registry under Tools. Select primes-image, and confirm that the image is available in your project with a tag of latest.
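
You can also verify the image from Cloud Shell with gcloud; for example (shown here as a sketch):

# list images in the project's registry and the tags on the new image
$ gcloud container images list --repository=gcr.io/${DEVSHELL_PROJECT_ID}
$ gcloud container images list-tags gcr.io/${DEVSHELL_PROJECT_ID}/primes-image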

Create a New Working Directory and Create a Basic Spec file

In the Cloud Shell session, execute the following commands:

$ mkdir ~/pipelines; cd ~/pipelines

Open a new file named primes-simple.yaml in the current directory:

$ vim primes-simple.yaml

Copy the following contents into your new primes-simple.yaml file, replacing <PROJECT_ID> with your project ID:

name: primes-simple
docker:
  imageName: gcr.io/<PROJECT_ID>/primes-image:latest
  cmd: "/root/primes 0 1000"

This file defines a new pipeline named primes-simple that uses the application container we created in the previous section.

The Pipelines service will provision a Compute Engine node, download the specified Docker image, and run the specified command. In this case, it will count the prime numbers between 0 and 1000.
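
If you'd rather not edit the file by hand, you can substitute the project ID with sed. This assumes the <PROJECT_ID> placeholder is still present in the file exactly as written above:

# replace the placeholder with your actual project ID (assumes the placeholder is unchanged)
$ sed -i "s/<PROJECT_ID>/${DEVSHELL_PROJECT_ID}/" primes-simple.yaml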

Create a GCS Bucket

We will use a GCS bucket to store log files, as well as the inputs and outputs of our programs.

Create a bucket using the gsutil command line:

$ export BUCKET_ID="${DEVSHELL_PROJECT_ID}_bucket"
$ gsutil mb gs://${BUCKET_ID}
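
You can confirm the bucket was created with gsutil; this is a quick optional check:

# list the bucket itself to confirm it exists
$ gsutil ls -b gs://${BUCKET_ID}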

Run an Instance of the Pipeline

Now run the pipeline using the command:

$ gcloud alpha genomics pipelines run --pipeline-file primes-simple.yaml --logging gs://${BUCKET_ID}/primes/logs/primes-simple.log

You must specify a Google Cloud Storage path for storing log files. These include the standard output and standard error of your command, as well as logs from the Pipelines process itself.

When run, this returns an operation name, where the X's stand for the Operation ID:

Running [operations/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX].

You can monitor the progress of this operation using the following command:

$ gcloud alpha genomics operations describe [OPERATION_ID]

The done field shows whether the entire job is complete, and the events field logs the steps taken to launch and run the pipeline.
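
For example, to poll only the done field (the same pattern the script later in this codelab uses), something like the following works:

# print just the done field; it becomes True when the operation finishes
$ gcloud --format='value(done)' alpha genomics operations describe [OPERATION_ID]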

You may also view the status of the pipeline operation in the Google Cloud Console by clicking on the Product Menu in the top left (three horizontal lines), and going to Genomics > Operations. Your operation should be listed towards the top, and will likely still be marked Running.

When the operation is complete, the output files can be retrieved from the Google Cloud Storage bucket. You can see the calculation result in the standard output file primes-simple-stdout.log:

$ gsutil cat gs://${BUCKET_ID}/primes/logs/primes-simple-stdout.log

In the previous section, we ran a single instance through the Pipelines service. The service provisions a node, downloads the Docker image, and runs the command specified in the pipeline specification. While that's useful, we haven't really parallelized the application; we've only migrated it to the cloud.

In this section, we parameterize the pipeline so we can programmatically set the range, and use bash to break a large computation into many small computations that are run through the Pipelines service.

We could also specify additional resource constraints (cores, RAM), request preemptible VMs to save cost, or specify additional regions or zones in which to run. See the documentation linked below for details on how to do that.

Parameterize the Pipeline

First we define additional inputs to the pipeline that we can then specify on the command line. Open a new file named primes-distributed.yaml:

$ vim primes-distributed.yaml

Copy the following contents into the new primes-distributed.yaml, replacing <PROJECT_ID> with your project id:

name: primes-distributed
inputParameters:
  - name: START
    defaultValue: "0"
  - name: END
    defaultValue: "1000"
docker:
  imageName: gcr.io/<PROJECT_ID>/primes-image:latest
  cmd: "/root/primes ${START} ${END}"
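
Before scripting the fan-out, you can test the parameterized pipeline with a single run and explicit inputs; the log file name primes-test.log below is just illustrative:

# single test run of the parameterized pipeline (log name is illustrative)
$ gcloud alpha genomics pipelines run --pipeline-file primes-distributed.yaml --logging gs://${BUCKET_ID}/primes/logs/primes-test.log --inputs START=1,END=1000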

Next we'll create a bash shell script called primes.sh to launch the jobs:

$ vim primes.sh

Copy the following contents into your primes.sh script:

#!/bin/bash

# spinning up tasks
echo "spinning up tasks"

rm -f operations
touch operations

for x in {0..50}; do
      echo -n "."
      gcloud alpha genomics pipelines run --pipeline-file primes-distributed.yaml --logging gs://${BUCKET_ID}/primes/logs/primes$x.log --inputs START=$((x*1000 + 1)),END=$(((x+1)*1000))  >> operations 2>&1
done

echo -e "\ncomplete"

echo "waiting for all jobs to complete"

for op in `cat operations | cut -d / -f 2 | cut -d ']' -f 1`; do
      echo -n "."
      CMD="gcloud --format='value(done)' alpha genomics operations describe $op"
      while [[ $(eval ${CMD}) != "True" ]]; do echo -n "X"; sleep 5; done
done

echo -e "\nall jobs done"

echo "outputs:"
for x in {0..50}; do
      gsutil cat gs://${BUCKET_ID}/primes/logs/primes$x-stdout.log
done

Run the script:

$ chmod u+x primes.sh
$ ./primes.sh

The script submits 51 jobs (one for each value of x from 0 to 50) to the Pipelines service, each with a different range. The service will provision, execute, and destroy the nodes as needed. Since job execution is asynchronous, the script monitors the operations and downloads the outputs only after all of them are complete. This process can take some time, so please be patient.
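
Since each per-range result is a single line, you can total the counts across all of the output files with a little shell. The awk field position below assumes the output format shown earlier ("This machine calculated all N prime numbers ..."):

# sum the prime counts from all 51 output files (field 5 holds the count)
$ for x in {0..50}; do gsutil cat gs://${BUCKET_ID}/primes/logs/primes$x-stdout.log; done | awk '{sum += $5} END {print "total primes found:", sum}'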

Cleaning Up the Deployment

Delete your GCS bucket and its contents (and your project, if you wish). Note that gsutil rb only removes an empty bucket, so use rm -r to delete the log files along with the bucket:

$ gsutil rm -r gs://${BUCKET_ID}

Delete the Project

To clean up completely, simply delete the project.
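
If you created a project specifically for this codelab, you can delete it either from the Manage resources page in the console or from the command line. Be careful: this removes everything in the project:

# delete the entire project (irreversible after the grace period)
$ gcloud projects delete ${DEVSHELL_PROJECT_ID}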

What we've covered

Please spend a few seconds giving us some feedback so we can improve this codelab.