Provisioning and Using a Managed Hadoop/Spark Cluster with Cloud Dataproc (Command Line)

About this codelab

Last updated May 3, 2022

Written by a Googler

1. Overview

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

This tutorial is adapted from https://cloud.google.com/dataproc/overview

What you'll learn

How to create a managed Cloud Dataproc cluster (with Apache Spark pre-installed)
How to submit a Spark job
How to resize a cluster
How to ssh into the master node of a Dataproc cluster
How to use gcloud to examine clusters, jobs, and firewall rules
How to shut down your cluster

What you'll need

A browser, such as Chrome or Firefox

2. Setup and Requirements

Self-paced environment setup

Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or G Suite account, you must create one.)

Remember the project ID, a name that is unique across all Google Cloud projects (an ID that is already taken will not work for you). It will be referred to later in this codelab as PROJECT_ID.

Next, you'll need to enable billing in Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost much, if anything at all. Be sure to follow the instructions in the "Shut down your cluster" section, which explains how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.

3. Enable the Cloud Dataproc and Google Compute Engine APIs

Click on the menu icon in the top left of the screen.

Select API Manager from the drop down.

Click on Enable APIs and Services.

Search for "Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.

On the Google Compute Engine page, click Enable.

Once the API is enabled, click the left-pointing arrow to go back.

Now search for "Google Cloud Dataproc API" and enable it as well.

4. Start Cloud Shell

This Debian-based virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. This means that all you will need for this codelab is a browser (yes, it works on a Chromebook).

To activate Cloud Shell from the Cloud Console, simply click Activate Cloud Shell (it should only take a few moments to provision and connect to the environment).

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If, for some reason, the project is not set, simply issue the following command:

gcloud config set project <PROJECT_ID>

Looking for your PROJECT_ID? Check which ID you used in the setup steps, or look it up in the Cloud Console dashboard.

Cloud Shell also sets some environment variables by default, which may be useful as you run future commands.

echo $GOOGLE_CLOUD_PROJECT

Command output

<PROJECT_ID>

Finally, set the default Compute Engine zone in your gcloud configuration.

gcloud config set compute/zone us-central1-f

You can choose a variety of different zones. For more information, see Regions & Zones.
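Later commands in this codelab take both a zone (us-central1-c) and a region (us-central1). As a minimal sketch, assuming the usual Compute Engine convention that a zone name is its region name plus a one-letter suffix, the region can be derived from the zone with shell parameter expansion. The helper name region_from_zone is hypothetical and is not part of gcloud.

```shell
# Hypothetical helper: strip the trailing "-<letter>" from a zone name.
# Assumes zone names follow the <region>-<letter> convention.
region_from_zone() {
  local zone="$1"
  printf '%s\n' "${zone%-*}"
}

ZONE=us-central1-c
REGION="$(region_from_zone "${ZONE}")"
echo "${REGION}"
```

Keeping the two values in variables makes it harder to pass a zone that does not belong to the region you chose.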

5. Create a Cloud Dataproc cluster

After Cloud Shell launches, you can use the command line to invoke the Cloud SDK gcloud command or other tools available on the virtual machine instance.

Choose a cluster name to use in this lab:

$ CLUSTERNAME=${USER}-dplab
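If ${USER} contains uppercase letters or characters such as underscores, the cluster name above may be rejected. The sketch below normalizes the name first, assuming that cluster names may contain only lowercase letters, digits, and hyphens. The helper sanitize_cluster_name is hypothetical; check the Dataproc documentation for the exact naming rules.

```shell
# Hypothetical helper: lowercase the input, then replace every character
# outside [a-z0-9-] with a hyphen.
sanitize_cluster_name() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9-' '-'
  printf '\n'
}

CLUSTERNAME="$(sanitize_cluster_name "${USER}-dplab")"
echo "${CLUSTERNAME}"
```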

Let's get started by creating a new cluster:

$ gcloud dataproc clusters create ${CLUSTERNAME} \
  --region=us-central1 \
  --scopes=cloud-platform \
  --tags codelab \
  --zone=us-central1-c

The default cluster settings, which include two worker nodes, should be sufficient for this tutorial. The command above includes the --zone option to specify the geographic zone in which the cluster will be created, and two advanced options, --scopes and --tags, which are explained later, when you use the features they enable. See the Cloud SDK gcloud dataproc clusters create command for information on using command-line flags to customize cluster settings.
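As one illustration of customizing cluster settings, the sketch below wraps the create command in a hypothetical function, create_custom_cluster, that sets the worker count and worker machine type through the --num-workers and --worker-machine-type flags. The values in the example call are placeholders, not recommendations, and the region and zone are the ones used elsewhere in this codelab.

```shell
# Hypothetical wrapper: create a cluster with an explicit worker count
# and worker machine type. Defining the function does not create anything.
create_custom_cluster() {
  local name="$1" workers="$2" machine_type="$3"
  gcloud dataproc clusters create "${name}" \
    --region=us-central1 \
    --zone=us-central1-c \
    --num-workers="${workers}" \
    --worker-machine-type="${machine_type}"
}

# Example call (placeholder values, not run here):
#   create_custom_cluster "${CLUSTERNAME}" 3 n1-standard-2
```

Larger clusters cost more, so the defaults remain the better choice for this codelab.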

6. Submit a Spark job to your cluster

You can submit a job via a Cloud Dataproc API jobs.submit request, using the gcloud command line tool, or from the Google Cloud Platform Console. You can also connect to a machine instance in your cluster using SSH, and then run a job from the instance.

Let's submit a job using the gcloud tool from the Cloud Shell command line:

$ gcloud dataproc jobs submit spark --cluster ${CLUSTERNAME} \
  --region=us-central1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

As the job runs you will see the output in your Cloud Shell window.

Interrupt the output by entering Control-C. This will stop the gcloud command, but the job will still be running on the Dataproc cluster.

7. List Jobs and Reconnect

Print a list of jobs:

$ gcloud dataproc jobs list --cluster ${CLUSTERNAME} --region=us-central1

The most recently submitted job is at the top of the list. Copy the job ID and paste it in place of "jobId" in the command below. The command will reconnect to the specified job and display its output:

$ gcloud dataproc jobs wait jobId --region=us-central1

When the job finishes, the output will include an approximation of the value of Pi.
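Copying the job ID by hand can be avoided with gcloud's --format flag. The sketch below is one way to do it, under two assumptions: that the job list is ordered most recent first, as described above, and that the Dataproc job resource exposes the fields reference.jobId and status.state. The function names are hypothetical.

```shell
# Hypothetical helper: print the ID of the first job listed for a cluster.
latest_job_id() {
  gcloud dataproc jobs list --cluster "$1" --region=us-central1 \
    --limit=1 --format='value(reference.jobId)'
}

# Hypothetical helper: print the current state of a job, for example DONE.
job_state() {
  gcloud dataproc jobs describe "$1" --region=us-central1 \
    --format='value(status.state)'
}

# Example use (not run here):
#   JOB_ID="$(latest_job_id "${CLUSTERNAME}")"
#   gcloud dataproc jobs wait "${JOB_ID}" --region=us-central1
```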

8. Resize Cluster

For running larger computations, you might want to add more nodes to your cluster to speed it up. Dataproc lets you add nodes to and remove nodes from your cluster at any time.

Examine the cluster configuration:

$ gcloud dataproc clusters describe ${CLUSTERNAME} --region=us-central1

Make the cluster larger by adding some preemptible nodes:

$ gcloud dataproc clusters update ${CLUSTERNAME} --region=us-central1 --num-secondary-workers=2

Examine the cluster again:

$ gcloud dataproc clusters describe ${CLUSTERNAME} --region=us-central1

Note that in addition to the workerConfig from the original cluster description, there is now also a secondaryWorkerConfig that includes two instanceNames for the preemptible workers. Dataproc shows the cluster status as being ready while the new nodes are booting.
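To see only the worker counts instead of the full description, you can project specific fields with --format. This is a sketch that assumes the counts appear under config.workerConfig.numInstances and config.secondaryWorkerConfig.numInstances in the cluster description; the function name worker_counts is hypothetical.

```shell
# Hypothetical helper: print the primary and secondary worker counts
# for a cluster, separated by a tab.
worker_counts() {
  gcloud dataproc clusters describe "$1" --region=us-central1 \
    --format='value(config.workerConfig.numInstances,config.secondaryWorkerConfig.numInstances)'
}

# Example use (not run here):
#   worker_counts "${CLUSTERNAME}"
```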

Since you started with two worker nodes and now have four, Spark jobs that parallelize well can run up to about twice as fast; the actual speedup depends on the job.
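How close a job gets to that twofold speedup depends on how much of it runs in parallel. As a rough sketch, not a Dataproc feature, the hypothetical helper below applies Amdahl's law, speedup = 1 / ((1 - p) + p / s), where p is the assumed parallel fraction of the job and s is the factor by which the worker count grew (2 here).

```shell
# Hypothetical helper: estimate speedup with Amdahl's law.
#   $1 = parallel fraction of the job (0 to 1), an assumed value
#   $2 = factor by which the number of workers grew
estimate_speedup() {
  awk -v p="$1" -v s="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / s) }'
}

estimate_speedup 1.0 2   # perfectly parallel job: prints 2.00
estimate_speedup 0.9 2   # 90% parallel job: prints 1.82
```

Real jobs are also affected by scheduling, shuffle traffic, and the possibility that preemptible workers are reclaimed, so treat these numbers as an upper bound.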

9. SSH into Cluster

Connect via ssh to the master node, whose instance name is always the cluster name with -m appended:

$ gcloud compute ssh ${CLUSTERNAME}-m --zone=us-central1-c

The first time you run an ssh command on Cloud Shell it will generate ssh keys for your account there. You can choose a passphrase, or use a blank passphrase for now and change it later using ssh-keygen if you want.

On the instance, check the hostname:

$ hostname

Because you specified --scopes=cloud-platform when you created the cluster, you can run gcloud commands on your cluster. List the clusters in your project:

$ gcloud dataproc clusters list --region=us-central1

Log out of the ssh connection when you are done:

$ logout

10. Examine tags

When you created your cluster you included a --tags option to add a tag to each node in the cluster. Tags are used to attach firewall rules to each node. You did not create any matching firewall rules in this codelab, but you can still examine the tags on a node and the firewall rules on the network.

Print the description of the master node:

$ gcloud compute instances describe ${CLUSTERNAME}-m --zone us-central1-c

Look for tags: near the end of the output and see that it includes codelab.

Print the firewall rules:

$ gcloud compute firewall-rules list

Note the SRC_TAGS and TARGET_TAGS columns. By attaching a tag to a firewall rule, you can specify that it should be used on all nodes that have that tag.
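To list only the rules that target a given tag, you can add a --filter expression. The sketch below assumes that the firewall rule resource exposes its target tags as targetTags and that the ":" filter operator matches any element of that list. The function name rules_for_tag is hypothetical. Because no matching rule was created in this codelab, an empty result for codelab is expected.

```shell
# Hypothetical helper: print the names of firewall rules whose target
# tags include the given tag.
rules_for_tag() {
  gcloud compute firewall-rules list \
    --filter="targetTags:$1" \
    --format='value(name)'
}

# Example use (not run here):
#   rules_for_tag codelab
```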

11. Shut down your cluster

You can shut down a cluster via a Cloud Dataproc API clusters.delete request, from the command line using the gcloud dataproc clusters delete command, or from the Google Cloud Platform Console.

Let's shut down the cluster using the Cloud Shell command line:

$ gcloud dataproc clusters delete ${CLUSTERNAME} --region us-central1
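After the delete command returns, you may want to confirm that the cluster is gone so that it no longer incurs charges. The sketch below is one way to check. It assumes the cluster list can be filtered on clusterName, and the function name cluster_exists is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) if a cluster with the
# given name is still listed in the region, fail otherwise.
cluster_exists() {
  local found
  found="$(gcloud dataproc clusters list --region=us-central1 \
    --filter="clusterName = $1" --format='value(clusterName)')"
  [ -n "${found}" ]
}

# Example use (not run here):
#   cluster_exists "${CLUSTERNAME}" || echo "Cluster deleted."
```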

12. Congratulations!

You learned how to create a Dataproc cluster, submit a Spark job, resize a cluster, use ssh to log in to your master node, use gcloud to examine clusters, jobs, and firewall rules, and shut down your cluster using gcloud!

Learn More

Dataproc Documentation: https://cloud.google.com/dataproc/overview
Getting Started with Dataproc using the Console codelab

License

This work is licensed under a Creative Commons Attribution 3.0 Generic License, and Apache 2.0 license.
