Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

This lab is adapted from https://cloud.google.com/dataproc/quickstart-console

What you'll learn

How to create a managed Cloud Dataproc cluster

How to submit a Spark job

How to shut down your cluster

What you'll need

A Google Account (Gmail or Google Apps)

A Google Cloud Platform project with billing enabled


Self-paced environment setup

The steps in this section prepare a project for working with Dataproc. Once completed, you won't need to repeat them for later work with Dataproc in the same project.

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform Console (console.cloud.google.com) and create a new project.

Remember the project ID, a unique name across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.
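If you also plan to follow along from the command line, you can point the Cloud SDK's gcloud tool at your new project. This is optional; the command-line sketches in this codelab assume gcloud is installed and authenticated, with PROJECT_ID standing in for your own project ID:

    # Make your new project the default for gcloud commands.
    gcloud config set project PROJECT_ID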

Next, you'll need to enable billing in the Cloud Platform Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "cleanup" section at the end of this document).

New users of Google Cloud Platform are eligible for a $300 free trial.

Click on the menu icon in the top left of the screen.

Select API Manager from the drop-down.

Click on Enable API.

Search for "Google Compute Engine" in the search box. Click on "Google Compute Engine API" in the results list that appears.

On the Google Compute Engine page, click Enable.

Once the API has been enabled, click the arrow to go back.

Now search for "Google Cloud Dataproc API" and enable it as well.
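If you prefer the command line, both APIs can be enabled with a single Cloud SDK command. This is a sketch assuming gcloud is installed and authenticated against your project, as set up above:

    # Enable the Compute Engine and Dataproc APIs for the current project.
    gcloud services enable compute.googleapis.com dataproc.googleapis.com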

In the Google Cloud Platform Console, click the Menu icon in the top left of the screen.

Then navigate to Dataproc in the drop-down.

After clicking, you will land on the Clusters page, which will be empty if the project has no clusters.

To create a new cluster, click Create cluster.

There are many parameters you can configure when creating a new cluster. Most of the default cluster settings, which include two worker nodes, should be sufficient for this tutorial. Let's also use the following:

Name: gcelab

Zone: us-central1-c (learn more about zones in the Regions & Zones documentation)

Machine type (Master node): n1-standard-2

Machine type (Worker nodes): n1-standard-2

Click on Create to create the new cluster!
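For reference, an equivalent cluster can be created from the command line. A sketch under the same Cloud SDK assumptions as above; note that newer gcloud versions also require a --region flag (us-central1 for the zone used here):

    # Create a Dataproc cluster matching the console settings above.
    gcloud dataproc clusters create gcelab \
        --zone us-central1-c \
        --master-machine-type n1-standard-2 \
        --worker-machine-type n1-standard-2 \
        --num-workers 2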

Select Jobs in the left nav to switch to Dataproc's jobs view.

Click Submit job.

Select your new cluster gcelab from the Cluster drop-down menu.

Select Spark from the Job type drop-down menu.

Enter file:///usr/lib/spark/examples/jars/spark-examples.jar in the Jar files field.

Enter org.apache.spark.examples.SparkPi in the Main class or jar field.

Enter 1000 in the Arguments field to set the number of tasks.

Click Submit.
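The same job can also be submitted without the console. A minimal command-line sketch, under the same Cloud SDK assumptions as above (newer gcloud versions also want --region):

    # Submit SparkPi to the gcelab cluster; everything after "--" is
    # passed to the main class as arguments (here, 1000 tasks).
    gcloud dataproc jobs submit spark \
        --cluster gcelab \
        --class org.apache.spark.examples.SparkPi \
        --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000

You can then watch the job from the CLI with gcloud dataproc jobs list --cluster gcelab.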

Your job should appear in the Jobs list, which shows all your project's jobs with their cluster, type, and current status. The new job displays as "Running", and then "Succeeded" once it completes.

To see your completed job's output:

Click the job ID in the Jobs list.

Select Line Wrapping to avoid horizontal scrolling.

You should see that your job has successfully calculated a rough value for pi!
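In case you're wondering how the job arrives at that value: SparkPi uses a Monte Carlo method. Each task scatters random points across a square and counts the fraction that land inside the inscribed circle; since that fraction approaches the ratio of the two areas, the job estimates

    pi ≈ 4 × (points inside the circle) / (total points)

The 1000 you entered in the Arguments field spreads the sampling across 1000 tasks, so more tasks means more samples and a closer estimate.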

You can shut down a cluster on the Clusters page.

Select the checkbox next to the gcelab cluster.

Then click Delete.
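If you created the cluster from the command line instead, the equivalent cleanup is the sketch below (same gcloud assumptions as above; the command asks for confirmation before deleting):

    # Delete the gcelab cluster and stop incurring charges for it.
    gcloud dataproc clusters delete gcelab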

You learned how to create a Dataproc cluster, submit a Spark job, and shut down your cluster!

Learn More

Cloud Dataproc documentation: https://cloud.google.com/dataproc/docs

License

This work is licensed under a Creative Commons Attribution 3.0 Generic License, and code samples are licensed under the Apache 2.0 License.