Create Spark ML models with Google Dataproc

1. Introduction

One of the core components of Apache Spark is Spark ML, a library for building machine learning models and pipelines on top of the Apache Spark engine. According to the Spark website, it provides tools such as:

  • ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence: saving and loading algorithms, models, and Pipelines
  • Utilities: linear algebra, statistics, data handling, etc.

In this codelab, you will learn how to create a Spark ML model using a notebook.

2. Enable APIs

For this codelab, you must enable the following APIs:

Click this link to enable these APIs in your project. When prompted, confirm that the APIs will be enabled in the correct project.

3. Create and connect to a Vertex AI Workbench instance

In this section, you will create a Vertex AI Workbench instance. You will then connect to it, clone a GitHub repository, and run a notebook.

To create the Vertex AI Workbench instance, you can follow the official instructions or the steps below.

  1. Go to the Managed Notebooks console page.
  2. Click NEW NOTEBOOK.
  3. Provide a name and choose a region such as us-central1 (Iowa). Ideally, this matches any region you selected earlier in the codelab, though this is not required.
  4. Under Permission, select Single user only.
  5. Open the Advanced Settings dropdown.
  6. Under Security, select Enable nbconvert and Enable terminal.
  7. Click CREATE.

The instance should be provisioned within about five minutes. You'll see a green check mark next to the notebook name when the instance is ready.

When the instance is ready, click OPEN JUPYTERLAB. Authenticate when prompted and grant all requested permissions.

4. Build models with Spark ML from a notebook

After the JupyterLab instance loads, you are in the Launcher tab. In this tab, under Other, click Terminal to open a new terminal.

In the terminal, clone the Vertex AI Samples repository.

git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git

In the File Browser tab, navigate to vertex-ai-samples/notebooks/official/workbench/spark. Open the notebook spark_ml.ipynb by double-clicking it. When prompted to select a kernel, select Python (local).
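If the notebook does not show up in the File Browser, a quick check from a Python prompt can confirm whether the clone landed where you expect. The sketch below is only an illustration: find_notebook is a hypothetical helper name, and it assumes the repository was cloned into the directory you pass in (the current working directory by default).

```python
from pathlib import Path


def find_notebook(clone_parent: Path) -> Path:
    """Return the path to spark_ml.ipynb under clone_parent, or raise if missing.

    Hypothetical helper: assumes `git clone` was run inside clone_parent.
    """
    notebook = (
        clone_parent
        / "vertex-ai-samples"
        / "notebooks"
        / "official"
        / "workbench"
        / "spark"
        / "spark_ml.ipynb"
    )
    if not notebook.is_file():
        raise FileNotFoundError(f"Notebook not found at {notebook}")
    return notebook


if __name__ == "__main__":
    try:
        print(find_notebook(Path.cwd()))
    except FileNotFoundError as err:
        # Most likely the clone was run from a different directory.
        print(err)
```

If the check fails, run pwd in the terminal where you cloned the repository and pass that directory to the helper instead.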

Walk through the steps of the notebook by executing each cell as you go. Follow along with the instructions in the cells.

5. Clean up resources

To avoid incurring unnecessary charges to your GCP account after completing this codelab:

  1. Delete your Workbench instance. From the console, check the box next to your instance and click DELETE.

If you created a project just for this codelab, you can also delete the project:

  1. In the GCP Console, go to the Projects page.
  2. In the project list, select the project you want to delete and click Delete.
  3. In the box, type the project ID, and then click Shut down to delete the project.