1. Introduction
One of the core components of Apache Spark is Spark ML, a library for building machine learning models and pipelines on top of the Spark engine. As described on the project website, it provides tools such as the following (a short, illustrative example appears after the list):
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.
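To make these pieces concrete, here is a minimal PySpark sketch that assembles features, trains a classifier in a Pipeline, and persists the fitted model. The toy data, column names, and the choice of LogisticRegression are illustrative assumptions and are not taken from this codelab's notebook.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Start Spark in local mode (assumption: not attached to a managed cluster).
spark = SparkSession.builder.master("local[*]").appName("spark-ml-sketch").getOrCreate()

# Toy training data: two numeric features and a binary label (placeholder values).
train_df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.5, 3.0, 1.0), (0.5, 0.3, 0.0)],
    ["f1", "f2", "label"],
)

# Featurization: assemble the raw columns into a single feature vector.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")

# ML algorithm: a simple binary classifier.
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline: chain featurization and the estimator, then fit on the data.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)
model.transform(train_df).select("features", "label", "prediction").show()

# Persistence: save the fitted pipeline and load it back.
model.write().overwrite().save("/tmp/spark_ml_pipeline_model")
reloaded = PipelineModel.load("/tmp/spark_ml_pipeline_model")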
In this codelab, you will learn how to create a Spark ML model using a notebook.
3. Create and connect to a Vertex AI Workbench instance
In this section you will create a Vertex AI Workbench instance. You will then connect to it, clone a GitHub repository, and run a notebook.
To create the Vertex AI Workbench instance, follow the steps below.
- Go to the Managed Notebooks console page.
- Click NEW NOTEBOOK.
- Provide a name and choose a region such as us-central1 (Iowa). This should ideally match the region selected earlier in the codelab, though it is not mandatory.
- Under Permission select Single user only.
- Open the Advanced Settings dropdown.
- Under Security select Enable nbconvert and Enable terminal.
- Click CREATE.
The instance should be provisioned within about five minutes. You'll see a green check mark next to the Notebook name when the instance is ready.
When the instance is ready, click OPEN JUPYTERLAB. Authenticate when prompted and allow the requested permissions.
4. Build models with Spark ML from a notebook
After JupyterLab loads, you land on the Launcher tab. In this tab, under Other, click Terminal to open a new terminal.
In the terminal, clone the Vertex AI Samples repository.
git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git
In the File Browser tab, navigate to vertex-ai-samples/notebooks/official/workbench/spark. Open the notebook spark_ml.ipynb by double-clicking it. When prompted to select a kernel, select Python (local).
Walk through the steps of the notebook by executing each cell as you go. Follow along with the instructions in the cells.
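The notebook itself contains the actual code to run. As a rough, illustrative sketch of what the opening cells typically look like when Spark runs in local mode on the Workbench instance (the file path, options, and split below are placeholders, not the notebook's code):

from pyspark.sql import SparkSession

# The Python (local) kernel runs Spark in local mode on the Workbench instance.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-ml-codelab")
    .getOrCreate()
)

# Load a dataset into a Spark DataFrame and inspect it (the path is a placeholder).
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()

# Hold out a test split before fitting a Spark ML pipeline on the training data.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)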
5. Clean up resources
To avoid incurring unnecessary charges to your GCP account after completion of this codelab:
- Delete your Workbench instance. From the console, check the box next to your instance and click DELETE.
If you created a project just for this codelab, you can also optionally delete the project:
- In the GCP Console, go to the Projects page.
- In the project list, select the project you want to delete and click Delete.
- In the box, type the project ID, and then click Shut down to delete the project.