What you need

To complete this lab, you need:

Internet access

Access to a supported Internet browser

What you do

What you learn

By the end of this lab you will have built your own global IaaS Hadoop cluster deployment service based on open source software. The service gives you a reliable and secure way to deploy a Hadoop cluster of any size in any GCP region in minutes.

To accomplish this, you will create an IAM Service Account with the role of Project Editor. You will authorize and initialize the Google Cloud SDK on a VM. Then you will "bake" that authority into a reusable snapshot that can reconstitute the "clustermaker" VM in any GCP region.

The "clustermaker" VM is a micro type machine with just enough capacity to do the work needed to create and deploy Hadoop clusters using the Google Cloud SDK.

The clustermaker VM needs to be able to use the Google Cloud API. To do this, you will create an IAM Service Account with the Project Editor role, download its private key, and copy it to the VM. You will then go through an authorization and initialization process on the VM. Next you will customize the VM by installing Git, and then use Git to download an open source tool called bdutil (Big Data Utility). You will modify its configuration to select a machine type for the workers in the cluster. bdutil uses a GCS bucket to stage and install the Hadoop software, and Hadoop will be configured to use GCS rather than HDFS for its file system.

After you have verified that the Hadoop cluster created by "clustermaker" is working, you will create a snapshot of the boot persistent disk using the snapshot service.

The snapshot can be used to recreate the persistent disk in any region. And the persistent disk can be used to launch a VM in that region. You will test this by creating a new clustermaker VM in a different region and using it to launch another Hadoop cluster.

Activities:

Step 1 Create a Project

To make cleanup easier, create a new Google Cloud project.

Remember the project ID, a unique name across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.
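If you prefer the command line, the project could also be created from Cloud Shell with gcloud. This is a sketch: [PROJECT_ID] is a placeholder for the ID you choose, and billing still has to be linked to the new project before you can create resources in it.

$ gcloud projects create [PROJECT_ID]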

Step 1 Create a service account

Console: Products and Services > IAM & Admin > Service accounts

Click on [Create Service Account]

Name: clustermaker

Role: Project > Editor

Select Furnish a new private key with Key type JSON.

Click [Create]

This will download a JSON key file. You will need to find this key file and open it in a text editor to make a copy of it on the VM in the next step.
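For reference, the same service account, role binding, and key could be created from Cloud Shell with gcloud. This is a sketch, not part of the lab steps; substitute your own PROJECT_ID, and note that the key file name credentials.json simply matches the file you will create on the VM later.

# Create the service account
$ gcloud iam service-accounts create clustermaker \
--display-name clustermaker

# Grant it the Project Editor role
$ gcloud projects add-iam-policy-binding [PROJECT_ID] \
--member serviceAccount:clustermaker@[PROJECT_ID].iam.gserviceaccount.com \
--role roles/editor

# Create and download a JSON private key
$ gcloud iam service-accounts keys create credentials.json \
--iam-account clustermaker@[PROJECT_ID].iam.gserviceaccount.com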

Step 2 Create a VM instance

You are going to create the VM with a persistent disk that remains when the VM is deleted. This disk will become the pattern you use later to recreate the cluster-provisioning VM.

Console: Products and Services > Compute Engine > VM instances

Click on [Create instance]

Name: clustermaker

Zone: us-east1-b

Machine type: micro (shared vCPU)

You shouldn't need to change the following settings; just verify them.

Boot disk: New 10 GB, Debian GNU/Linux

Under Identity and API access:

Service account: Compute Engine default service account (don't change)

Firewall: don't change

Click on the extended menu at the bottom to access advanced features.

Click on the Disks tab to get advanced options

Uncheck the box before Delete boot disk when instance is deleted.

Click [Create]
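For reference, an equivalent VM could be created with gcloud. This is a sketch: the Debian image family shown is an assumption (the console simply defaults to a Debian boot disk), and f1-micro corresponds to the micro (shared vCPU) machine type.

$ gcloud compute instances create clustermaker \
--zone us-east1-b \
--machine-type f1-micro \
--image-family debian-9 \
--image-project debian-cloud \
--boot-disk-size 10GB \
--no-boot-disk-auto-delete

The --no-boot-disk-auto-delete flag is the command-line equivalent of unchecking Delete boot disk when instance is deleted.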

Step 3 Authorize the VM to use the Cloud SDK

SSH to the VM.

This VM is going to use API calls to create the Hadoop cluster. So it needs to have the Google Cloud SDK installed and it needs to be authorized to use the API.

The SDK is required for the gcloud command line tool. So we can verify that the Google Cloud SDK is installed if the gcloud tool is installed. The -v (version) option will list the Google Cloud SDK version.

Check if the SDK is installed on this VM:

$ gcloud -v

Are the gcloud tool and the SDK installed on this standard GCP base image?

Now, to see whether the permissions are set up to make calls to Google Cloud, try a simple gcloud command that uses the API:

$ gcloud compute zones list

What was the result?

Copy the JSON credentials to the VM.

  1. Find and open the downloaded JSON file in a text editor. The file was downloaded to your computer in Step 1.
  2. Select all the text and copy it. (Usually [CTRL][A] and [CTRL][C] ).
  3. On clustermaker in the SSH terminal, create a file named credentials.json.

Paste the contents of the JSON file into this new file on the VM.

In the vi editor, you would enter the following:

vi credentials.json
i  (for insert mode)
[CTRL][V] (to paste)
[ESC] then :wq (to write the file and exit the editor)
  4. Authorize the VM

Use the gcloud auth command and the credentials file you just created to authorize the VM.

$ gcloud auth activate-service-account --key-file credentials.json
  5. Re-initialize gcloud

Re-initialize gcloud, which will now display the service account you created.

$ gcloud init

Select option [1] to re-initialize.

This time you should see the service account that you created in the list, with the name clustermaker.

Select option [2], the service account you created, beginning with clustermaker.

Enter the PROJECT_ID.

Do you want to configure Google Compute Engine 
(https://cloud.google.com/compute) settings (Y/n)?

Y

For the zone, enter the number corresponding to us-east1-b.

  6. Verify the configuration.

Verify that the VM is now authorized to use the Google Cloud Platform API by listing the available zones.

$ gcloud compute zones list

This time the command should succeed.

Now that the service account is activated, remove the credentials file from the VM:

$ rm credentials.json

Next you will customize the VM and install the necessary software. The final step is to create a Hadoop cluster from the customized VM to verify that it is working.

You will use an open source cluster automation tool called bdutil, which is hosted in a Git repository. To copy that code to the VM, you first need to install Git.

Step 1 Install Git

Make sure the package index is up to date.

$ sudo apt-get -qq update

Install Git.

$ sudo apt-get install -y -qq git

Step 2 Clone the source code

Use git to download the code to the VM.

$ git clone \
https://github.com/GoogleCloudPlatform/bdutil

Step 3 Modify the configuration

In the shell script bdutil_env.sh, line 38 reads:

GCE_MACHINE_TYPE='n1-standard-4'

That will create worker nodes with 4 vCPUs each, which is a reasonable size for a Hadoop cluster, but for demonstration purposes you will use a smaller machine. Edit the file and change n1-standard-4 to n1-standard-1.
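If you would rather make the change from the command line, a sed substitution works as well. This is a sketch that assumes the GCE_MACHINE_TYPE line appears in bdutil_env.sh exactly as shown above.

$ sed -i \
"s/GCE_MACHINE_TYPE='n1-standard-4'/GCE_MACHINE_TYPE='n1-standard-1'/" \
bdutil_env.sh

# Confirm the change
$ grep GCE_MACHINE_TYPE bdutil_env.sh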

Step 1 Create a Cloud Storage bucket

You will need a Google Cloud Storage bucket.

The application stages intermediate files in the bucket during cluster VM creation and configuration. The bucket will also serve as the base for the file system used by Hadoop: Hadoop can use Cloud Storage instead of the Hadoop Distributed File System (HDFS). (HDFS itself was derived from the design of the Google File System.)

Console: Products and Services > Storage > Browser

Click on [Create bucket]

Give the bucket an appropriate globally unique name.

Common practice is to use the PROJECT_ID as a namespace:

[PROJECT_ID]-hadoop-storage

Make the bucket Multi-regional in the United States.
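The same bucket could also be created from the command line with gsutil. This is a sketch; substitute your own PROJECT_ID, and note that the multi_regional class with location US matches the console settings above.

$ gsutil mb -c multi_regional -l US \
gs://[PROJECT_ID]-hadoop-storage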

Step 2 Create a cluster

SSH to the VM.

The following commands will create the first Hadoop cluster.

$ cd bdutil

$ ./bdutil -b [Bucket Name] \
-z us-east1-b \
-n 2 \
-P [PROJECT_ID] \
deploy

...
At the (y/n) prompt, enter y.

Step 3 Wait for completion

In the console, view the VM instances page to watch the cluster VMs being created by the clustermaker VM.

Wait until bdutil completes in the terminal before proceeding to the next part of the lab.

You will now verify that the cluster is working. While you could run a MapReduce job, in the interest of time, you will just verify that the Hadoop file system is functioning correctly.

Step 1 SSH to the Hadoop Master

From the clustermaker VM terminal, SSH into the Hadoop Master VM.

The master will have the name of your PROJECT_ID with -m appended.

$ gcloud compute ssh --zone=us-east1-b \
--project=[PROJECT_ID] \
[PROJECT_ID]-m

Step 2 Exercise the Hadoop file system

Create a Hadoop file system directory.

$ hadoop fs -mkdir testsetup

Download a file to use for the test. This is a Hadoop cluster setup document in HTML format from the apache.org website.

$ curl \
http://hadoop.apache.org/docs/current/\
hadoop-project-dist/hadoop-common/\
ClusterSetup.html > setup.html

Copy the file into the directory you created.

$ hadoop fs -copyFromLocal setup.html testsetup

See if the file is in Hadoop.

$ hadoop fs -ls testsetup

Dump the contents of the file.

$ hadoop fs -cat testsetup/setup.html
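Because Hadoop on this cluster uses Cloud Storage as its file system, the file you just copied is stored as an object in your bucket rather than in HDFS. If you are curious, you could look for it with gsutil (a sketch; the exact object path depends on how bdutil lays out the default file system and user directories, so this simply searches the whole listing):

$ gsutil ls -r gs://[Bucket Name] | grep setup.html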

Step 1 Exit the terminal

Type exit to leave the Hadoop master, and exit again to end the clustermaker session. This should close the terminal window.

Step 2 Delete all the VMs

Recall that when you created the clustermaker VM you specified that the disk was not to be deleted when the VM is deleted.

Console: Products and Services > Compute Engine > VM instances

Select all the instances and click [Delete].
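The same cleanup could be done from Cloud Shell with gcloud. This is a sketch; the instance names below are placeholders, so list the instances first and substitute the actual names from the output. The clustermaker boot disk survives the deletion because auto-delete was disabled when the VM was created.

# See which instances exist
$ gcloud compute instances list

# Delete them by name (substitute the names from the listing)
$ gcloud compute instances delete [INSTANCE_NAME_1] \
[INSTANCE_NAME_2] --zone us-east1-b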

Step 3 Create a Persistent Disk snapshot

Console: Products and Services > Compute Engine > Disks

The clustermaker disk should still exist.

Make a snapshot from the disk by clicking on the three vertical dots at the end of the row and choosing [+ Create snapshot].

Give it the name clustermaker.
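Equivalently, the snapshot could be created with gcloud from Cloud Shell. This is a sketch; the disk name clustermaker and the zone us-east1-b come from the earlier steps.

$ gcloud compute disks snapshot clustermaker \
--zone us-east1-b \
--snapshot-names clustermaker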

Step 1 Reconstitute the disk from the snapshot

Console: Products and Services > Compute Engine > Disks

Click [Create Disk]

Name: newdisk1

Zone: us-central1-c

Under Source type:

Snapshot: clustermaker (the snapshot you created)

Notice that the new disk is in a completely different region and zone than the original.
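For reference, the same disk could be created with gcloud. This is a sketch; only the target zone needs to be specified because snapshots are a global resource.

$ gcloud compute disks create newdisk1 \
--source-snapshot clustermaker \
--zone us-central1-c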

Step 2 Start a VM from the disk

Create a VM from the new disk.

Console: Products and Services > Compute Engine > Disks

Click on the three vertical dots at the end of the row for newdisk1.

Select [Create instance]

Give it the name new-clustermaker.
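The equivalent gcloud command attaches the existing disk as the boot disk. This is a sketch; f1-micro is an assumption that matches the original clustermaker machine type.

$ gcloud compute instances create new-clustermaker \
--zone us-central1-c \
--machine-type f1-micro \
--disk name=newdisk1,boot=yes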

Step 3 Launch a new Hadoop Cluster in a new Region

When the VM is operational, SSH to it and launch another two-worker (three-node) cluster in the new region.

Console: Products and Services > Compute Engine > VM instances

The following commands will create the new Hadoop cluster:

$ cd bdutil

$ ./bdutil -b [Bucket Name] \
-z us-central1-c \
-n 2 \
-P [PROJECT_ID] \
deploy

...
At the (y/n) prompt, enter y.

Congratulations! You have created a custom snapshot from which to launch Hadoop clusters of any size in any region as needed.

There are many ways to provide Big Data processing in GCP, including BigQuery, Cloud Dataproc, and Cloud Dataflow. You can also use third-party installation tools to deploy Hadoop clusters. In this lab you learned to create your own IaaS solution where you have access to and control over all the source code.

During this lab you learned many IaaS skills that can be leveraged to automate activities through the Google Cloud SDK. This is important for Site Reliability Engineering (SRE). You can build on what you have learned here by studying the bdutil source code to see how it works with the Cloud API.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.