The Cloud Natural Language API lets you extract entities from text, perform sentiment and syntactic analysis, and classify text into categories. In this lab, we'll focus on text classification. Using a database of 700+ categories, this API feature makes it easy to classify a large dataset of text.

What you'll learn

Creating a classifyText request and calling the Natural Language API with curl
Using the NL API's text classification feature to classify a large dataset of news articles
Analyzing the classified text in BigQuery

What you'll need

A Google Cloud Platform project with billing enabled

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project:

Remember the project ID, a name that must be unique across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.

Next, you'll need to enable billing in the Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "cleanup" section at the end of this document).

New users of Google Cloud Platform are eligible for a $300 free trial.

Click on the menu icon in the top left of the screen.

Select APIs & services from the drop-down menu and click on Dashboard.

Click on Enable APIs and services.

Then, search for "language" in the search box. Click on Google Cloud Natural Language API:

Click Enable to enable the Cloud Natural Language API:

Wait a few seconds for it to enable. You'll see a confirmation in the console once it's enabled.

Google Cloud Shell is a command line environment running in the Cloud. This Debian-based virtual machine is loaded with all the development tools you'll need (gcloud, bq, git and others) and offers a persistent 5GB home directory. We'll use Cloud Shell to create our request to the Natural Language API.

To get started with Cloud Shell, click the "Activate Google Cloud Shell" icon in the top right corner of the header bar.

A Cloud Shell session opens inside a new frame at the bottom of the console and displays a command-line prompt. Wait until the user@project:~$ prompt appears.

Since we'll be using curl to send a request to the Natural Language API, we'll need to generate an API key to pass in our request URL. To create an API key, navigate to the Credentials section of APIs & services in your Cloud console:

Then click Create credentials and, in the drop-down menu, select API key:

Next, copy the key you just generated. You will need this key later in the lab.

Now that you have an API key, save it to an environment variable to avoid having to insert the value of your API key in each request. You can do this in Cloud Shell. Be sure to replace <YOUR_API_KEY> with the key you just copied.

export API_KEY=<YOUR_API_KEY>

Using the Natural Language API's classifyText method, we can sort our text data into categories with a single API call. This method returns a list of content categories that apply to a text document. These categories range in specificity, from broad categories like /Computers & Electronics to highly specific categories such as /Computers & Electronics/Programming/Java (Programming Language). A full list of the 700+ possible categories can be found here.

We'll start by classifying a single article, and then we'll see how we can use this method to make sense of a large news corpus. To start, let's take this headline and description from a New York Times article in the food section:

A Smoky Lobster Salad With a Tapa Twist. This spin on the Spanish pulpo a la gallega skips the octopus, but keeps the sea salt, olive oil, pimentón and boiled potatoes.

In your Cloud Shell environment, create a request.json file with the code below. You can either create the file using one of your preferred command line editors (nano, vim, emacs) or use the built-in Orion editor in Cloud Shell:

request.json

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"A Smoky Lobster Salad With a Tapa Twist. This spin on the Spanish pulpo a la gallega skips the octopus, but keeps the sea salt, olive oil, pimentón and boiled potatoes."
  }
}

Now we can send this text to the NL API's classifyText method with the following curl command:

curl "https://language.googleapis.com/v1/documents:classifyText?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json

Let's take a look at the response:

{
  "categories": [
    {
      "name": "/Food & Drink/Cooking & Recipes",
      "confidence": 0.85
    },
    {
      "name": "/Food & Drink/Food/Meat & Seafood",
      "confidence": 0.63
    }
  ]
}

The API returned 2 categories for this text: /Food & Drink/Cooking & Recipes and /Food & Drink/Food/Meat & Seafood. The text doesn't explicitly mention that this is a recipe or even that it includes seafood, but the API is able to categorize it for us. Classifying a single article is cool, but to really see the power of this feature we should classify lots of text data.
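
If you want to pick out the top category programmatically rather than reading the raw JSON, here's a minimal Python sketch. It assumes you saved the curl output to a file, for example by appending > response.json to the command above (the filename is just for illustration), and could live in a file such as parse-response.py:

parse-response.py

import json

# Load the classifyText response saved from the curl command
# (assumes the output was redirected to response.json)
with open('response.json') as f:
    response = json.load(f)

# Print the categories sorted by confidence, highest first
for category in sorted(response['categories'],
                       key=lambda c: c['confidence'], reverse=True):
    print(category['name'], category['confidence'])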

To see how the classifyText method can help us understand a dataset with lots of text, we'll use this public dataset of BBC news articles. The dataset consists of 2,225 articles in five topic areas (business, entertainment, politics, sport, tech) from 2004 - 2005. We've put a subset of these articles into a public Google Cloud Storage bucket. Each of the articles is in a .txt file.

To examine the data and send it to the NL API, we'll write a Python script to read each text file from Cloud Storage, send it to the classifyText endpoint, and store the results in a BigQuery table. BigQuery is Google Cloud's big data warehouse tool - it lets us easily store and analyze large datasets.

To see the type of text we'll be working with, run the following command to view one article (gsutil provides a command line interface for Cloud Storage):

gsutil cat gs://text-classification-codelab/bbc_dataset/entertainment/001.txt
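
If you'd rather preview an article with the Python client library instead of gsutil, here's a minimal sketch (assuming the google-cloud-storage library is available in your environment, as the script later in the lab does). It uses anonymous access, which works because the bucket is public; the object path is the same one used in the gsutil command above:

preview-article.py

from google.cloud import storage

# The bucket is public, so an anonymous client is enough to read it
storage_client = storage.Client.create_anonymous_client()
bucket = storage_client.bucket('text-classification-codelab')
blob = bucket.blob('bbc_dataset/entertainment/001.txt')

# Download the article and print it as text
print(blob.download_as_string().decode('utf-8'))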

Next we'll create a BigQuery table for our data.

Before we send the text to the Natural Language API, we need a place to store the text and category for each article - enter BigQuery! Navigate to the BigQuery web UI in your console:

Then click on the dropdown arrow next to your project name and select Create new dataset:

Name your dataset news_classification (this is the dataset name the script and queries below use). You can leave the defaults in the Data location and Data expiration fields:

Click on the dropdown arrow next to your dataset name and select Create new table. Under Source Data, select "Create empty table". Then name your table article_data and give it the following 3 fields in the schema: article_text, category, and confidence:

After creating the table you should see the following:
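
If you'd prefer to create the dataset and table from Cloud Shell instead of the web UI, here's a minimal sketch using the google-cloud-bigquery client (it assumes the library can pick up your default Cloud Shell credentials; replace YOUR_PROJECT with your project ID). The field names match the script below; storing confidence as STRING matches the CAST used in the queries later, though a FLOAT field would also work:

create-table.py

from google.cloud import bigquery

# Replace YOUR_PROJECT with your project ID
bq_client = bigquery.Client(project='YOUR_PROJECT')

# Create the news_classification dataset
dataset_ref = bq_client.dataset('news_classification')
dataset = bq_client.create_dataset(bigquery.Dataset(dataset_ref))

# Create the article_data table with the three schema fields
schema = [
    bigquery.SchemaField('article_text', 'STRING'),
    bigquery.SchemaField('category', 'STRING'),
    bigquery.SchemaField('confidence', 'STRING'),
]
table_ref = dataset_ref.table('article_data')
table = bq_client.create_table(bigquery.Table(table_ref, schema=schema))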

Our table is empty right now. In the next step we'll read the articles from Cloud Storage, send them to the NL API for classification, and store the result in BigQuery.

Before we write a script to send the news data to the NL API, we need to create a service account. We'll use this to authenticate to the NL API and BigQuery from our Python script. First, export your Cloud project ID as an environment variable:

export PROJECT=<your_project_id>

Then run the following commands from Cloud Shell to create a service account:

gcloud iam service-accounts create my-account --display-name my-account
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:my-account@$PROJECT.iam.gserviceaccount.com --role=roles/bigquery.admin
gcloud iam service-accounts keys create key.json --iam-account=my-account@$PROJECT.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=key.json
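
Optionally, before moving on, you can sanity-check that the service account key works with a short snippet like the one below (just a check, not part of the lab; replace YOUR_PROJECT with your project ID):

check-credentials.py

from google.cloud import bigquery

# If the key in GOOGLE_APPLICATION_CREDENTIALS is valid, this lists
# the datasets in the project (news_classification should appear)
bq_client = bigquery.Client(project='YOUR_PROJECT')
for dataset in bq_client.list_datasets():
    print(dataset.dataset_id)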

Now we're ready to send the text data to the NL API. To do that we'll write a Python script using the Python module for Google Cloud (note that you could accomplish the same thing in any language; there are many different Cloud client libraries). Create a file called classify-text.py and copy the following into it, making sure to replace YOUR_PROJECT with your project ID.

classify-text.py

from google.cloud import storage, language, bigquery

# Set up our GCS, NL, and BigQuery clients
storage_client = storage.Client()
nl_client = language.LanguageServiceClient()
# TODO: replace YOUR_PROJECT with your project name below
bq_client = bigquery.Client(project='YOUR_PROJECT')

dataset_ref = bq_client.dataset('news_classification')
dataset = bigquery.Dataset(dataset_ref)
table_ref = dataset.table('article_data')
table = bq_client.get_table(table_ref)

# Send article text to the NL API's classifyText method
def classify_text(article):
    response = nl_client.classify_text(
        document=language.types.Document(
            content=article,
            type=language.enums.Document.Type.PLAIN_TEXT
        )
    )
    return response


rows_for_bq = []
files = storage_client.bucket('text-classification-codelab').list_blobs()
print("Got article files from GCS, sending them to the NL API (this will take ~2 minutes)...")

# Send files to the NL API and save the result to send to BigQuery
for file in files:
    if file.name.endswith('txt'):
        article_text = file.download_as_string()
        nl_response = classify_text(article_text)
        if len(nl_response.categories) > 0:
            rows_for_bq.append((
                article_text,
                nl_response.categories[0].name,
                nl_response.categories[0].confidence
            ))

print("Writing NL API article data to BigQuery...")
# Write article text + category data to BQ
errors = bq_client.create_rows(table, rows_for_bq)
assert errors == []

We're ready to start classifying articles and importing them to BigQuery. The script takes about two minutes to complete, so while it's running we'll discuss what's happening. Run the script with the following:

python classify-text.py

We're using the google-cloud Python client library to access Cloud Storage, the NL API, and BigQuery. First we create a client for each service we'll be using, and then we create references to our BigQuery table. files is an iterator over the BBC dataset files in the public bucket. We iterate through these files, download the articles as strings, and send each one to the NL API's classifyText method in our classify_text function. For every article where the NL API returns a category, we save the article and its category data to a rows_for_bq list. When we're done classifying each article, we insert our data into BigQuery using create_rows().

When your script has finished running, it's time to verify that the article data was saved to BigQuery. Navigate to your article_data table in the BigQuery web UI and click Query Table:

Enter the following query in the compose query box, replacing YOUR_PROJECT with your project name:

#standardSQL
SELECT * FROM `YOUR_PROJECT.news_classification.article_data`

You should see your data when the query completes. The category column has the name of the first category the NL API returned for our article, and confidence is a value between 0 and 1 indicating how confident the API is that it categorized the article correctly. We'll learn how to perform more complex queries on the data in the next step.
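
If you'd prefer to run queries from Cloud Shell rather than the web UI, here's a minimal sketch using the same BigQuery Python client (the table name matches the one created above; replace YOUR_PROJECT with your project ID, and save it in a file such as query-table.py):

query-table.py

from google.cloud import bigquery

bq_client = bigquery.Client(project='YOUR_PROJECT')

# A small query against the article_data table
query = """
    SELECT category, confidence
    FROM `YOUR_PROJECT.news_classification.article_data`
    LIMIT 10
"""
for row in bq_client.query(query).result():
    print(row.category, row.confidence)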

First, let's see which categories were most common in our dataset. Enter the following query, replacing YOUR_PROJECT and YOUR_DATASET with your project and dataset names:

#standardSQL
SELECT 
  category, 
  COUNT(*) c 
FROM 
  `YOUR_PROJECT.news_classification.article_data` 
GROUP BY 
  category 
ORDER BY 
  c DESC

You should see something like this in the query results:

Let's say we wanted to find the article returned for a more obscure category like /Arts & Entertainment/Music & Audio/Classical Music. We could write the following query:

#standardSQL
SELECT * FROM `YOUR_PROJECT.YOUR_DATASET.article_data`
WHERE category = "/Arts & Entertainment/Music & Audio/Classical Music"

Or, we could get only the articles where the NL API returned a confidence score greater than 90%:

#standardSQL
SELECT 
  article_text, 
  category 
FROM `YOUR_PROJECT.YOUR_DATASET.article_data` 
WHERE cast(confidence as float64) > 0.9

To perform more queries on your data, explore the BigQuery documentation. BigQuery also integrates with a number of visualization tools. To create visualizations of your categorized news data, check out the Data Studio quickstart for BigQuery. For example, you could build a chart of the per-category counts from the query above.
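
If you just want a quick look at the category counts without leaving Cloud Shell, here's a minimal sketch that pulls the results of the query above into a pandas DataFrame (it assumes pandas is installed, e.g. via pip install pandas; replace YOUR_PROJECT with your project ID):

category-counts.py

from google.cloud import bigquery

bq_client = bigquery.Client(project='YOUR_PROJECT')

# Per-category article counts, highest first
query = """
    SELECT category, COUNT(*) AS c
    FROM `YOUR_PROJECT.news_classification.article_data`
    GROUP BY category
    ORDER BY c DESC
"""
df = bq_client.query(query).result().to_dataframe()
print(df.head(10))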

You've learned how to use the Natural Language API's text classification method to classify news articles. You started by classifying one article, and then learned how to classify and analyze a large news dataset using the NL API with BigQuery.

What we've covered

Creating a Natural Language API classifyText request and calling the API with curl
Classifying a large news dataset with the NL API and storing the results in BigQuery
Querying the classified article data in BigQuery

Next Steps

Explore the BigQuery documentation to run more queries on your data
Check out the Data Studio quickstart for BigQuery to visualize your categorized news data
Try the Natural Language API's other methods, such as entity, sentiment, and syntactic analysis