In this lab you will ingest, transform, and analyze a taxi cab dataset using Google Cloud Dataprep. We will calculate key reporting metrics like the average number of passengers picked up in the past hour.

What you need

You must have completed Lab 0.

What you learn

In this lab, you will learn how to clean and enrich multiple datasets using Cloud Dataprep, Google's self-service data preparation tool.

Skip this section if you already have a GCS Bucket

Step 1

Open the Google Cloud Console at console.cloud.google.com

Step 2

Go to Storage in the Products and Services menu (left-side navigation)

Step 3

Click Create Bucket (or use an existing bucket)

Step 4

In the Create a bucket window that appears, enter a unique bucket name and leave the remaining settings at their default values.

Step 5

Click Create

Step 6

You now have a Cloud Storage bucket, which we will use to store raw data for ingestion into Google BigQuery later and to store Cloud Dataprep settings.

Step 1

Open BigQuery at bigquery.cloud.google.com


Step 2

In the left side bar, hover over your project name


Step 3

Click on the down arrow that appears to the right


Step 4

In the drop-down menu, select Create new dataset


Step 5

For Dataset ID, type `taxi_cab_reporting`


Step 6

Click OK (leave the remaining fields at their default values)

Now you have a new empty dataset that we can populate with tables.
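
BigQuery also supports creating datasets with SQL DDL. If you prefer that route, here is a minimal sketch (run in the BigQuery query editor; it creates the dataset in your current project):

#standardSQL
-- DDL sketch: create the empty dataset used in this lab
-- (equivalent to the UI steps above)
CREATE SCHEMA `taxi_cab_reporting`;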

Step 1

Open the Google Cloud Console at console.cloud.google.com

Step 2

Open the Products & Services menu in the left-side navigation

Step 3

Under Big Data, click on Dataprep


Step 4

Click Allow to let Trifacta access your project data. Dataprep is provided in collaboration with Trifacta, a Google partner.

Step 5

Click the name of the bucket you created earlier, which appears in the list.

Step 6

Click Use this Folder

Step 7

Wait for Cloud Dataprep to initialize (less than a minute typically)

Step 1

In the Cloud Dataprep UI, click Create Flow

Step 2

Specify the following Flow details:

Name: NYC Taxi Cab Data Reporting

Description: Ingesting, Transforming, and Analyzing Taxi Data

Step 3

Click Import & Add Datasets

Step 4

In the data importer left side menu, click GCS (Google Cloud Storage)


Step 5

Click the Pencil Icon to edit the GCS path


Step 6

Paste in the Cloud Storage path for the 2015 taxi rides dataset CSV:

gs://asl-ml-immersion/nyctaxicab/tlc_yellow_trips_2015.csv

Step 7

Click on Go

Step 8

Repeat steps 5 - 7 with the 2016 data:

gs://asl-ml-immersion/nyctaxicab/tlc_yellow_trips_2016.csv

Step 9

Confirm two datasets are ready to be imported as shown:


Step 10

Click Import Datasets

Step 11

Wait for the datasets to be loaded into Dataprep.

The tool loads a 10 MB sample of the underlying data; it connects to and ingests the full data source when the flow is run.

Step 12

Confirm your Datasets now appear in Datasets (top level navigation)


Step 13

Under Flows (top level navigation), click on your existing Flow, and Add Datasets to Flow

Confirm your new Flow looks like the below

Step 14

Click on the script icon next to tlc_yellow_trips_2016 to start building transformation steps

Step 15

Click Edit Recipe

Wait for Dataprep to load your data sample into the explorer view

Step 16

In the explorer view, find the trip_distance column and examine the histogram. Which bucket did the vast majority of rides fall under?

Now, let's combine our 2016 and 2015 datasets.

Step 17

In the lower left, in the Choose Transformation box, type UNION

Step 18

On the Union page, click Add datasets, select tlc_yellow_trips_2015, then click Add datasets and choose Align by Name

Step 19

Confirm the union looks like below and then click Add to Recipe

Wait for Dataprep to Apply the Union

Now we have a single table with 2016 and 2015 taxicab data.
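
If it helps to see the equivalent operation in SQL, the union Dataprep just performed corresponds to a UNION ALL. A sketch, assuming the two tables had been loaded into BigQuery under hypothetical names:

#standardSQL
-- SQL analogue of the Dataprep union (project/table names are hypothetical)
SELECT * FROM `your-project.taxi_cab_reporting.tlc_yellow_trips_2016`
UNION ALL
SELECT * FROM `your-project.taxi_cab_reporting.tlc_yellow_trips_2015`;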

Step 20

Examine the pickup_time histogram. Are there any hours where business is slower than usual?

In our sample, the early morning hours (3 - 5am) had fewer taxicab pickups

Examine the pickup_day histogram. Which months and years of data do we have in our dataset?

Examine the dropoff_day histogram. Is there anything unusual about it when compared to pickup_day? Why are there records for January 2017?

Next, we want to concatenate our date and time fields into a single timestamp

Step 21

In the Choose Transformation box, type MERGE

For columns to merge, specify pickup_day and pickup_time

For the delimiter, enter a single space

Name the new column pickup_datetime

Click Add to Recipe

Confirm your new field is now properly recognized as a datetime data type (clock icon)
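
For intuition, this MERGE is the same idea as string concatenation in SQL. A sketch with a hypothetical table name (Dataprep infers the datetime type automatically; in SQL you would cast explicitly):

#standardSQL
-- SQL analogue of the MERGE step: build a single pickup_datetime string
SELECT
  CONCAT(CAST(pickup_day AS STRING), ' ', CAST(pickup_time AS STRING)) AS pickup_datetime
FROM `your-project.taxi_cab_reporting.tlc_yellow_trips`;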

Step 22

Next, we want to create a new derived column to track the average number of passengers in the last hour. To do that, we need to derive hourly data and perform a calculation.

In the transformer input box, type DERIVE

In the formula field, paste the following, which truncates the pickup time to just the hour:

DATEFORMAT(pickup_datetime,"yyyyMMddHH")

Leave Group By and Order By blank

Specify the New Column as hour_pickup_datetime

Confirm the new derived column is shown correctly in the preview

Click Add to Recipe
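
The same hour truncation can be expressed in BigQuery SQL with FORMAT_DATETIME, in case that form is more familiar (table name hypothetical):

#standardSQL
-- SQL analogue of the DERIVE step: keep only year/month/day/hour
SELECT
  FORMAT_DATETIME('%Y%m%d%H', pickup_datetime) AS hour_pickup_datetime
FROM `your-project.taxi_cab_reporting.tlc_yellow_trips`;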

Step 23

In order to get the field properly recognized as a DATETIME data type, we are going to add back zero minutes and zero seconds through a MERGE concatenation.

Add a new transformation: in the Choose Transformation box, type MERGE

Merge hour_pickup_datetime and '0000'

Click Add to Recipe

Step 24

Select the drop down arrow next to your newly created column name and click Rename

Rename the column to pickup_hour

We now have our taxicab hourly pickup column. Next, we will calculate hourly summary statistics (including the average passenger count) through aggregations, and then a rolling average of fares using a window function.

Step 25

Add a new transformation, type and choose Aggregate

For Functions, specify the below:

SUM(passenger_count)

AVERAGE(passenger_count)

SUM(trip_distance)

AVERAGE(trip_distance)

SUM(fare_amount)

AVERAGE(fare_amount)

MAX(fare_amount)

For Group by, specify:

pickup_hour

Click Add to Recipe

We now have our summary statistics table!
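
As a cross-check, the aggregation above maps directly onto a SQL GROUP BY. A sketch with a hypothetical table name:

#standardSQL
-- SQL analogue of the Dataprep aggregation (table name hypothetical)
SELECT
  pickup_hour,
  SUM(passenger_count) AS sum_passenger_count,
  AVG(passenger_count) AS average_passenger_count,
  SUM(trip_distance)   AS sum_trip_distance,
  AVG(trip_distance)   AS average_trip_distance,
  SUM(fare_amount)     AS sum_fare_amount,
  AVG(fare_amount)     AS average_fare_amount,
  MAX(fare_amount)     AS max_fare_amount
FROM `your-project.taxi_cab_reporting.tlc_yellow_trips`
GROUP BY pickup_hour;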

Step 26

Explore the average_fare_amount histogram. Is there a range of fares that are most common?

In our sample, most NYC cab fares are in the $18-19 range

Next, we want to calculate a rolling window of average fares over the past 3 hours.

Step 27

Add a new transformation, type and choose Window

Copy in the formula below, which computes the rolling average of the hourly average fare over a three-hour window

Formula:

ROLLINGAVERAGE(average_fare_amount, 3, 0)

Order By

-pickup_hour

Note that we are sorting recent taxicab rides first (the negative sign -pickup_hour indicates descending order) and operating over a rolling 3 hour period.
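
In SQL terms, ROLLINGAVERAGE with these arguments behaves like an AVG window function over the current row and the three rows before it. A sketch against a hypothetical hourly summary table:

#standardSQL
-- Window-function analogue of ROLLINGAVERAGE(average_fare_amount, 3, 0)
SELECT
  pickup_hour,
  AVG(average_fare_amount) OVER (
    ORDER BY pickup_hour DESC
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS average_3hr_rolling_fare
FROM `your-project.taxi_cab_reporting.hourly_summary`;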

Step 28

Find the field you just created (titled window) and rename it to average_3hr_rolling_fare

Click Add to Recipe

Step 29

Expand the script icon on the right side and confirm your final recipe

Step 30

Click Run Job

Step 31

On the Publishing Actions page, under Settings, edit the path by clicking the pencil icon

Choose BigQuery and choose your taxi_cab_reporting BigQuery dataset where you want to create the output table

Choose Create a new table

Name the table tlc_yellow_trips_reporting

Choose Replace the data every run

Select Save Settings

Step 32

Select Run Job

Step 33

View the Cloud Dataflow job by hovering over the Job Card and selecting the icon

Step 34

While your Dataflow job runs, you can preview the expected results by running this query against a pre-built sample table in BigQuery:

#standardSQL
SELECT
  pickup_hour,
  FORMAT("$%.2f",ROUND(average_3hr_rolling_fare,2)) AS avg_recent_fare,
  ROUND(average_trip_distance,2) AS average_trip_distance_miles,
  FORMAT("%'d",sum_passenger_count) AS total_passengers_by_hour
FROM
  `asl-ml-immersion.demo.nyc_taxi_reporting`
ORDER BY
  pickup_hour DESC;
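
Once your own job finishes, you can spot-check the table Dataprep wrote (the dataset and table names come from the Publishing Actions step above):

#standardSQL
-- Spot-check the output table created by the Dataprep job
SELECT *
FROM `taxi_cab_reporting.tlc_yellow_trips_reporting`
ORDER BY pickup_hour DESC
LIMIT 10;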

Congratulations! You have now built a Dataflow pipeline using the Cloud Dataprep UI.