In this lab you will ingest, transform, and analyze a taxi cab dataset using Google Cloud Dataprep. We will calculate key reporting metrics like the average number of passengers picked up in the past hour.

What you need

You must have completed Lab 0 and have the following:

What you learn

In this lab, you:

Cloud Dataprep is Google's self-service data preparation tool. In this lab, you will learn how to clean and enrich multiple datasets using Cloud Dataprep.

Step 1

Open BigQuery at

Step 2

In the left side bar, hover over your project name

Step 3

Click on the down arrow that appears to the right

Step 4

In the drop down menu, select Create new dataset

Step 5

For Dataset ID, type `taxi_cab_reporting`

Step 6

Click OK (leave the remaining fields at their default values)

Now you have a new empty dataset that we can populate with tables.

Step 1

Open the Google Cloud Console at

Step 2

Open the Products & Services menu in the left-side navigation

Step 3

Under Big Data, click on Dataprep

Step 4

Click Allow for Trifacta to access project data. Dataprep is provided in collaboration with Trifacta, a Google partner.

Step 5

Wait for Cloud Dataprep to initialize (less than a minute typically)

Step 1

In the Cloud Dataprep UI, click Create Flow

Step 2

Specify the following Flow details:

Flow Name: NYC Taxi Cab Data Reporting

Flow Description: Ingesting, Transforming, and Analyzing Taxi Data

Step 3

Click Import & Add Datasets

Step 4

In the data importer left side menu, click GCS (Google Cloud Storage)

Step 5

Click the Pencil Icon to edit the GCS path

Step 6

Paste in the 2015 taxi rides dataset CSV from Google Cloud Storage:


Step 7

Click on Go

Step 8

Repeat steps 5 - 7 with the 2016 data:


Step 9

Confirm two datasets are ready to be imported as shown:

Step 10

Click Import Datasets

Step 11

Wait for the datasets to be loaded into Dataprep.

The tool loads a 10 MB sample of the underlying data; it connects to and ingests the full data source when the flow is run.

Step 12

Confirm your Datasets now appear in Datasets (top level navigation)

Step 13

Under Flows (top level navigation), click on your existing Flow, and Add Datasets to Flow

Confirm your new Flow looks like the below

Step 14

Click on the script icon next to tlc_yellow_trips_2016 to start building transformation steps

Step 15

Click Edit Recipe

Wait for Dataprep to load your data sample into the explorer view

Step 16

In the explorer view, find the trip_distance column and examine the histogram. Which bucket did the vast majority of rides fall under?

Now, let's combine our 2016 and 2015 datasets.

Step 17

In the top right, click the search icon and type UNION

Step 18

On the Union page, click Add datasets, select tlc_yellow_trips_2015, then click Add datasets and choose Align by Name

Step 19

Confirm the union looks like below and then click Add to Recipe

Wait for Dataprep to Apply the Union

Now we have a single table with 2016 and 2015 taxicab data.

Step 20

Examine the pickup_time histogram. Are there any hours where business is slower than usual?

In our sample, the early morning hours (3 - 5am) had fewer taxicab pickups

Examine the pickup_day histogram. Which months and years of data do we have in our dataset?

Examine the dropoff_day histogram. Is there anything unusual about it when compared to pickup_day? Why are there records for January 2017?

Rides that began late on December 31, 2016 did not end until after midnight, so their drop-off dates fall in January 2017.

Next, we want to concatenate our date and time fields into a single timestamp

Step 21

In the Choose Transformation box, type MERGE

For columns to merge, specify pickup_day and pickup_time

For delimiter put a single space

Name the new column pickup_datetime

Click Add to Recipe

Confirm your new field is properly registering now as a datetime datatype (clock icon)
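Conceptually, the MERGE step in Step 21 just concatenates the two fields with the space delimiter and lets the tool re-infer the type. A minimal stdlib sketch of the same idea (the YYYY-MM-DD and HH:MM:SS column formats here are assumptions for illustration):

```python
from datetime import datetime

def merge_pickup_datetime(pickup_day: str, pickup_time: str) -> datetime:
    """Concatenate date and time fields with a space delimiter, then parse.

    Mirrors the Dataprep MERGE step; the input formats are assumed.
    """
    merged = f"{pickup_day} {pickup_time}"  # e.g. "2016-01-15 14:32:07"
    return datetime.strptime(merged, "%Y-%m-%d %H:%M:%S")

print(merge_pickup_datetime("2016-01-15", "14:32:07"))
```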

Step 22

Next, we want to create a new formula column to count the average number of passengers in the last hour. To do that, we need to truncate pickups to the hour and then perform a calculation.

In the search box, type formula

In the formula, paste the following which will truncate the pickup time to just the hour:


Specify the New Column as hour_pickup_datetime

Confirm the new derived column is shown correctly in the preview

Click Add to Recipe

Step 23

In order to get the field properly recognized as a DATETIME data type, we are going to add back zero minutes and zero seconds through a MERGE concatenation.

Add a new transformation in the Choose Transformation box, type MERGE

Merge hour_pickup_datetime and '0000'

For new column name put pickup_hour

Click Add to Recipe
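Taken together, Steps 22 and 23 truncate each pickup timestamp to the hour and then restore zero minutes and seconds so the result parses as a DATETIME again. A stdlib sketch of that combined effect (the exact Dataprep formula and string formats are assumptions):

```python
from datetime import datetime

def pickup_hour(dt: datetime) -> datetime:
    """Truncate a pickup timestamp to the hour, then append '0000'
    (zero minutes, zero seconds) so it parses back into a datetime --
    the same effect as the lab's formula plus MERGE steps.
    """
    truncated = dt.strftime("%Y%m%d%H")  # keep only up to the hour
    return datetime.strptime(truncated + "0000", "%Y%m%d%H%M%S")

print(pickup_hour(datetime(2016, 1, 15, 14, 32, 7)))
```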

We now have our taxicab hourly pickup column. Next, we will calculate the average count of passengers over the past hour. We will do this through aggregations and a rolling window average function.

Step 25

Add a new transformation, type and choose Group By

For Group by, specify:


For Values, specify the below:








Click Add to Recipe

We now have our summary statistics table!
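Conceptually, the Group By step buckets rows by pickup_hour and computes one aggregate row per hour. A stdlib sketch with two illustrative aggregates (the full Values list is whatever you configured in the Dataprep UI; the trip dictionary keys here are assumptions):

```python
from collections import defaultdict

def hourly_summary(trips):
    """Group trip records by pickup_hour and compute per-hour
    average fare and total passenger count, mirroring the lab's
    average_fare_amount and sum_passenger_count columns.
    """
    groups = defaultdict(list)
    for trip in trips:
        groups[trip["pickup_hour"]].append(trip)
    summary = {}
    for hour, rows in groups.items():
        summary[hour] = {
            "average_fare_amount": sum(r["fare_amount"] for r in rows) / len(rows),
            "sum_passenger_count": sum(r["passenger_count"] for r in rows),
        }
    return summary

trips = [
    {"pickup_hour": "2016-01-15 14:00", "fare_amount": 18.0, "passenger_count": 2},
    {"pickup_hour": "2016-01-15 14:00", "fare_amount": 20.0, "passenger_count": 1},
    {"pickup_hour": "2016-01-15 15:00", "fare_amount": 9.5, "passenger_count": 3},
]
print(hourly_summary(trips))
```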

Step 26

Explore the average_fare_amount histogram. Is there a range of fares that are most common?

In our sample, most NYC cab fares are in the $18-19 range

Next, we want to calculate a rolling window of average fares over the past 3 hours.

Step 27

Add a new transformation, type and choose Window

Copy in the below formula, which computes the rolling average of the fare amount over the last three hours:


ROLLINGAVERAGE(average_fare_amount, 3, 0)

Order By


Note that we are sorting recent taxicab rides first (the negative sign -pickup_hour indicates descending order) and operating over a rolling 3 hour period.
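A minimal sketch of what ROLLINGAVERAGE(average_fare_amount, 3, 0) computes per row, assuming the Trifacta semantics of (rows before, rows after) window bounds: each output value averages the current row with up to three preceding rows in the sorted order.

```python
def rolling_average(values, rows_before=3):
    """For each position, average the current value and up to
    `rows_before` preceding values -- a sketch of ROLLINGAVERAGE(col, 3, 0).
    """
    out = []
    for i in range(len(values)):
        window = values[max(0, i - rows_before): i + 1]
        out.append(sum(window) / len(window))
    return out

# Fares sorted by -pickup_hour (most recent hour first), as in the lab:
print(rolling_average([20.0, 18.0, 16.0, 14.0]))  # [20.0, 19.0, 18.0, 17.0]
```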

Step 28

Find the field you just created (titled window) and rename it to average_3hr_rolling_fare

Click Add to Recipe

Step 29

Expand the recipe icon in the top right and confirm your final recipe

Step 30

Click Run Job

Step 31

In Publishing Actions, under Settings, edit the path by clicking the pencil icon

Choose BigQuery and choose your taxi_cab_reporting BigQuery dataset where you want to create the output table

Choose Create a new table

Name the table tlc_yellow_trips_reporting

Choose Create new table every run

Select Update

Step 32

Select Run Job

Step 33

Go to Cloud Console and navigate to Cloud Dataflow to see the job running

Step 34

While your Dataflow job runs, you can see the data results by running this query in BigQuery:

SELECT
  pickup_hour,
  FORMAT("$%.2f",ROUND(average_3hr_rolling_fare,2)) AS avg_recent_fare,
  ROUND(average_trip_distance,2) AS average_trip_distance_miles,
  FORMAT("%'d",sum_passenger_count) AS total_passengers_by_hour
FROM
  taxi_cab_reporting.tlc_yellow_trips_reporting
ORDER BY
  pickup_hour DESC;
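The FORMAT expressions in the query can be sanity-checked with their Python equivalents (illustrative helper names; `"$%.2f"` prints a dollar amount with two decimals, `"%'d"` adds thousands separators):

```python
def avg_recent_fare(x: float) -> str:
    """Python equivalent of BigQuery FORMAT("$%.2f", ROUND(x, 2))."""
    return f"${round(x, 2):.2f}"

def total_passengers(n: int) -> str:
    """Python equivalent of BigQuery FORMAT("%'d", n)."""
    return f"{n:,d}"

print(avg_recent_fare(18.3456))   # $18.35
print(total_passengers(1234567))  # 1,234,567
```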

Congratulations! You have now built a Dataflow pipeline using the Cloud Dataprep UI.