在 BigQuery Studio 笔记本中使用 Apache Spark 进行无服务器数据转换

1. 简介

在此实验中,您将学习如何在熟悉的 BigQuery Studio 界面中利用 Apache Spark 的强大功能进行数据转换。您将从 BigQuery 读取数据,使用 PySpark 执行数据清理和转换,并将结果写回新的 BigQuery 表,所有这些操作都将在一个笔记本中完成。

在此 Codelab 中,您将采用以下分步方法:

  1. 准备好您的 Google Cloud 项目,并在其中启用所有必需的 API
  2. 为临时文件夹创建 GCS 存储分区
  3. 导入运行 Apache Spark 所需的库
  4. 使用 BigQuery 连接器初始化 Spark 会话
  5. 从 BigQuery 公共数据集读取 Google Analytics 示例数据
  6. 使用按设备浏览器汇总的数据(基本指标)转换数据
  7. 使用包含收入计算的流量来源分析来转换数据
  8. 使用地理位置分析转换数据
  9. 将转换后的数据写入 BigQuery 表

架构概览

186f332da87c2ef3.png

前提条件

  • 启用了结算功能的 Google Cloud Platform (GCP) 项目。
  • 您的 GCP 项目中已启用 BigQuery API 和 BigQuery Connection API。
  • 具备 SQL 和 Python 基础知识。

学习内容

  • 如何在 BigQuery Studio 笔记本中使用 Apache Spark 提取数据
  • 如何在 BigQuery Studio 笔记本中使用 Apache Spark 转换或汇总数据
  • 如何在 BigQuery Studio 笔记本中使用 Apache Spark 在转换或汇总数据后写入数据

所需条件

  • Chrome 网络浏览器
  • Gmail 账号
  • 启用了结算功能的 Cloud 项目

2. 基本设置和要求

自定进度的环境设置

  1. 登录 Google Cloud 控制台,然后创建一个新项目或重复使用现有项目。如果您还没有 Gmail 或 Google Workspace 账号,则必须创建一个

fbef9caa1602edd0.png

97bdebccea2ba4be.png

5e3ff691252acf41.png

  • 项目名称是此项目参与者的显示名称。它是 Google API 尚未使用的字符串。您可以随时对其进行更新。
  • 项目 ID 在所有 Google Cloud 项目中是唯一的,并且是不可变的(一经设置便无法更改)。Cloud 控制台会自动生成一个唯一字符串;通常情况下,您无需关注该字符串。在大多数 Codelab 中,您都需要引用项目 ID(通常用 PROJECT_ID 标识)。如果您不喜欢生成的 ID,可以再随机生成一个 ID。或者,您也可以尝试自己的项目 ID,看看是否可用。完成此步骤后便无法更改该 ID,并且此 ID 在项目期间会一直保留。
  • 此外,还有第三个值,即部分 API 使用的项目编号,供您参考。如需详细了解所有这三个值,请参阅文档
  1. 接下来,您需要在 Cloud 控制台中启用结算功能,以便使用 Cloud 资源/API。运行此 Codelab 应该不会产生太多的费用(如果有的话)。若要关闭资源以避免产生超出本教程范围的结算费用,您可以删除自己创建的资源或删除项目。Google Cloud 新用户符合参与 300 美元免费试用计划的条件。

3. 准备工作

启用 API

在使用 BigQuery Studio 笔记本之前,我们必须启用以下 API:

  • Compute Engine API
  • Dataform API
  • Vertex AI API

如需手动启用,请前往 BigQuery。在编辑器窗格的标签页栏中,点击 + 号旁边的下拉箭头,将指针悬停在笔记本上,然后选择 BigQuery 模板空白笔记本Spark 模板

2073fec24366e7c4.png

在“启用核心功能 API”窗口中,点击“在 BigQuery Unified API 中启用”

44dc4e398b4e8fb5.png

完成后,启用该功能,然后点击“关闭”。如需了解详情,请参阅启用 BigQuery Studio 以进行资产管理

4. 读取公共数据集

首先,我们将创建一个临时使用的 GCS 存储分区,以便在 BigQuery Studio 笔记本中运行 Spark。

  1. 在 Google Cloud 控制台中,前往 BigQuery
  2. 在编辑器窗格的标签页栏中,点击 + 号旁边的下拉箭头,将指针悬停在笔记本上,然后选择空白笔记本dc05f38b85ba6844.png
  3. 点击代码单元格,然后输入以下 CLI 脚本以创建 GCS 存储分区,接着点击“运行单元格”按钮或按 Shift + Enter
!gsutil mb -p <your_project_id> -c STANDARD -l US gs://ioxid2025-<your_project_id>

根据您在创建 Google Cloud 项目时选择的内容,更新 <your_project_id> 的值。将 <your_project_id> 的值更新为您的项目 ID,以创建唯一的 GCS 存储分区名称。之后,点击“运行单元”按钮或按 Shift + Enter 运行代码单元

接下来,我们将启动 Spark 会话。在此 Codelab 中,我们将使用 SparkSession 库,不过我们也可以使用 DataprocSession 来利用 Dataproc 功能在 BigQuery Studio 笔记本中运行 Spark

  1. 点击代码单元格,然后输入以下 CLI 脚本以初始化 Spark 会话。点击“运行单元”按钮或按 Shift + Enter
# Import required libraries 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col, sum, count, countDistinct, when, expr, date_format 
from pyspark.sql.types import DecimalType 

# Initialize Spark session with BigQuery connector 
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, countDistinct, when, expr, date_format
from pyspark.sql.types import DecimalType

# Initialize Spark session with BigQuery connector
spark = SparkSession.builder \
 .appName("Google Analytics ETL with Apache Spark") \
 .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.0") \
 .getOrCreate()

spark

预期输出

SparkSession - in-memory
SparkContext
Spark UI
Version
v3.5.4
Master
local[*]
AppName
Google Analytics ETL with Apache Spark
  1. 点击代码单元格,然后输入以下 CLI 脚本以设置 GCP 项目和 GCS 临时存储分区。
# Set GCP project and temporary bucket 
project_id = "your-gcp-project-id"  # Replace with your GCP project ID 
bucket = "your-gcs-bucket"  # Replace with your GCS bucket for temporary files spark.conf.set("temporaryGcsBucket", bucket)

Google Analytics 示例数据库

Google Analytics 示例数据库通过 Google Cloud 的公共数据集计划在 BigQuery 上提供。该数据集提供来自 Google Merchandise Store (销售 Google 品牌商品的真实电子商务商店)的 12 个月(2016 年 8 月至 2017 年 8 月)经过模糊处理的 Google Analytics 360 数据,并且已加载到 BigQuery 中。这是分析商家数据并了解使用 BigQuery 分析 Analytics 360 数据有何益处的绝佳方式。详细了解数据

该数据是电子商务网站将看到的典型数据,包括以下信息:

  • 流量来源数据:有关网站访问者来源的信息,包括有关自然流量、付费搜索流量和展示广告流量的数据
  • 内容数据:有关用户在网站上的行为的信息,例如访问者查看的网页的网址、他们与内容进行互动的方式等。
  • 交易数据:Google Merchandise Store 网站上的交易的相关信息。

运行以下代码以在 Apache Spark 中显示前 5 个示例数据

# EXTRACT: Read data from BigQuery
print("Extracting data from BigQuery...")
ga_df = spark.read.format("bigquery") \
   .option("table", "bigquery-public-data.google_analytics_sample.ga_sessions_20170801") \
   .load()

# Show schema sample data
print("Sample data:")
ga_df.show(5, truncate=False)

预期输出

Extracting data from BigQuery...
Sample data:
+---------+-----------+----------+--------------+--------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+------+--------+---------------+--------------------+
|visitorId|visitNumber|visitId   |visitStartTime|date    |totals                                                         |trafficSource                                                                                                                                                                          |device                                                                                                                                                                                                                                                                                                                                                                                                                      |geoNetwork                                                                                                                                                                                                                                                                     |customDimensions    |hits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |fullVisitorId      |userId|clientId|channelGrouping|socialEngagementType|
+---------+-----------+----------+--------------+--------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+------+--------+---------------+--------------------+
|NULL     |1          |1501591568|1501591568    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{NULL, (not set), (direct), (none), NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}                         |{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Europe, Southern Europe, Greece, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, tellas.gr, not available in demo dataset, not available in demo dataset, not available in demo dataset}          |[]                  |[{1, 0, 5, 46, NULL, true, true, true, https://www.google.gr/, {/google+redesign/bags/google+zipper+front+sports+bag.axd, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /bags/, /google+zipper+front+sports+bag.axd, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/bags/google+zipper+front+sports+bag.axd, shop.googlemerchandisestore.com/google+redesign/bags/google+zipper+front+sports+bag.axd, shop.googlemerchandisestore.com/google+redesign/bags/google+zipper+front+sports+bag.axd, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Bags, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]     |3418334011779872055|NULL  |NULL    |Organic Search |Not Socially Engaged|
|NULL     |2          |1501589647|1501589647    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 1}|{/analytics/web/, (not set), analytics.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}|{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Asia, Southern Asia, India, Maharashtra, (not set), Mumbai, not available in demo dataset, unknown.unknown, not available in demo dataset, not available in demo dataset, not available in demo dataset}                                                                      |[{4, APAC}]         |[{1, 0, 5, 14, NULL, true, true, true, https://analytics.google.com/analytics/web/, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]                                                                        |2474397855041322408|NULL  |NULL    |Referral       |Not Socially Engaged|
|NULL     |1          |1501616621|1501616621    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{/analytics/web/, (not set), analytics.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}|{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Europe, Northern Europe, United Kingdom, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, as9105.com, not available in demo dataset, not available in demo dataset, not available in demo dataset} |[{4, EMEA}]         |[{1, 0, 12, 43, NULL, true, true, true, https://analytics.google.com/analytics/web/?utm_source=demoaccount&utm_medium=demoaccount&utm_campaign=demoaccount, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]|5870462820713110108|NULL  |NULL    |Referral       |Not Socially Engaged|
|NULL     |1          |1501601200|1501601200    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{/analytics/web/, (not set), analytics.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}|{Firefox, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop}|{Americas, Northern America, United States, Texas, Dallas-Ft. Worth TX, Dallas, not available in demo dataset, h5colo.com, not available in demo dataset, not available in demo dataset, not available in demo dataset}                                                        |[{4, North America}]|[{1, 0, 8, 26, NULL, true, true, true, https://analytics.google.com/analytics/web/, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]                                                                        |9397809171349480379|NULL  |NULL    |Referral       |Not Socially Engaged|
|NULL     |1          |1501615525|1501615525    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{/analytics/web/, (not set), adwords.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}  |{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Americas, Northern America, United States, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, (not set), not available in demo dataset, not available in demo dataset, not available in demo dataset}|[{4, North America}]|[{1, 0, 12, 25, NULL, true, true, true, https://adwords.google.com/analytics/web/?__o=cues&authuser=0, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]                                                     |6089902943184578335|NULL  |NULL    |Referral       |Not Socially Engaged|
+---------+-----------+----------+--------------+--------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+------+--------+---------------+--------------------+
only showing top 5 rows

5. 按设备浏览器汇总数据(基本指标)

此代码按设备浏览器汇总 Google Analytics 数据,计算各种指标(例如总会话数、访问次数、命中次数、网页浏览次数、跳出次数、网站停留时间、唯一身份访问者数和每次访问的平均时间),然后重命名列并对结果进行排序。

print("Transformation 1: Aggregating by device browser...")
device_agg = ga_df.groupBy("device.browser") \
   .agg(
       count("*").alias("total_sessions"),
       sum("totals.visits").alias("total_visits"),
       sum("totals.hits").alias("total_hits"),
       sum("totals.pageviews").alias("total_pageviews"),
       sum("totals.bounces").alias("total_bounces"),
       sum("totals.timeOnSite").alias("total_time_on_site"),
       countDistinct("fullVisitorId").alias("unique_visitors"),
       (sum("totals.timeOnSite")/sum("totals.visits")).alias("avg_time_per_visit")
   ) \
   .withColumnRenamed("browser", "device_browser") \
   .orderBy("total_sessions", ascending=False)

# Show sample transformed data
print("\nDevice Aggregation Sample:")
device_agg.show(5)

预期输出

Transformation 1: Aggregating by device browser...

Device Aggregation Sample:
+-----------------+--------------+------------+----------+---------------+-------------+------------------+---------------+------------------+
|   device_browser|total_sessions|total_visits|total_hits|total_pageviews|total_bounces|total_time_on_site|unique_visitors|avg_time_per_visit|
+-----------------+--------------+------------+----------+---------------+-------------+------------------+---------------+------------------+
|           Chrome|          1900|        1900|     10896|           8956|          870|            354691|           1689|186.67947368421054|
|           Safari|           397|         397|      1260|           1137|          218|             43150|            364|108.69017632241814|
|          Firefox|           101|         101|       390|            343|           59|             10659|             95|105.53465346534654|
|Internet Explorer|            54|          54|       107|            102|           36|              5589|             50|             103.5|
|             Edge|            23|          23|        63|             55|           12|              2623|             23|114.04347826086956|
+-----------------+--------------+------------+----------+---------------+-------------+------------------+---------------+------------------+
only showing top 5 rows

6. 通过收入计算分析流量来源

此代码按流量来源和媒介分析 Google Analytics 数据,计算会话数、交易次数、总收入、每次会话收入和唯一身份访问者等指标,然后重命名列并按总收入对结果进行排序。

print("Transformation 2: Analyzing traffic sources...")
traffic_source_agg = ga_df.groupBy("trafficSource.source", "trafficSource.medium") \
   .agg(
       count("*").alias("session_count"),
       sum(when(col("totals.transactions").isNotNull(), 1).otherwise(0)).alias("transactions"),
       sum("totals.totalTransactionRevenue").cast(DecimalType(20,2)).alias("total_revenue"),
       (sum("totals.totalTransactionRevenue")/count("*")).cast(DecimalType(10,2)).alias("revenue_per_session"),
       countDistinct("fullVisitorId").alias("unique_visitors")
   ) \
   .withColumnRenamed("source", "traffic_source") \
   .withColumnRenamed("medium", "traffic_medium") \
   .orderBy("total_revenue", ascending=False)

print("\nTraffic Source Aggregation Sample:")
traffic_source_agg.show(5)

预期输出

Transformation 2: Analyzing traffic sources...

Traffic Source Aggregation Sample:
+--------------------+--------------+-------------+------------+-------------+-------------------+---------------+
|      traffic_source|traffic_medium|session_count|transactions|total_revenue|revenue_per_session|unique_visitors|
+--------------------+--------------+-------------+------------+-------------+-------------------+---------------+
|            (direct)|        (none)|         2166|          42|8872040000.00|         4096048.01|           1943|
|     mail.google.com|      referral|            2|           1|  17960000.00|         8980000.00|              2|
|       google.com.tw|      referral|            1|           0|         NULL|               NULL|              1|
|analytics.google.com|      referral|           57|           0|         NULL|               NULL|             53|
|           quora.com|      referral|            6|           0|         NULL|               NULL|              5|
+--------------------+--------------+-------------+------------+-------------+-------------------+---------------+
only showing top 5 rows

7. 地理位置分析

此代码按国家/地区对 Google Analytics 数据执行地理位置分析,计算会话数、总网页浏览量、网站总时长、每次会话的平均时长和唯一身份访问者数,然后重命名列并按会话数排序。

print("Transformation 3: Geographic analysis...")
geo_agg = ga_df.groupBy("geoNetwork.country", "geoNetwork.region") \
   .agg(
       count("*").alias("session_count"),
       sum("totals.pageviews").alias("total_pageviews"),
       sum("totals.timeOnSite").alias("total_time_on_site"),
       (sum("totals.timeOnSite")/count("*")).alias("avg_time_per_session"),
       countDistinct("fullVisitorId").alias("unique_visitors")
   ) \
   .withColumnRenamed("country", "country") \
   .withColumnRenamed("region", "region") \
   .orderBy("session_count", ascending=False)

print("\nGeographic Aggregation Sample:")
geo_agg.show(5)

预期输出

Transformation 3: Geographic analysis...

Geographic Aggregation Sample:
+--------------+--------------------+-------------+---------------+------------------+--------------------+---------------+
|       country|              region|session_count|total_pageviews|total_time_on_site|avg_time_per_session|unique_visitors|
+--------------+--------------------+-------------+---------------+------------------+--------------------+---------------+
| United States|not available in ...|          564|           2326|             97829|  173.45567375886526|            494|
| United States|          California|          420|           3102|            116563|   277.5309523809524|            347|
| United States|            New York|          109|            845|             39976|   366.7522935779817|             84|
|United Kingdom|not available in ...|           82|            161|              7791|   95.01219512195122|             79|
|         India|not available in ...|           62|            139|              2869|  46.274193548387096|             61|
+--------------+--------------------+-------------+---------------+------------------+--------------------+---------------+
only showing top 5 rows

8. 基于时间的分析

此代码通过从 visitStartTime 列中提取小时数来执行基于时间的分析,然后按小时对数据进行分组,以计算每个小时的会话数、交易次数、总收入和总网页浏览量,最后按小时对结果进行排序。

print("Transformation 4: Time-based analysis...")
hourly_agg = ga_df.withColumn("hour", date_format(col("visitStartTime").cast("timestamp"), "H")) \
   .groupBy("hour") \
   .agg(
       count("*").alias("session_count"),
       sum("totals.transactions").alias("transactions"),
       sum("totals.totalTransactionRevenue").cast(DecimalType(20,2)).alias("total_revenue"),
       sum("totals.pageviews").alias("total_pageviews")
   ) \
   .orderBy("hour")

print("\nHourly Aggregation Sample:")
hourly_agg.show(5)

预期输出

Transformation 4: Time-based analysis...

Hourly Aggregation Sample:
+----+-------------+------------+-------------+---------------+
|hour|session_count|transactions|total_revenue|total_pageviews|
+----+-------------+------------+-------------+---------------+
|   0|           87|        NULL|         NULL|            372|
|   1|          102|        NULL|         NULL|            494|
|  10|           67|        NULL|         NULL|            149|
|  11|           73|        NULL|         NULL|            167|
|  12|           99|        NULL|         NULL|            313|
+----+-------------+------------+-------------+---------------+
only showing top 5 rows

9. 将结果写入 BigQuery 表

此代码使用直接写入方法将四个汇总数据框(device_aggtraffic_source_agggeo_agghourly_agg)导出到 Google BigQuery 中的单独表格,如果存在现有表格,则覆盖这些表格。

# Write to BigQuery tables
print("\nLoading data to BigQuery...")

# Set output tables
device_output_table = f"{project_id}.analytics_sample.device_aggregation"
traffic_output_table = f"{project_id}.analytics_sample.traffic_source_aggregation"
geo_output_table = f"{project_id}.analytics_sample.geo_aggregation"
hourly_output_table = f"{project_id}.analytics_sample.hourly_aggregation"

dataset_id = "demo"  # Replace with your BigQuery dataset ID 

# Set BigQuery output table
device_output_table = f"{project_id}.{dataset_id}.device_aggregation"
traffic_output_table = f"{project_id}.{dataset_id}.traffic_source_aggregation"
geo_output_table = f"{project_id}.{dataset_id}.geo_aggregation"
hourly_output_table = f"{project_id}.{dataset_id}.hourly_aggregation"

# Write each DataFrame to BigQuery
device_agg.write \
   .format("bigquery") \
   .option("table", device_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

traffic_source_agg.write \
   .format("bigquery") \
   .option("table", traffic_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

geo_agg.write \
   .format("bigquery") \
   .option("table", geo_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

hourly_agg.write \
   .format("bigquery") \
   .option("table", hourly_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

检查 BigQuery 中的输出表,确保数据在经过一些转换后已成功保存

6227a951f1d1d4e6.png

device_aggregation 表

da7791f5d15f67be.png

地理位置汇总表

76ec9d9dc9929c1c.png

hourly_aggregation 表

bcf627ef6a13c8cc.png

traffic_source_aggregation

f373d64ace332074.png

10. 编排 BigQuery Studio 笔记本代码(可选)

您可以通过以下方式编排 BigQuery Studio 笔记本代码:

在此 Codelab 中,我们将使用 Google Cloud 控制台中的“安排笔记本”代码。

  1. 笔记本工具栏中,点击安排
  2. 安排笔记本窗格中的时间表名称字段中,输入时间表的名称。
  3. 身份验证部分,使用您的 Google 账号用户凭证或服务账号授权笔记本。
  4. 如需使用您的 Google 账号用户凭证(预览版),请选择使用我的用户凭证执行
  5. 如需使用服务账号,请选择使用所选服务账号执行,然后选择一个服务账号。
  6. 笔记本选项部分的运行时模板字段中,选择一个 Colab 笔记本运行时模板或默认运行时规范。如需详细了解如何创建 Colab 笔记本运行时模板,请参阅创建运行时模板
  7. Cloud Storage 存储分区字段中,点击浏览,然后选择或创建 Cloud Storage 存储分区。所选服务账号必须在所选存储分区中获得 Storage Admin (roles/storage.admin) IAM 角色。如需了解详情,请参阅启用笔记本时间安排
  8. 时间表频率部分中,执行以下操作:
  9. 重复频率菜单中,选择安排的笔记本运行的频率。
  10. 时间字段中,为安排的笔记本运行输入时间。
  11. 时区菜单中,选择时间表的时区。
  12. 点击创建时间表。 如果您选择了使用我的用户凭证执行作为身份验证方法,则必须授权您的 Google 账号预览版)。

f3d87c9a3408b08d.png

11. 清理

为避免系统因本 Codelab 中使用的资源向您的 Google Cloud 账号收取费用,请按照以下步骤操作:

  1. 在 Google Cloud 控制台中,前往管理资源页面。
  2. 在项目列表中,选择要删除的项目,然后点击删除
  3. 在对话框中输入项目 ID,然后点击关停以删除项目。

12. 恭喜

您已在 BigQuery Studio 笔记本中利用 Apache Spark 无服务器功能完成了数据转换和分析。在此过程中,您探索了 Google Cloud 的公共数据集,在 BigQuery 工作室笔记本中通过 Apache Spark 执行了无服务器 ETL,并编排了 BigQuery 工作室笔记本。真了不起!

后续步骤:

  • 使用服务账号编排笔记本,以实现自动化。
  • 添加了在运行 ETL 作业时监控时长的脚本。
  • 部署笔记本
  • 使用 DataprocSparkSession 在 BigQuery Studio Notebook 中利用 Serverless Dataproc 的真正分布式 Apache Spark 功能。
  • 在 BigQuery 工作室中创建 Apache Spark 的存储过程。这样,您就可以应用 OOP 原则来构建 PySpark 代码,从而提高代码的组织性、可重用性和可维护性。

参考编号