在 BigQuery Studio 筆記本中,使用 Apache Spark 進行無伺服器資料轉換

1. 簡介

在本程式碼研究室中,您將瞭解如何在 BigQuery Studio 熟悉的介面中,運用 Apache Spark 的強大功能轉換資料。您將從 BigQuery 讀取資料、使用 PySpark 執行資料清理和轉換作業,然後將結果寫回新的 BigQuery 資料表,所有作業都會在單一筆記本中完成。

在本程式碼研究室中,您將逐步完成下列步驟:

  1. 準備 Google Cloud 專案,並啟用所有必要 API
  2. 為暫時資料夾建立 GCS 值區
  3. 匯入執行 Apache Spark 時所需的程式庫
  4. 使用 BigQuery 連接器初始化 Spark 工作階段
  5. 從 BigQuery 公開資料集讀取 Google Analytics 樣本資料
  6. 根據裝置瀏覽器匯總資料轉換資料 (基本指標)
  7. 使用流量來源分析和收益計算功能轉換資料
  8. 使用地理分析功能轉換資料
  9. 將轉換後的資料寫入 BigQuery 資料表

架構總覽

186f332da87c2ef3.png

必要條件

  • 啟用計費功能的 Google Cloud Platform (GCP) 專案。
  • 在 GCP 專案中啟用 BigQuery API 和 BigQuery Connection API。
  • 具備 SQL 和 Python 的基本知識。

課程內容

  • 如何在 BigQuery Studio 筆記本中使用 Apache Spark 擷取資料
  • 如何在 BigQuery Studio Notebook 中使用 Apache Spark 轉換或匯總資料
  • 如何在 BigQuery Studio 筆記本中使用 Apache Spark 轉換或匯總資料後寫入資料

軟硬體需求

  • Chrome 網路瀏覽器
  • Gmail 帳戶
  • 已啟用計費功能的 Cloud 專案

2. 基本設定和需求

自行設定環境

  1. 登入 Google Cloud 控制台,然後建立新專案或重複使用現有專案。如果沒有 Gmail 或 Google Workspace 帳戶,請建立帳戶

fbef9caa1602edd0.png

97bdebccea2ba4be.png

5e3ff691252acf41.png

  • 專案名稱是這個專案參與者的顯示名稱。這是 Google API 未使用的字元字串。你隨時可以更新。
  • 專案 ID 在所有 Google Cloud 專案中不得重複,且設定後即無法變更。Cloud 控制台會自動產生專屬字串,通常您不需要理會該字串。在大多數程式碼研究室中,您需要參照專案 ID (通常標示為 PROJECT_ID)。如果您不喜歡產生的 ID,可以產生另一個隨機 ID。你也可以嘗試使用自己的名稱,看看是否可用。完成這個步驟後就無法變更,且專案期間都會維持這個設定。
  • 請注意,部分 API 會使用第三個值,也就是專案編號。如要進一步瞭解這三種值,請參閱說明文件
  1. 接著,您需要在 Cloud 控制台中啟用帳單,才能使用 Cloud 資源/API。完成這個程式碼研究室的費用不高,甚至可能完全免費。如要關閉資源,避免在本教學課程結束後繼續產生費用,您可以刪除建立的資源或專案。Google Cloud 新使用者可參加價值$300 美元的免費試用計畫。

3. 事前準備

啟用 API

使用 BigQuery Studio 筆記本前,請務必啟用下列 API:

  • Compute Engine API
  • Dataform API
  • Vertex AI API

如要手動啟用,請前往 BigQuery。在編輯器窗格的索引標籤列中,按一下「+」符號旁的箭頭下拉式選單,將指標懸停在「Notebook」上,然後選取「BigQuery template」或「Empty Notebook」或「Spark template」

2073fec24366e7c4.png

在「啟用核心功能 API」視窗中,按一下「啟用 BigQuery Unified API」

44dc4e398b4e8fb5.png

完成後,請啟用並按一下「關閉」。詳情請參閱「啟用 BigQuery Studio 進行資產管理」一文。

4. 讀取公開資料集

首先,我們會建立暫時使用的 GCS 儲存空間,以便在 BigQuery Studio 筆記本中執行 Spark。

  1. 在 Google Cloud 控制台中,前往 BigQuery
  2. 在編輯器窗格的分頁列中,按一下「+」符號旁的箭頭下拉式選單,將指標懸停在「Notebook」上,然後選取「Empty Notebook」dc05f38b85ba6844.png
  3. 按一下程式碼儲存格,然後輸入下列 CLI 指令碼來建立 GCS 儲存空間,接著按一下「執行儲存格」按鈕或按下 Shift + Enter
!gsutil mb -p <your_project_id> -c STANDARD -l US gs://ioxid2025-<your_project_id>

根據您建立 Google Cloud 專案時選取的項目,更新 <your_project_id> 的值。更新 <your_project_id> 的值,並使用專案 ID 建立 GCS 不重複 bucket 名稱。然後點選「執行儲存格」按鈕或按下 Shift + Enter 鍵,執行程式碼儲存格

接著,我們會啟動 Spark 工作階段。在本程式碼研究室中,我們會使用 SparkSession 程式庫,但您也可以使用 DataprocSession,在 BigQuery Studio 筆記本中運用 Dataproc 功能執行 Spark

  1. 按一下程式碼儲存格,然後輸入下列 CLI 指令碼,初始化 Spark 工作階段。按一下「執行儲存格」按鈕,或按下 Shift + Enter 鍵。
# Import required libraries 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col, sum, count, countDistinct, when, expr, date_format 
from pyspark.sql.types import DecimalType 

# Initialize Spark session with BigQuery connector 
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, countDistinct, when, expr, date_format
from pyspark.sql.types import DecimalType

# Initialize Spark session with BigQuery connector
spark = SparkSession.builder \
 .appName("Google Analytics ETL with Apache Spark") \
 .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.0") \
 .getOrCreate()

spark

預期輸出內容:

SparkSession - in-memory
SparkContext
Spark UI
Version
v3.5.4
Master
local[*]
AppName
Google Analytics ETL with Apache Spark
  1. 點選程式碼儲存格,然後輸入下列 CLI 指令碼,設定 GCP 專案和 GCS 暫時值區。
# Set GCP project and temporary bucket 
project_id = "your-gcp-project-id"  # Replace with your GCP project ID 
bucket = "your-gcs-bucket"  # Replace with your GCS bucket for temporary files spark.conf.set("temporaryGcsBucket", bucket)

Google Analytics 範例資料庫

Google Analytics 範例資料庫是透過 Google Cloud 的公開資料集計畫,在 BigQuery 上提供。這個資料集提供 12 個月 (2016 年 8 月至 2017 年 8 月) 的 BigQuery 資料,內含來自 Google 商品網路商店 (銷售 Google 品牌商品的真實電子商務商店) 的 Google Analytics 360 資料 (已模糊處理)。如此一來,您就能輕鬆分析業務資料,並瞭解使用 BigQuery 分析 Analytics 360 資料有何益處。進一步瞭解資料

這些資料是電子商務網站將看到的典型資料,包括以下資訊:

  • 流量來源資料:與網站訪客來源相關的資訊,包括自然流量、付費搜尋流量和多媒體廣告流量的資料
  • 內容資料:與使用者在網站上的行為相關的資訊,例如訪客瀏覽的網頁網址、訪客與內容互動的方式等。
  • 交易資料:在 Google 商品網路商店網站交易的相關資訊。

執行下列程式碼,在 Apache Spark 中顯示前 5 筆資料範例

# EXTRACT: Read data from BigQuery
print("Extracting data from BigQuery...")
ga_df = spark.read.format("bigquery") \
   .option("table", "bigquery-public-data.google_analytics_sample.ga_sessions_20170801") \
   .load()

# Show schema sample data
print("Sample data:")
ga_df.show(5, truncate=False)

預期的輸出內容:

Extracting data from BigQuery...
Sample data:
+---------+-----------+----------+--------------+--------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+------+--------+---------------+--------------------+
|visitorId|visitNumber|visitId   |visitStartTime|date    |totals                                                         |trafficSource                                                                                                                                                                          |device                                                                                                                                                                                                                                                                                                                                                                                                                      |geoNetwork                                                                                                                                                                                                                                                                     |customDimensions    |hits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |fullVisitorId      |userId|clientId|channelGrouping|socialEngagementType|
+---------+-----------+----------+--------------+--------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+------+--------+---------------+--------------------+
|NULL     |1          |1501591568|1501591568    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{NULL, (not set), (direct), (none), NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}                         |{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Europe, Southern Europe, Greece, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, tellas.gr, not available in demo dataset, not available in demo dataset, not available in demo dataset}          |[]                  |[{1, 0, 5, 46, NULL, true, true, true, https://www.google.gr/, {/google+redesign/bags/google+zipper+front+sports+bag.axd, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /bags/, /google+zipper+front+sports+bag.axd, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/bags/google+zipper+front+sports+bag.axd, shop.googlemerchandisestore.com/google+redesign/bags/google+zipper+front+sports+bag.axd, shop.googlemerchandisestore.com/google+redesign/bags/google+zipper+front+sports+bag.axd, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Bags, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]     |3418334011779872055|NULL  |NULL    |Organic Search |Not Socially Engaged|
|NULL     |2          |1501589647|1501589647    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 1}|{/analytics/web/, (not set), analytics.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}|{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Asia, Southern Asia, India, Maharashtra, (not set), Mumbai, not available in demo dataset, unknown.unknown, not available in demo dataset, not available in demo dataset, not available in demo dataset}                                                                      |[{4, APAC}]         |[{1, 0, 5, 14, NULL, true, true, true, https://analytics.google.com/analytics/web/, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]                                                                        |2474397855041322408|NULL  |NULL    |Referral       |Not Socially Engaged|
|NULL     |1          |1501616621|1501616621    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{/analytics/web/, (not set), analytics.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}|{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Europe, Northern Europe, United Kingdom, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, as9105.com, not available in demo dataset, not available in demo dataset, not available in demo dataset} |[{4, EMEA}]         |[{1, 0, 12, 43, NULL, true, true, true, https://analytics.google.com/analytics/web/?utm_source=demoaccount&utm_medium=demoaccount&utm_campaign=demoaccount, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]|5870462820713110108|NULL  |NULL    |Referral       |Not Socially Engaged|
|NULL     |1          |1501601200|1501601200    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{/analytics/web/, (not set), analytics.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}|{Firefox, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop}|{Americas, Northern America, United States, Texas, Dallas-Ft. Worth TX, Dallas, not available in demo dataset, h5colo.com, not available in demo dataset, not available in demo dataset, not available in demo dataset}                                                        |[{4, North America}]|[{1, 0, 8, 26, NULL, true, true, true, https://analytics.google.com/analytics/web/, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]                                                                        |9397809171349480379|NULL  |NULL    |Referral       |Not Socially Engaged|
|NULL     |1          |1501615525|1501615525    |20170801|{1, 1, 1, NULL, 1, NULL, NULL, 1, NULL, NULL, NULL, NULL, 1}   |{/analytics/web/, (not set), adwords.google.com, referral, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, not available in demo dataset, NULL, NULL, NULL, NULL, NULL}, NULL, NULL}  |{Chrome, not available in demo dataset, not available in demo dataset, Windows, not available in demo dataset, false, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, NULL, not available in demo dataset, not available in demo dataset, not available in demo dataset, desktop} |{Americas, Northern America, United States, not available in demo dataset, not available in demo dataset, not available in demo dataset, not available in demo dataset, (not set), not available in demo dataset, not available in demo dataset, not available in demo dataset}|[{4, North America}]|[{1, 0, 12, 25, NULL, true, true, true, https://adwords.google.com/analytics/web/?__o=cues&authuser=0, {/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com, Page Unavailable, NULL, NULL, /google+redesign/, /shop+by+brand/, /youtube, }, NULL, NULL, NULL, {NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, shop.googlemerchandisestore.com/google+redesign/shop+by+brand/youtube, 0}, {NULL, true, NULL, NULL}, NULL, [], [], NULL, NULL, {0, 1, NULL}, [], NULL, [], [], [], PAGE, {NULL, NULL, NULL, NULL, (not set), NULL, No,  : }, NULL, NULL, {(not set), Brands, (not set), (not set), (not set), (entrance), (entrance), (entrance), (entrance), (entrance), NULL, 1, NULL, NULL, NULL}, web, []}]                                                     |6089902943184578335|NULL  |NULL    |Referral       |Not Socially Engaged|
+---------+-----------+----------+--------------+--------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+------+--------+---------------+--------------------+
only showing top 5 rows

5. 依裝置瀏覽器匯總資料 (基本指標)

這段程式碼會依裝置瀏覽器匯總 Google Analytics 資料,計算各種指標 (例如工作階段總數、造訪次數、點擊次數、網頁瀏覽次數、跳出次數、網站停留時間、不重複訪客和每次造訪的平均時間),然後重新命名資料欄並排序結果。

print("Transformation 1: Aggregating by device browser...")
device_agg = ga_df.groupBy("device.browser") \
   .agg(
       count("*").alias("total_sessions"),
       sum("totals.visits").alias("total_visits"),
       sum("totals.hits").alias("total_hits"),
       sum("totals.pageviews").alias("total_pageviews"),
       sum("totals.bounces").alias("total_bounces"),
       sum("totals.timeOnSite").alias("total_time_on_site"),
       countDistinct("fullVisitorId").alias("unique_visitors"),
       (sum("totals.timeOnSite")/sum("totals.visits")).alias("avg_time_per_visit")
   ) \
   .withColumnRenamed("browser", "device_browser") \
   .orderBy("total_sessions", ascending=False)

# Show sample transformed data
print("\nDevice Aggregation Sample:")
device_agg.show(5)

預期的輸出內容:

Transformation 1: Aggregating by device browser...

Device Aggregation Sample:
+-----------------+--------------+------------+----------+---------------+-------------+------------------+---------------+------------------+
|   device_browser|total_sessions|total_visits|total_hits|total_pageviews|total_bounces|total_time_on_site|unique_visitors|avg_time_per_visit|
+-----------------+--------------+------------+----------+---------------+-------------+------------------+---------------+------------------+
|           Chrome|          1900|        1900|     10896|           8956|          870|            354691|           1689|186.67947368421054|
|           Safari|           397|         397|      1260|           1137|          218|             43150|            364|108.69017632241814|
|          Firefox|           101|         101|       390|            343|           59|             10659|             95|105.53465346534654|
|Internet Explorer|            54|          54|       107|            102|           36|              5589|             50|             103.5|
|             Edge|            23|          23|        63|             55|           12|              2623|             23|114.04347826086956|
+-----------------+--------------+------------+----------+---------------+-------------+------------------+---------------+------------------+
only showing top 5 rows

6. 分析流量來源並計算收益

這段程式碼會依流量來源和媒介分析 Google Analytics 資料,計算工作階段數、交易次數、總收益、單次工作階段收益和不重複訪客等指標,然後重新命名資料欄,並依總收益排序結果。

print("Transformation 2: Analyzing traffic sources...")
traffic_source_agg = ga_df.groupBy("trafficSource.source", "trafficSource.medium") \
   .agg(
       count("*").alias("session_count"),
       sum(when(col("totals.transactions").isNotNull(), 1).otherwise(0)).alias("transactions"),
       sum("totals.totalTransactionRevenue").cast(DecimalType(20,2)).alias("total_revenue"),
       (sum("totals.totalTransactionRevenue")/count("*")).cast(DecimalType(10,2)).alias("revenue_per_session"),
       countDistinct("fullVisitorId").alias("unique_visitors")
   ) \
   .withColumnRenamed("source", "traffic_source") \
   .withColumnRenamed("medium", "traffic_medium") \
   .orderBy("total_revenue", ascending=False)

print("\nTraffic Source Aggregation Sample:")
traffic_source_agg.show(5)

預期的輸出內容:

Transformation 2: Analyzing traffic sources...

Traffic Source Aggregation Sample:
+--------------------+--------------+-------------+------------+-------------+-------------------+---------------+
|      traffic_source|traffic_medium|session_count|transactions|total_revenue|revenue_per_session|unique_visitors|
+--------------------+--------------+-------------+------------+-------------+-------------------+---------------+
|            (direct)|        (none)|         2166|          42|8872040000.00|         4096048.01|           1943|
|     mail.google.com|      referral|            2|           1|  17960000.00|         8980000.00|              2|
|       google.com.tw|      referral|            1|           0|         NULL|               NULL|              1|
|analytics.google.com|      referral|           57|           0|         NULL|               NULL|             53|
|           quora.com|      referral|            6|           0|         NULL|               NULL|              5|
+--------------------+--------------+-------------+------------+-------------+-------------------+---------------+
only showing top 5 rows

7. 地理區域分析

這段程式碼會依國家/地區分析 Google Analytics 資料,計算工作階段數、網頁瀏覽總次數、網站停留總時間、平均單次工作階段時間和不重複訪客,然後重新命名資料欄,並依工作階段數排序。

print("Transformation 3: Geographic analysis...")
geo_agg = ga_df.groupBy("geoNetwork.country", "geoNetwork.region") \
   .agg(
       count("*").alias("session_count"),
       sum("totals.pageviews").alias("total_pageviews"),
       sum("totals.timeOnSite").alias("total_time_on_site"),
       (sum("totals.timeOnSite")/count("*")).alias("avg_time_per_session"),
       countDistinct("fullVisitorId").alias("unique_visitors")
   ) \
   .withColumnRenamed("country", "country") \
   .withColumnRenamed("region", "region") \
   .orderBy("session_count", ascending=False)

print("\nGeographic Aggregation Sample:")
geo_agg.show(5)

預期的輸出內容:

Transformation 3: Geographic analysis...

Geographic Aggregation Sample:
+--------------+--------------------+-------------+---------------+------------------+--------------------+---------------+
|       country|              region|session_count|total_pageviews|total_time_on_site|avg_time_per_session|unique_visitors|
+--------------+--------------------+-------------+---------------+------------------+--------------------+---------------+
| United States|not available in ...|          564|           2326|             97829|  173.45567375886526|            494|
| United States|          California|          420|           3102|            116563|   277.5309523809524|            347|
| United States|            New York|          109|            845|             39976|   366.7522935779817|             84|
|United Kingdom|not available in ...|           82|            161|              7791|   95.01219512195122|             79|
|         India|not available in ...|           62|            139|              2869|  46.274193548387096|             61|
+--------------+--------------------+-------------+---------------+------------------+--------------------+---------------+
only showing top 5 rows

8. 時間型分析

這段程式碼會從 visitStartTime 欄位擷取小時,然後依小時分組資料,計算每個小時的工作階段數、交易次數、總收益和總網頁瀏覽量,最後依小時排序結果,藉此執行時間分析。

print("Transformation 4: Time-based analysis...")
hourly_agg = ga_df.withColumn("hour", date_format(col("visitStartTime").cast("timestamp"), "H")) \
   .groupBy("hour") \
   .agg(
       count("*").alias("session_count"),
       sum("totals.transactions").alias("transactions"),
       sum("totals.totalTransactionRevenue").cast(DecimalType(20,2)).alias("total_revenue"),
       sum("totals.pageviews").alias("total_pageviews")
   ) \
   .orderBy("hour")

print("\nHourly Aggregation Sample:")
hourly_agg.show(5)

預期輸出內容:

Transformation 4: Time-based analysis...

Hourly Aggregation Sample:
+----+-------------+------------+-------------+---------------+
|hour|session_count|transactions|total_revenue|total_pageviews|
+----+-------------+------------+-------------+---------------+
|   0|           87|        NULL|         NULL|            372|
|   1|          102|        NULL|         NULL|            494|
|  10|           67|        NULL|         NULL|            149|
|  11|           73|        NULL|         NULL|            167|
|  12|           99|        NULL|         NULL|            313|
+----+-------------+------------+-------------+---------------+
only showing top 5 rows

9. 將結果寫入 BigQuery 資料表

這段程式碼會使用直接寫入方法,將四個匯總資料架構 (device_aggtraffic_source_agggeo_agghourly_agg) 匯出至 Google BigQuery 中的個別資料表,並覆寫現有資料表 (如有)。

# Write to BigQuery tables
print("\nLoading data to BigQuery...")

# Set output tables
device_output_table = f"{project_id}.analytics_sample.device_aggregation"
traffic_output_table = f"{project_id}.analytics_sample.traffic_source_aggregation"
geo_output_table = f"{project_id}.analytics_sample.geo_aggregation"
hourly_output_table = f"{project_id}.analytics_sample.hourly_aggregation"

dataset_id = "demo"  # Replace with your BigQuery dataset ID 

# Set BigQuery output table
device_output_table = f"{project_id}.{dataset_id}.device_aggregation"
traffic_output_table = f"{project_id}.{dataset_id}.traffic_source_aggregation"
geo_output_table = f"{project_id}.{dataset_id}.geo_aggregation"
hourly_output_table = f"{project_id}.{dataset_id}.hourly_aggregation"

# Write each DataFrame to BigQuery
device_agg.write \
   .format("bigquery") \
   .option("table", device_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

traffic_source_agg.write \
   .format("bigquery") \
   .option("table", traffic_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

geo_agg.write \
   .format("bigquery") \
   .option("table", geo_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

hourly_agg.write \
   .format("bigquery") \
   .option("table", hourly_output_table) \
   .option("writeMethod", "direct") \
   .mode("overwrite") \
   .save()

在 BigQuery 中檢查輸出資料表,確認資料在經過轉換後已成功儲存

6227a951f1d1d4e6.png

device_aggregation 表格

da7791f5d15f67be.png

geo_aggregation 表格

76ec9d9dc9929c1c.png

hourly_aggregation 資料表

bcf627ef6a13c8cc.png

traffic_source_aggregation

f373d64ace332074.png

10. 協調 BigQuery Studio 筆記本程式碼 (選用)

您可以透過下列方式協調 BigQuery Studio 筆記本程式碼:

在本程式碼研究室中,我們將使用 Google Cloud 控制台排定筆記本程式碼。

  1. 在「Notebook」(筆記本) 工具列中,按一下「Schedule」(排程)
  2. 在「排定筆記本」窗格的「排程名稱」欄位中,輸入排程名稱。
  3. 在「Authentication」(驗證) 部分,使用 Google 帳戶使用者憑證或服務帳戶授權筆記本。
  4. 如要使用 Google 帳戶使用者憑證 ( 預覽),請選取「以我的使用者憑證執行」
  5. 如要使用服務帳戶,請選取「以所選服務帳戶執行」,然後選取服務帳戶。
  6. 在「筆記本選項」部分,於「執行階段範本」欄位中,選取 Colab 筆記本執行階段範本或預設執行階段規格。如要進一步瞭解如何建立 Colab 筆記本執行階段範本,請參閱「建立執行階段範本」一文。
  7. 在「Cloud Storage bucket」(Cloud Storage 值區) 欄位中,按一下「Browse」(瀏覽),然後選取或建立 Cloud Storage 值區。您必須在所選值區中,將「Storage 管理員 (roles/storage.admin)」IAM 角色授予所選服務帳戶。詳情請參閱「啟用筆記本排程」。
  8. 在「排程頻率」部分執行下列操作:
  9. 在「Repeats」(重複頻率) 選單中,選取排定筆記本執行的頻率。
  10. 在「At time」(時間) 欄位中,輸入排定筆記本執行的時間。
  11. 在「時區」選單中,選取時間表的時區。
  12. 按一下「建立時間表」。如果驗證方法選取「使用我的使用者憑證執行」,您必須授權給您的 Google 帳戶 ( 預覽)。

f3d87c9a3408b08d.png

11. 清除

如要避免系統向您的 Google Cloud 帳戶收取本程式碼研究室所用資源的費用,請按照下列步驟操作:

  1. 前往 Google Cloud 控制台的「管理資源」頁面。
  2. 在專案清單中選取要刪除的專案,然後點按「刪除」。
  3. 在對話方塊中輸入專案 ID,然後按一下「Shut down」(關閉) 即可刪除專案。

12. 恭喜

您已在 BigQuery Studio 筆記本中,運用 Apache Spark 無伺服器功能完成資料轉換和分析。在過程中,您探索了 Google Cloud 的公開資料集,並在 BigQuery Studio Notebook 中使用 Apache Spark 執行無伺服器 ETL,以及協調 BigQuery Studio Notebook。太棒了!

後續步驟:

  • 使用服務帳戶協調管理筆記本,以利自動化作業。
  • 執行 ETL 工作時,新增指令碼來監控時間長度。
  • 部署筆記本
  • 使用 DataprocSparkSession,在 BigQuery Studio Notebook 中透過 Serverless Dataproc 運用真正的分散式 Apache Spark 功能。
  • 在 BigQuery Studio 中建立 Apache Spark 適用的預存程序。這樣一來,您就能套用 OOP 原則來建構 PySpark 程式碼,進而提升程式碼的組織性、重複使用率和可維護性。

參考資料