Build a unified data lakehouse for AI with Apache Iceberg and BigLake

1. Introduction

In this Codelab, you explore Google Cloud's unified data lakehouse capabilities. You interact with a public dataset served through BigLake's Apache Iceberg REST catalog, and apply Google Cloud's AI capabilities to both structured and unstructured data.

You use Apache Iceberg to query the classic NYC taxi dataset, audit data changes with time travel, and then run AI models on the data using BigQuery ML and Gemini.

What you'll do

Use Google Cloud Serverless for Apache Spark to query an Apache Iceberg public dataset hosted on BigLake.
Query structured data in Apache Iceberg format.
Demonstrate time travel in Apache Iceberg.
Train a predictive model on structured data using BigQuery ML.
Create a BigLake object table (unstructured data) and analyze images using Gemini.

What you'll need

A web browser such as Chrome.
A Google Cloud project with billing enabled.

Estimated cost and duration

Duration: about 45 minutes
Estimated cost: less than $2. The Codelab uses public datasets and serverless queries to keep costs low.

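2. Setup and requirements

In this step, you prepare your environment and enable the required APIs.

Start Cloud Shell

You run most of the commands in Google Cloud Shell.

Click Activate Cloud Shell at the top of the Google Cloud console.
Verify authentication.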
```
gcloud auth list
```
Confirm your project.
```
gcloud config get project
```
If the project is not set, set it using your project ID.
```
gcloud config set project <YOUR_PROJECT_ID>
```

Enable APIs

Run the following command to enable the required APIs for BigQuery, Cloud Resource Manager, and Vertex AI.

gcloud services enable \
  bigquery.googleapis.com \
  aiplatform.googleapis.com \
  cloudresourcemanager.googleapis.com

Configure the environment and create a dependencies bucket

Set the environment variables in your terminal.

export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export DEPS_BUCKET=$PROJECT_ID-deps-bucket

Create the dependencies Cloud Storage bucket. Your PySpark scripts are uploaded here when you submit a job.
```
gcloud storage buckets create gs://$DEPS_BUCKET --location=$REGION
```

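3. Connect to the Apache Iceberg public catalog

In this step, you connect to a live, production-grade Apache Iceberg catalog hosted on BigLake in Google Cloud.

Run Spark SQL with the Serverless for Apache Spark batch CLI

To run PySpark jobs without managing infrastructure, you use Google Cloud Serverless for Apache Spark, configured to point to the public BigLake REST catalog.

To avoid repeating them in every command, define the BigLake REST catalog properties once in an environment variable. This configuration tells Spark to do the following:

Use the iceberg-spark-runtime and iceberg-gcp-bundle libraries.
Configure a catalog named my_catalog that uses the BigLake REST catalog endpoint.
Read data files from Google Cloud Storage (GCS) instead of the default local file system.
Set my_catalog as the default catalog for the session.
Use vended credentials for improved security and simpler data access.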

export METASTORE_PROPERTIES="^|^spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0|\
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog|\
spark.sql.catalog.my_catalog.type=rest|\
spark.sql.catalog.my_catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog|\
spark.sql.catalog.my_catalog.warehouse=gs://biglake-public-nyc-taxi-iceberg|\
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO|\
spark.sql.catalog.my_catalog.header.x-goog-user-project=$PROJECT_ID|\
spark.sql.catalog.my_catalog.header.X-Iceberg-Access-Delegation=vended-credentials|\
spark.sql.catalog.my_catalog.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager|\
spark.sql.defaultCatalog=my_catalog|\
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions|\
spark.log.level=ERROR"
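
The leading ^|^ in the value above is gcloud's alternate-delimiter syntax: it makes gcloud split the property list on | instead of on commas, which matters because spark.jars.packages itself contains a comma. As a minimal sketch of how such a value is put together, the following Python builds the same kind of string from a dictionary. The helper name build_properties is hypothetical, and only a subset of the properties is shown.

```python
# Sketch: assemble a gcloud --properties value with a custom delimiter.
# The "^|^" prefix tells gcloud to split entries on "|" instead of ",".
# build_properties is a hypothetical helper, not part of any Google library.

def build_properties(props: dict, delimiter: str = "|") -> str:
    # Refuse keys or values that contain the delimiter, since that would
    # make gcloud split one entry into two.
    for key, value in props.items():
        if delimiter in key or delimiter in str(value):
            raise ValueError(f"delimiter {delimiter!r} appears in {key!r}")
    body = delimiter.join(f"{key}={value}" for key, value in props.items())
    return f"^{delimiter}^{body}"


catalog = "spark.sql.catalog.my_catalog"
example_props = {
    # This value contains a comma, which is why "," cannot be the delimiter.
    "spark.jars.packages": (
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,"
        "org.apache.iceberg:iceberg-gcp-bundle:1.10.0"
    ),
    catalog: "org.apache.iceberg.spark.SparkCatalog",
    f"{catalog}.type": "rest",
    "spark.sql.defaultCatalog": "my_catalog",
}

properties_value = build_properties(example_props)
print(properties_value)
```

Keeping the properties in a dictionary makes it easier to see which key each value belongs to before everything is flattened into one string.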

Create a simple test query file.

cat <<EOF > test.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SHOW TABLES IN public_data").show()
EOF

Submit the batch job.

gcloud dataproc batches submit pyspark \
  --project=$PROJECT_ID \
  --region=$REGION \
  --version=2.3 \
  --properties="$METASTORE_PROPERTIES" \
  --deps-bucket=gs://$DEPS_BUCKET \
  test.py

Wait a few minutes for the batch job to complete. When it finishes, you should see output similar to the following.

+-----------+----------------+-----------+
|  namespace|       tableName|isTemporary|
+-----------+----------------+-----------+
|public_data|     nyc_taxicab|      false|
|public_data|nyc_taxicab_2021|      false|
+-----------+----------------+-----------+
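
The same submit command recurs for every script in this Codelab, with only the script name changing. As a rough sketch, the command could be assembled in Python as an argument list. The helper name build_submit_command and the sample project and bucket values are hypothetical; the flags mirror the gcloud command shown earlier in this step.

```python
import shlex

# Sketch: build the argument list for a Serverless for Apache Spark batch
# submission. build_submit_command is a hypothetical helper; the flags are
# the ones used by the gcloud command in this Codelab.

def build_submit_command(script, project_id, region, deps_bucket,
                         properties, version="2.3"):
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        f"--project={project_id}",
        f"--region={region}",
        f"--version={version}",
        f"--properties={properties}",
        f"--deps-bucket=gs://{deps_bucket}",
        script,
    ]


cmd = build_submit_command(
    script="test.py",
    project_id="my-project",               # placeholder project ID
    region="us-central1",
    deps_bucket="my-project-deps-bucket",  # placeholder bucket name
    properties="^|^spark.log.level=ERROR",
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it without going through a shell, so the | characters in the properties value need no extra quoting.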

4. Query structured Iceberg data

Once connected, you can query the dataset with full SQL. You query the NYC taxi dataset, which is modeled as Iceberg tables.

Run a standard aggregation query

Create a file named query.py.

cat <<EOF > query.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = """
SELECT
  passenger_count,
  COUNT(1) AS num_trips,
  ROUND(AVG(total_amount), 2) AS avg_fare,
  ROUND(AVG(trip_distance), 2) AS avg_distance
FROM public_data.nyc_taxicab
WHERE data_file_year = 2021 AND passenger_count > 0
GROUP BY passenger_count
ORDER BY num_trips DESC
"""

spark.sql(query).show()
EOF

Submit it using Serverless for Apache Spark.

gcloud dataproc batches submit pyspark \
  --project=$PROJECT_ID \
  --region=$REGION \
  --version=2.3 \
  --properties="$METASTORE_PROPERTIES" \
  --deps-bucket=gs://$DEPS_BUCKET \
  query.py

Wait a few minutes for the batch job to complete.

When it finishes, you should see output similar to the following.

+---------------+---------+--------+------------+
|passenger_count|num_trips|avg_fare|avg_distance|
+---------------+---------+--------+------------+
|              1| 21508009|   18.82|        3.03|
|              2|  4424746|   20.22|        3.40|
|              3|  1164846|   19.84|        3.27|
|              5|   718282|   18.88|        3.07|
|              4|   466485|   20.61|        3.44|
|              6|   452467|   18.97|        3.11|
|              7|       78|   65.24|        3.71|
|              8|       49|   57.39|        5.88|
|              9|       35|   73.26|        6.20|
|             96|        1|   17.00|        2.00|
|            112|        1|   15.00|        2.00|
+---------------+---------+--------+------------+

Why use Apache Iceberg here?

Partition pruning: The query filters on data_file_year = 2021. With Iceberg, the engine can skip scanning data from other years entirely.
Engine agility: You can run the same query from Spark, Trino, or BigQuery without copying the data.
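
To make partition pruning concrete, here is a toy Python model of the idea: the engine consults per-file partition values in the table metadata and opens only the files that can match the filter. The file names and row counts are invented for illustration, and real Iceberg planning works on manifest files with richer statistics than this sketch.

```python
from dataclasses import dataclass

# Toy model of partition pruning. The DataFile records below are made up;
# they stand in for the per-file partition values kept in table metadata.

@dataclass(frozen=True)
class DataFile:
    path: str            # hypothetical file name
    data_file_year: int  # partition value recorded in metadata
    row_count: int


def prune(files, year):
    """Keep only the files whose partition value can match the filter."""
    return [f for f in files if f.data_file_year == year]


manifest = [
    DataFile("year=2019/part-0.parquet", 2019, 1000),
    DataFile("year=2020/part-0.parquet", 2020, 800),
    DataFile("year=2021/part-0.parquet", 2021, 600),
    DataFile("year=2021/part-1.parquet", 2021, 400),
]

to_scan = prune(manifest, 2021)
total_rows = sum(f.row_count for f in manifest)
scanned_rows = sum(f.row_count for f in to_scan)
skipped_rows = total_rows - scanned_rows

print(f"files to scan: {len(to_scan)}, rows skipped: {skipped_rows}")
```

The decision is made from metadata alone, so the skipped files are never read from storage.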

5. Time travel in Apache Iceberg

One of Iceberg's most powerful features is time travel, which lets you query the data as it existed in an earlier version, or snapshot.

View the table history

Create a file named history.py.

cat <<EOF > history.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT * FROM public_data.nyc_taxicab.history").show()
EOF

Submit it.

gcloud dataproc batches submit pyspark \
  --project=$PROJECT_ID \
  --region=$REGION \
  --version=2.3 \
  --properties="$METASTORE_PROPERTIES" \
  --deps-bucket=gs://$DEPS_BUCKET \
  history.py

You should see output similar to the following in the console.

+--------------------+-------------------+-------------------+-------------------+
|     made_current_at|        snapshot_id|          parent_id|is_current_ancestor|
+--------------------+-------------------+-------------------+-------------------+
|2026-01-07 21:32:...|6333415779680505547|               NULL|               true|
|2026-01-07 21:34:...|1840345522877675925|6333415779680505547|               true|
|2026-01-07 21:36:...|7203554539964460256|1840345522877675925|               true|
|2026-01-07 21:38:...|4573466015237516024|7203554539964460256|               true|
|2026-01-07 21:40:...|3353190952148867790|4573466015237516024|               true|
|2026-01-07 21:42:...|1335547378580631681|3353190952148867790|               true|
|2026-01-07 21:44:...|8203141258229894239|1335547378580631681|               true|
|2026-01-07 21:46:...|1597048231706307813|8203141258229894239|               true|
|2026-01-07 21:48:...|6247811509231462655|1597048231706307813|               true|
|2026-01-07 21:50:...|2527184310045633322|6247811509231462655|               true|
|2026-01-07 21:52:...|2512764101237223642|2527184310045633322|               true|
|2026-01-07 21:52:...|7045957533358062548|2512764101237223642|               true|
|2026-01-07 21:53:...| 531753237516076726|7045957533358062548|               true|
|2026-01-07 21:53:...|4184653573199718274| 531753237516076726|               true|
|2026-01-07 21:54:...|5125223829492177301|4184653573199718274|               true|
|2026-01-07 21:54:...|6844673237417600305|5125223829492177301|               true|
|2026-01-07 21:54:...|6634828203344518093|6844673237417600305|               true|
|2026-01-07 21:55:...|7637728273407236194|6634828203344518093|               true|
|2026-01-07 21:55:...|3424071684958740192|7637728273407236194|               true|
|2026-01-07 21:55:...|1743746294196424254|3424071684958740192|               true|
+--------------------+-------------------+-------------------+-------------------+

Each row represents a snapshot, showing its ID, its parent snapshot, and the time it became the current version of the table.
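
The parent_id column links each snapshot to the one it was built on, so the history forms a chain back to the first commit. As a small illustration, the following Python walks that chain for the first four rows of the sample output. The function name ancestors is hypothetical, and the IDs in your own output may differ.

```python
# Sketch: follow parent_id links to list a snapshot's ancestors.
# The (snapshot_id, parent_id) pairs are copied from the first four rows
# of the sample history output.

history = [
    (6333415779680505547, None),
    (1840345522877675925, 6333415779680505547),
    (7203554539964460256, 1840345522877675925),
    (4573466015237516024, 7203554539964460256),
]


def ancestors(history, snapshot_id):
    """Return the chain from snapshot_id back to the root snapshot."""
    parent_of = dict(history)
    chain = []
    current = snapshot_id
    while current is not None:
        chain.append(current)
        current = parent_of[current]
    return chain


chain = ancestors(history, 4573466015237516024)
for snapshot in chain:
    print(snapshot)
```

Any ID in this chain can be used as a time travel target, because each one identifies a complete, immutable view of the table.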

Compare the current row count with a past row count

Create a file named timetravel.py.

cat <<EOF > timetravel.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = """
SELECT 'Current State' AS version, COUNT(*) AS count FROM public_data.nyc_taxicab
UNION ALL
SELECT 'Past State' AS version, COUNT(*) AS count FROM public_data.nyc_taxicab VERSION AS OF 4573466015237516024
"""

spark.sql(query).show()
EOF

Submit it.

gcloud dataproc batches submit pyspark \
  --project=$PROJECT_ID \
  --region=$REGION \
  --version=2.3 \
  --properties="$METASTORE_PROPERTIES" \
  --deps-bucket=gs://$DEPS_BUCKET \
  timetravel.py

You should see output similar to the following in the console.

+-------------+----------+
|      version|     count|
+-------------+----------+
|Current State|1293069366|
|   Past State|  72878594|
+-------------+----------+

This lets you audit how the data changed over time.
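
Besides VERSION AS OF with a snapshot ID, Iceberg's Spark SQL integration also accepts TIMESTAMP AS OF with a timestamp string. The following sketch builds either form of the count query as text and computes the row difference from the sample output. The helper name count_as_of is hypothetical, and the timestamp is only an example value.

```python
# Sketch: build a time travel count query for Spark SQL with Iceberg.
# count_as_of is a hypothetical helper that only produces SQL text.

def count_as_of(table, snapshot_id=None, timestamp=None):
    if snapshot_id is not None and timestamp is not None:
        raise ValueError("pass snapshot_id or timestamp, not both")
    if snapshot_id is not None:
        clause = f" VERSION AS OF {int(snapshot_id)}"
    elif timestamp is not None:
        clause = f" TIMESTAMP AS OF '{timestamp}'"
    else:
        clause = ""  # no clause means the current state of the table
    return f"SELECT COUNT(*) AS count FROM {table}{clause}"


table = "public_data.nyc_taxicab"
current_sql = count_as_of(table)
by_snapshot_sql = count_as_of(table, snapshot_id=4573466015237516024)
by_time_sql = count_as_of(table, timestamp="2026-01-07 21:40:00")

print(current_sql)
print(by_snapshot_sql)
print(by_time_sql)

# Row difference between the two states in the sample output.
rows_added = 1293069366 - 72878594
print(rows_added)
```

Each string could be passed to spark.sql(...) in a script like timetravel.py.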

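6. Structured AI with BigQuery ML

Now that you've explored the Iceberg data, try BigQuery's AI capabilities. The public Iceberg catalog is read-only, so you use BigQuery to read from the public table and train a model in your own workspace.

Create a local dataset

First, use the bq CLI to create a dataset in your project to store the AI model.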

bq mk --location=$REGION --project_id=$PROJECT_ID iceberg_ai

Train a linear regression model

Now train a linear regression model using the public BigLake Iceberg table.

Create a query file and train the model using bq query.

cat <<'EOF' > train_model.sql
CREATE OR REPLACE MODEL `iceberg_ai.predict_fare`
OPTIONS(model_type='LINEAR_REG', input_label_cols=['fare_amount']) AS
SELECT fare_amount, passenger_count, CAST(trip_distance AS FLOAT64) AS trip_distance
FROM `bigquery-public-data`.`biglake-public-nyc-taxi-iceberg`.public_data.nyc_taxicab
WHERE fare_amount > 0 AND trip_distance > 0 AND RAND() < 0.01; -- Using 1% of data to downsample
EOF

bq query --location=$REGION --use_legacy_sql=false < train_model.sql
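
The RAND() < 0.01 predicate keeps each row independently with probability 0.01, so the training set is roughly, not exactly, 1% of the table and differs from run to run. A minimal Python sketch of the same sampling idea follows. The function name downsample is hypothetical, and the seed is there only to make the sketch repeatable, which RAND() in the query is not.

```python
import random

# Sketch: Bernoulli downsampling, the idea behind "RAND() < 0.01".
# Each row is kept independently with the given probability.

def downsample(rows, fraction, seed=None):
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = range(100_000)  # stand-in for table rows
sample = downsample(rows, 0.01, seed=42)

print(f"kept {len(sample)} of {len(rows)} rows")
```

Because the sample size varies, two training runs on the same table can produce slightly different model weights.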

Predict with the model

Now that the model is trained, you can use ML.PREDICT to predict the fare for a new trip.

Create a query file and run the prediction using bq query.

cat <<'EOF' > predict_fare.sql
SELECT
  predicted_fare_amount, passenger_count, trip_distance
FROM
  ML.PREDICT(MODEL `iceberg_ai.predict_fare`,
    (
    SELECT 2 AS passenger_count, 5.0 AS trip_distance
    )
  );
EOF

bq query --location=$REGION --use_legacy_sql=false < predict_fare.sql

You should see output similar to the following.

+-----------------------+-----------------+---------------+
| predicted_fare_amount | passenger_count | trip_distance |
+-----------------------+-----------------+---------------+
|     14.12252095150709 |               2 |           5.0 |
+-----------------------+-----------------+---------------+
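
Conceptually, a linear regression prediction is an intercept plus a weighted sum of the input features. The sketch below shows that arithmetic in Python with made-up coefficients. The intercept and weights are hypothetical and are not the values learned by iceberg_ai.predict_fare, so the result is not expected to match the sample output.

```python
# Sketch: the arithmetic behind a linear regression prediction.
# INTERCEPT and WEIGHTS are hypothetical values chosen for illustration;
# the real coefficients come from the trained model.

INTERCEPT = 4.0
WEIGHTS = {"passenger_count": 0.1, "trip_distance": 2.0}


def predict_fare(features):
    """Return intercept + sum(weight * feature value)."""
    return INTERCEPT + sum(
        WEIGHTS[name] * value for name, value in features.items()
    )


prediction = predict_fare({"passenger_count": 2, "trip_distance": 5.0})
print(round(prediction, 2))
```

To inspect the coefficients the model actually learned, BigQuery ML provides the ML.WEIGHTS function.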

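7. Unstructured AI with BigLake

Data isn't only rows and columns. A unified data lakehouse also handles unstructured data such as images and PDFs. In this step, you query unstructured data using object tables and object references.

An object table is a read-only external table in BigQuery that lists the objects in a Cloud Storage path. Each row represents a file, with metadata columns such as uri and size, and a special ref column that contains an ObjectRef.

An object reference (ObjectRef) points to the actual data of a single file. Newer BigQuery ML functions (for example, AI.GENERATE or AI.AGG) use the ObjectRef to read file content (images, audio, or text) for analysis without loading the bytes into a standard table.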
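
As a simplified mental model only, an object table row can be pictured as metadata plus a pointer, with no file bytes stored in the table itself. The Python classes below are hypothetical and are not a BigQuery API. The authorizer field (the connection used to read the object) is an assumption about what the reference carries, and the size value is invented.

```python
from dataclasses import dataclass
from typing import Optional

# Conceptual sketch only: these classes are hypothetical and do not
# correspond to any BigQuery client library type.

@dataclass(frozen=True)
class ObjectRef:
    uri: str                          # where the file lives in Cloud Storage
    authorizer: Optional[str] = None  # assumed: connection used to read it


@dataclass(frozen=True)
class ObjectTableRow:
    uri: str        # metadata column
    size: int       # metadata column, in bytes
    ref: ObjectRef  # pointer to the file content, not the content itself


def make_row(uri, size, connection):
    return ObjectTableRow(uri=uri, size=size, ref=ObjectRef(uri, connection))


row = make_row(
    "gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg",
    size=123456,            # invented size for illustration
    connection="us.my-conn",
)
print(row.ref)
```

The point of the model is that a function receiving row.ref has enough information to fetch the file on demand, which is why the bytes never need to be loaded into a table.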

Create a dataset for unstructured AI

First, use the bq CLI to create a second dataset in your project, in the US multi-region, to store the object table.

bq mk --location=US --project_id=$PROJECT_ID iceberg_object_ai

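Create an external connection

To let BigQuery query data stored in Cloud Storage (both object tables and unstructured data), you need to create an external connection.

In Cloud Shell, run the following to create a Cloud resource connection.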

bq mk --connection --project_id=$PROJECT_ID --location=US --connection_type=CLOUD_RESOURCE my-conn

Find the service account ID that was generated for the connection.

CONNECTION_SA=$(bq show --format=json --project_id=$PROJECT_ID --connection $PROJECT_ID.us.my-conn | jq -r '.serviceAccountId // .cloudResource.serviceAccountId')
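
The jq expression uses the // operator as a fallback: it takes the top-level serviceAccountId if present and otherwise looks under cloudResource. A rough Python equivalent is sketched below. The function name and the sample JSON are hypothetical, and note that Python's or also skips empty strings, which is slightly broader than jq's null-or-false rule.

```python
import json

# Sketch: Python equivalent of
#   jq -r '.serviceAccountId // .cloudResource.serviceAccountId'
# The sample JSON below is invented for illustration.

def service_account_id(connection_json):
    info = json.loads(connection_json)
    nested = info.get("cloudResource") or {}
    return info.get("serviceAccountId") or nested.get("serviceAccountId")


sample = json.dumps({
    "name": "projects/123/locations/us/connections/my-conn",
    "cloudResource": {
        "serviceAccountId": "bqcx-123-abcd@example.iam.gserviceaccount.com"
    },
})

connection_sa = service_account_id(sample)
print(connection_sa)
```

Checking both locations keeps the command working regardless of which of the two shapes the bq output uses.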

Grant the service account the Vertex AI User and Storage Object Viewer roles so that it can call Gemini models and read GCS data.

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$CONNECTION_SA" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$CONNECTION_SA" \
  --role="roles/storage.objectViewer"

Create the object table

Use the external connection my-conn that you created in the previous section to access the unstructured data. Create a query file and use bq query to create the object table.

cat <<'EOF' > create_object_table.sql
CREATE EXTERNAL TABLE `iceberg_object_ai.sample_images`
WITH CONNECTION `us.my-conn`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://cloud-samples-data/vision/landmark/*']
);
EOF

bq query --use_legacy_sql=false < create_object_table.sql

Use Gemini on object data

Now run a query that uses Gemini to evaluate the images without downloading them.

Query the images with standard SQL through bq query.

cat <<EOF > query_images.sql
SELECT
  uri,
  image_analysis.description
FROM (
  SELECT
    uri,
    AI.GENERATE(
      (
        'Identify what is happening in the image.',
        ref
      ),
      connection_id => 'us.my-conn',
      endpoint => 'gemini-2.5-flash-lite',
      output_schema => 'event STRING, severity STRING, description STRING'
    ) AS image_analysis
  FROM
    iceberg_object_ai.sample_images
);
EOF

bq query --use_legacy_sql=false < query_images.sql

Sample output:

+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                           uri                            |                                                                                                                                                                                                                             description                                                                                                                                                                                                                             |
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg | The Eiffel Tower stands tall against a cloudy sky, overlooking the Seine River in Paris. Boats are docked along the riverbank, and trees line the opposite shore, with bridges and buildings visible in the distance.                                                                                                                                                                                                                                               |
| gs://cloud-samples-data/vision/landmark/pofa.jpg         | A wide shot shows the Palace of Fine Arts, a monumental structure in San Francisco, California. The building features a large rotunda with a dome, surrounded by colonnades. In front of the rotunda is a lagoon. Several people are walking around the grounds. The sky is blue with a few scattered clouds.                                                                                                                                                       |
| gs://cloud-samples-data/vision/landmark/st_basils.jpeg   | A monument stands in front of Saint Basil's Cathedral in Moscow under a bright blue sky with scattered white clouds. The cathedral features distinctive onion domes in various colors and patterns, including red, blue and white stripes, green and beige stripes, and red and blue diamonds. A large green tree partially obscures the left side of the cathedral. People are visible in the foreground near the base of the monument and the cathedral entrance. |
+----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

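Explore ObjectRef directly: sentiment analysis

Object tables manage file references automatically, but you can also use BigQuery object references to interact with these objects directly and run ad hoc analysis on a single file.

For example, you can store a small text file in your own GCS bucket, using the $DEPS_BUCKET variable you created earlier, and analyze it using OBJ.MAKE_REF with bq query.

First, create a small text file and upload it to your bucket.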

cat <<'EOF' > review.txt
This product is fantastic! It exceeded my expectations. The quality is top-notch. I highly recommend it to everyone!
EOF

gcloud storage cp review.txt gs://${DEPS_BUCKET}/review.txt

Now query the file using OBJ.MAKE_REF within standard SQL.

cat <<EOF > sentiment_analysis.sql
SELECT
  AI.GENERATE(
    (
      'Analyze the sentiment of this text file. Is it positive, negative, or neutral? Explain why.',
      OBJ.MAKE_REF('gs://${DEPS_BUCKET}/review.txt', 'us.my-conn')
    ),
    connection_id => 'us.my-conn',
    endpoint => 'gemini-2.5-flash-lite'
  ).result AS ml_generate_text_result;
EOF

bq query --use_legacy_sql=false < sentiment_analysis.sql
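
Note that this heredoc uses an unquoted EOF, so the shell substitutes ${DEPS_BUCKET} before the file is written, whereas the earlier heredocs with a quoted 'EOF' are written verbatim. The following Python sketch reproduces that substitution with string.Template, which uses the same ${NAME} placeholder syntax. The bucket name is a placeholder and the prompt text is shortened.

```python
from string import Template

# Sketch: mimic the shell's ${DEPS_BUCKET} expansion in an unquoted heredoc.
# The bucket name passed to substitute() is a placeholder.

sql_template = Template(
    "SELECT\n"
    "  AI.GENERATE(\n"
    "    (\n"
    "      'Analyze the sentiment of this text file.',\n"
    "      OBJ.MAKE_REF('gs://${DEPS_BUCKET}/review.txt', 'us.my-conn')\n"
    "    ),\n"
    "    connection_id => 'us.my-conn',\n"
    "    endpoint => 'gemini-2.5-flash-lite'\n"
    "  ).result AS ml_generate_text_result;\n"
)

rendered = sql_template.substitute(DEPS_BUCKET="my-project-deps-bucket")
print(rendered)
```

If DEPS_BUCKET is unset in your shell, the heredoc writes gs:///review.txt, so it is worth checking the generated sentiment_analysis.sql before running it.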