Turn Dark Data into Structured Gold

1. Overview

We all know the pain of "dark data." It's the PDFs, images, and text files sitting in cloud storage buckets, completely invisible to your SQL queries and BI dashboards. Traditionally, unlocking this data required complex OCR pipelines, manual data entry, or fragile custom scripts.

Not anymore.

In this lab, I am going to show you how to convert 400 unstructured PDF files — spanning text, tables, and images — into cleanly structured BigQuery tables with relationships automatically inferred between them. And we are going to do it in minutes using BigQuery Knowledge Catalog and Dataplex.

What you'll build

To make this real, let's look at a fictional business: a fast-growing Frozen Yogurt franchise.

Imagine you manage the data for this Froyo business. You have hundreds of recipes and supplier spec sheets, all saved as PDFs. The business leaders want to launch an AI agent to help store managers and customers query product details.

Here is the nightmare scenario: A customer asks, "I'm really interested in your Midnight Swirl froyo. Are there any allergens in it?"

To answer this, your system would normally have to:

  1. Find the "Midnight Swirl" recipe PDF.
  2. Read the ingredients (e.g., "Cocoa Powder", "Dairy Base", "Emulsifier X").
  3. Search through dozens of Supplier PDFs to find the spec sheets for those specific ingredients.
  4. Check the supplier sheets for hidden allergens tied to those ingredients.

Trying to build an AI agent that does this on the fly by reading 400 raw PDFs at runtime is slow, expensive, and prone to hallucination. Instead, we are going to use semantic inference to extract all of this into a relational database first, making our future AI agent lightning-fast and 100% grounded in factual SQL data.

Let's start building!

2381f1453211387d.png

What you'll learn

  • How to set up Cloud Storage bucket for the source files (PDFs)
  • How to set up and run Datascan job and semantic inference in Knowledge Catalog to extract data from source pdfs and semantically infer the connections and context and store it in BigQuery
  • How to use BigQuery Agents to chat with the newly created dataset

Requirements

  • A browser, such as Chrome or Firefox.
  • A Google Cloud project with billing enabled.
  • Basic familiarity with SQL and Java.

2. Before you begin

Create a project

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
  1. You'll use Cloud Shell, a command-line environment running in Google Cloud. Click Activate Cloud Shell at the top of the Google Cloud console.

Activate Cloud Shell button image

  1. Once connected to Cloud Shell, you check that you're already authenticated and that the project is set to your project ID using the following command:
gcloud auth list
  1. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project.
gcloud config list project
  1. If you want to authenticate
gcloud auth login
  1. If your project is not set, use the following command to set it:
export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project <YOUR_PROJECT_ID>
  1. Enable the required APIs: Run this command to enable all the required APIs:
gcloud services enable \
    dataplex.googleapis.com \
    datacatalog.googleapis.com \
    discoveryengine.googleapis.com \
    bigqueryconnection.googleapis.com \
    bigquery.googleapis.com \
    aiplatform.googleapis.com \
    cloudresourcemanager.googleapis.com \
    serviceusage.googleapis.com \
    storage.googleapis.com

Gotchas & Troubleshooting

The "Ghost Project" Syndrome

You ran gcloud config set project, but you're actually looking at a different project in the Console UI. Check the project ID in the top-left dropdown!

The Billing Barricade

You enabled the project, but forgot the billing account. AlloyDB is a high-performance engine; it won't start if the "gas tank" (billing) is empty.

API Propagation Lag

You clicked "Enable APIs," but the command line still says Service Not Enabled. Give it 60 seconds. The cloud needs a moment to wake up its neurons.

Quota Quags

If you're using a brand-new trial account, you might hit a regional quota for AlloyDB instances. If us-central1 fails, try us-east1.

"Hidden" Service Agent

Sometimes the AlloyDB Service Agent isn't automatically granted the aiplatform.user role. If your SQL queries can't talk to Gemini later, this is usually the culprit.

3. Google Cloud Storage Bucket Setup

In this section, you create an organizational structure within BigQuery to store Froyo recipe and supplier data, specifically for Froyo product details. It also establishes a Cloud Resource Connection, which acts as a secure "bridge" allowing BigQuery to read files from external sources like Cloud Storage.

Before you begin:

This repository contains recipes, suppliers PDF files we'll use in this project. Make sure you download these files. To download the files, do the following.

In the Cloud Shell, run the following command:

git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/next-26-keynotes.git

Navigate into the newly created folder:

cd next-26-keynotes

Pull the data-cloud-demo folder

git sparse-checkout set genkey/data-cloud-demo

After the checkout is complete, navigate to the data-cloud-demo folder and extract the ZIP files to access the codelab assets.

Create bucket and upload the Froyo (recipes & suppliers) pdf files

  1. In the Google Cloud console, go to the Cloud Storage Buckets page.
  2. Click Create.
  3. On the Create a bucket page, enter your bucket information. After each of the following steps, click Continue to proceed to the next step:
  4. In the Get started section, enter the bucket name. Eg.: froyo_data
  5. In the Choose where to store your data section, select Region and then enter your region. us-central1
  6. In the Choose how to control access to objects section, clear the Enforce public access prevention on this bucket checkbox.
  7. Click Create.
  8. In the list of buckets, click the bucket you created.
  9. In the Objects tab for the bucket, click Upload and then Upload folders.
  10. Select the recipes folder that you extracted in the Before you begin section of this codelab.
  11. Click Upload.
  12. Repeat the upload process for the suppliers folder.

Once uploaded, your bucket structure should look like (whatever the bucket name is):

596b8acb481016b7.png

4. BigQuery Connection Setup

Create a Cloud Resource Connection. This generates a unique Service Account that acts as BigQuery's "ID card" to access external files.

  • Go to the BigQuery page.
  • In the left pane, click Explorer. If you don't see the left pane, click Expand left pane to open the pane.
  • In the Explorer pane, expand your project name, and then click Connections.
  • On the Connections page, click Create connection.
  • For Connection type, choose Vertex AI remote models, remote functions, BigLake and Spanner (Cloud Resource).
  • In the Connection ID field, enter the connection ID name:
  • bq-connection. Make sure to note this ID as you'll need it when you set up the data scan later in this codelab.
  • Set Location type to Region and then select a region. For example, us-central1. The connection should be located in the same region as your other resources such as datasets.
  • Click Create connection.
  • Click Go to connection.
  • In the Connection info pane, copy the service account ID for use in a later step. The service account looks similar to bqcx-**********-qn3a@gcp-sa-bigquery-condel.iam.gserviceaccount.com.

5. Permissions Setup

  1. Grant the necessary permissions to the BigQuery connection for accessing Cloud Storage objects and the Knowledge Catalog

Go to the IAM & Admin page and in the View by Principals section, click the Grant access button, add a principal by pasting the service account you copied in the last step. In the roles section, add the names of the following roles one by one and save:

  • roles/storage.objectUser
  • roles/storage.objectViewer
  • roles/bigquery.user
  • roles/bigquery.dataEditor
  • roles/aiplatform.viewer
  • roles/agentplatform.user
  • roles/storage.admin
  • roles/dataproc.serviceAgent
  • roles/dataplex.discoveryPublishingServiceAgent
  • roles/dataplex.serviceAgent
  • roles/dataplex.securityAdmin
  1. Grant Dataplex Service Account permissions to access the Cloud Storage Bucket

Go to IAM & Admin page and in the View by Principals section, click Grant access button and add a principal by typing Dataplex into the New principal text bar. From the list that auto completes, select the Dataplex Service Account principal that looks similar to this:

service-*********@ gcp-sa-dataplex.iam.gserviceaccount.com

Grant the following roles to this Service Account:

  • roles/storage.objectUser
  • roles/storage.objectViewer
  • roles/storage.viewer
  • roles/dataplex.discoveryBigLakePublishingServiceAgentroles/storage.objectUser
  • roles/storage.objectViewer
  • roles/storage.viewer
  • roles/dataplex.discoveryBigLakePublishingServiceAgentroles/storage.objectUser
  • roles/storage.objectViewer
  • roles/storage.viewer
  • roles/dataplex.discoveryBigLakePublishingServiceAgent

6. Knowledge Catalog Setup

Build a Knowledge Catalog to unify the unstructured data and automate the discovery of unstructured files (such as PDF recipes and PDF suppliers).

  1. Create the DataScan through curl

You can create this job from the console too but in this section, we will create scans for the Cloud Storage bucket by adding the datascan_ID and pointing it to the BigQuery datasets. After that Knowledge Catalog will automatically create entries for your PDFs in BigQuery.

Run the below set of commands from the Cloud Shell Terminal in your current active project of interest:

# 1. Set your variables
PROJECT_ID="<PROJECT_ID>"
REGION="<REGION>"
ENV_SUFFIX="stg1"
DATASCAN_ID="froyo-data-${ENV_SUFFIX}"
BUCKET_NAME="<BUCKET_NAME>"

# 2. Set this to the Name of the connection you created in Step 7
CONNECTION_ID="<CONNECTION_ID_NAME>"

# 3. Define the API Endpoint
DATAPLEX_API="dataplex.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}"

# 4. Create the DataScan via CURL
echo "Creating Dataplex DataScan: ${DATASCAN_ID}..."

curl -X POST "https://$DATAPLEX_API/dataScans?dataScanId=${DATASCAN_ID}" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{
"data": {
   "resource": "//storage.googleapis.com/projects/'"${PROJECT_ID}"'/buckets/'"${BUCKET_NAME}"'"
   },
"executionSpec": {
   "trigger": {
      "on_demand": {}
   }
},
"dataDiscoverySpec": {
   "bigqueryPublishingConfig": {
      "tableType": "BIGLAKE",
      "connection": "projects/'"${PROJECT_ID}"'/locations/'"${REGION}"'/connections/'"${CONNECTION_ID}"'"
   },
   "storageConfig": {
      "unstructuredDataOptions": {
      "entity_inference_enabled": true
      }
   }
   }
}'

Replace the values of placeholder Variables above in order to get the command to run successfully. Remember this just creates the job, does not run it.

  1. After a minute or 2, in the Google Cloud console, go to the Metadata curation page.
  2. In the Cloud Storage discovery tab, click the Name of the discovery scan item (remember not to click the source link which might be a tempting choice, but it will open storage).

31afb1dc3caba5f6.png

  1. Critical Step:

Edit the job and make sure the "Enable semantic inference" checkbox is checked. If not, check it, enter project id and region and then click save.

c021927bf80fe204.png

Once done, you can either run the following command from Cloud Shell Terminal or click the Run now button in the Cloud Storage Discovery page:

gcloud dataplex datascans run $DATASCAN_ID --location=$REGION
  1. It will take some time to complete the scan job. Once the job finishes, check if the Published dataset is present. To check the job status, you can check in the Metadata curation page, in the Cloud Storage discovery tab, click the name of the discovery scans of the recent run. you should see the published dataset as seen below:

d0cd6ae1edb81ce9.png

Note: If you run into errors in the scan step, just give it some time and then retry (it takes a few minutes to get the job created and complete execution).

You can view the table in BigQuery by clicking and navigating to the froyo_data dataset. Click the table id in BigQuery and run the below query in the Query Editor tab:

SELECT count(*) FROM `agent-data-cloud.froyo_data.froyo_data` LIMIT 1000;

This results in 400 (if not you can go back and run the Datascan job again).

7. Semantic Data Extraction

Great!! Now lets extract the inference for these unstructured objects using Knowledge Catalog.

We will use the Insights feature to generate SQL statements to extract structured data from the unstructured table

  1. In the Google Cloud console, go to the Knowledge Catalog Search page.
  2. Search for the dataset table that you want to view insights for. In the search bar, enter the dataset / table name from previous step: "froyo_data" and hit enter
  3. From the result list, click the TABLE entry (not the dataset one)
  4. You should see the INSIGHTS tab. Click that (if it requires you to enable any API, follow along and just enable APIs).

If you ended up enabling APIs at this point, you have to re-run the scan job again.

  1. In the INSIGHTS tab, you will see the EXTRACT button drop-down. Click that and select "Extract with SQL" option.

89809b34c610569d.png

In the "Extract with SQL" dialog pop up, set the DESTINATION dataset as the one that you saw in the result of the Datascan job. Start typing its name and it should show up in autocomplete. Click the "Extract" button. Alternatively you can create a new dataset at this point and extract.

This should open BigQuery Query Editor with a tab open populated with extracted SQL from the data scan inference.

8. SQL Validation & Schema Creation

If the generated query seems all good, and semantically relevant to your unstructured data, go ahead and run it by clicking the Run button on the query editor. It will take a few minutes to create the schema required for the structured storage of your unstructured media.

Once done, you should be able to verify the schema by expanding the dataset in the explorer pane of the BigQuery Studio as seen below:

6e65906240a9a8be.png

Alright!!! It was so sweet that we did all those database things really fast. Now it's time for the ultimate test!

9. The Ultimate Test!!!

Let's say I want my agent to respond to the user's questions with real, complete and well orchestrated information grounded in facts. I am going to ask a question that the agent would only be able to answer by referring to multiple media files and references from my source.

Here is my user question:

I'm really interested in your Midnight Swirl froyo. Are there any allergens in it?

Now a generic search or an LLM search will say "Zero ingredients". But we built a full semantic inference converting all our unstructured media into structured data. So here it goes, with a simple SQL that will fetch this information:

SELECT p.product_name, i.ingredient_name, a.allergen_name, category, stability
FROM froyo_data.consistsof c
INNER JOIN froyo_data.product p
  ON c.product_id = p.product_id
INNER JOIN froyo_data.ingredient i
  ON c.ingredient_id = i.ingredient_name
LEFT OUTER JOIN froyo_data.containsallergen a
  ON i.ingredient_id = a.ingredient_id
WHERE
  UPPER(p.product_name) LIKE '%MIDNIGHT%SWIRL%'
  AND allergen_name IS NOT NULL;

Woohoo! Look at the result:

f582d55814a23e8.png

10. Clean up

Once this lab is done, do not forget to delete the scan job and the BigQuery tables the job ended up creating.

Go to https://console.cloud.google.com/bigquery/governance/metadata-curation/cloud-storage-discovery. Select the job you want to delete by clicking on the vertical ellipsis next to it and click DELETE.

9fcbbfa1a9ccba6d.png

It should clean up the job.

11. Congratulations

Our implementation was successfully able to identify the hidden allergen. No more dark data, people!!! In part 2, we will federate this BigQuery data in a transactional system with AlloyDB to unify the data needs for our agentic application.