Last Updated: 2019-08-23

This codelab demonstrates a near real-time data ingestion pattern for loading FHIR STU3-formatted healthcare data (Regular Resources and Bundles) into BigQuery (BQ) using the Cloud Healthcare FHIR API. Realistic healthcare test data has been generated and made available in a Google Cloud Storage bucket for you.

In this codelab you will learn:

What do you need to run this demo?

If you don't have a GCP Project, follow these steps to create a new GCP Project.

FHIR STU3 resources and bundles in JSON format have been pre-loaded into a GCS bucket at:

All of the resources above are in JSON format, but with different content structures.

If you need a new dataset, you can always generate it using Synthea™, then upload it to GCS instead of copying it from the bucket at the Ingest Data step. However, you might need to transform the generated files to match the expected structure first.

Follow the instructions in this section to enable the Healthcare API and grant the required permissions.

From this step onward, whenever you need to execute commands in a command-line interface (CLI), the expectation is that you are using Cloud Shell, or another shell with the Cloud SDK installed and initialized.

Initialize shell variables for your environment

To find the PROJECT_NUMBER and PROJECT_ID, refer to Identifying projects.

<!-- CODELAB: Initialize shell variables -->
<!-- Your current project ID -->
PROJECT_ID=<PROJECT_ID>
<!-- Your current project number -->
PROJECT_NUMBER=<PROJECT_NUMBER>
<!-- Name of healthcare and BigQuery dataset in your current project -->
DATASET_ID=<DATASET_ID>
<!-- Name of the fhir store in the healthcare dataset -->
FHIR_STORE=<FHIR_STORE>
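
Alternatively, if your shell is already pointed at the target project, the first two values can be filled in programmatically (a small sketch; it assumes your active gcloud configuration is set to the correct project):

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')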

Enable Healthcare API

Enabling the API adds the Healthcare API service account to the project.

  1. Go to the GCP Console API Library.
  2. From the projects list, select your project.
  3. In the API Library, select the API you want to enable. If you need help finding the API, use the search field and the filters.
  4. On the API page, click ENABLE.
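
If you prefer the CLI, the API can also be enabled with gcloud (assuming the standard healthcare.googleapis.com service name):

gcloud services enable healthcare.googleapis.com --project=$PROJECT_ID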

Grant Permissions

Before ingesting FHIR resources into the Cloud Healthcare API and exporting them to BigQuery, you must grant additional permissions to the Cloud Healthcare Service Agent service account. For more information, see FHIR store BigQuery permissions.

Grant BigQuery Admin Permissions

Execute command:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-healthcare.iam.gserviceaccount.com \
    --role=roles/bigquery.admin
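
To confirm the binding took effect, you can inspect the project's IAM policy (a verification sketch using gcloud's standard filtering flags):

gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:gcp-sa-healthcare.iam.gserviceaccount.com" \
    --format="table(bindings.role)"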

Follow these steps to prepare the GCP project for data ingestion:

Create a BigQuery Dataset

bq mk --location=us --dataset $PROJECT_ID:$DATASET_ID

Create Healthcare Dataset and FHIR Store

Create Healthcare dataset using Cloud Healthcare APIs

gcloud alpha healthcare datasets create $DATASET_ID --location=us-central1
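
You can verify the dataset was created by listing the datasets in the region:

gcloud alpha healthcare datasets list --location=us-central1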

Create a FHIR store in the dataset using a POST request to the Healthcare API fhirStores endpoint

To enable near real-time streaming of data from the FHIR store to a BigQuery dataset, the streamConfigs section must be included in the request. Execute the curl request:

curl -X POST \
  "https://healthcare.googleapis.com/v1alpha2/projects/$PROJECT_ID/locations/us-central1/datasets/$DATASET_ID/fhirStores?fhirStoreId=$FHIR_STORE" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d "{
  \"name\": \"$FHIR_STORE\",
  \"enableUpdateCreate\": true,
  \"disableReferentialIntegrity\": \"true\",
  \"streamConfigs\": [
    {
    \"resourceTypes\":[],        
    \"bigqueryDestination\": {
      \"datasetUri\": \"bq://$PROJECT_ID.$DATASET_ID\",
      \"schemaConfig\":
        { 
        \"schemaType\": \"ANALYTICS\",
        \"recursiveStructureDepth\": \"2\"
        }
      }
    }
  ]
}"

StreamConfig processes only the Create, Update, and Delete operations applied to Cloud Healthcare API resources, so ingestion should be done with POST requests targeting the fhir.create method. Resources and Bundles should be ingested one at a time.

We recommend using the Regular Resources, since they showcase streaming better. If both sets are ingested, you will get duplicate records for each resource.

We have provided a bash script that reads a JSON file line by line and inserts each record into the FHIR store.
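
For reference, here is a minimal sketch of what such a script might look like. This is an illustration, not the provided script; it assumes each line of every .json file in the given directory is one complete FHIR resource with a top-level resourceType field:

#!/bin/bash
# Illustrative sketch only -- not the provided script.
# Usage: ./sketch.sh <directory_with_json_files>
TOKEN=$(gcloud auth print-access-token)
for file in "$1"/*.json; do
  while IFS= read -r resource; do
    # Pull the first resourceType value to build the fhir.create target URL.
    type=$(printf '%s\n' "$resource" | grep -o '"resourceType" *: *"[^"]*"' | head -n 1 | cut -d'"' -f4)
    [ -z "$type" ] && continue
    curl -s -X POST \
      "https://healthcare.googleapis.com/v1alpha2/projects/$PROJECT_ID/locations/us-central1/datasets/$DATASET_ID/fhirStores/$FHIR_STORE/fhir/$type" \
      -H "Authorization: Bearer $TOKEN" \
      -H 'Content-Type: application/json' \
      -d "$resource"
  done < "$file"
done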

Use preloaded files from the Google Cloud Storage (GCS) bucket. These files contain FHIR STU3 Regular Resources and Transaction Bundles in JSON format.

Regular Resources

Copy the data from GCS using the gsutil tool:

gsutil -m cp -r gs://hcls-public-data-fhir-subset/fhir_stu3_json .

Copy the script from GCS and make it executable:

gsutil cp gs://hcls-codelabs-scripts/fhir-realtime-ingestion/ingest_regular_resources.sh . \
&& chmod u+x ingest_regular_resources.sh

Execute the script:

./ingest_regular_resources.sh fhir_stu3_json

Transaction Bundles

Copy the data from GCS using the gsutil tool:

gsutil -m cp -r gs://hcls-public-data-fhir-subset/fhir_stu3_transaction_json .

Copy the script from GCS and make it executable:

gsutil cp gs://hcls-codelabs-scripts/fhir-realtime-ingestion/ingest_transaction_bundles.sh . \
&& chmod u+x ingest_transaction_bundles.sh

Execute the script:

./ingest_transaction_bundles.sh fhir_stu3_transaction_json

The scripts take about 2–4 minutes to run. Resources are added one at a time, so you can track progress in near real time using the instructions in the next section.

The scripts also write logs to the same directory the resources are imported from, using the extension "*.import.log".
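
For example, to follow the regular-resources log while that script runs (assuming the fhir_stu3_json directory used above):

tail -f fhir_stu3_json/*.import.log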

Validate that resources have been streamed to BQ successfully. Re-run the commands in this section to watch progress.

Validate that the BigQuery Dataset has all Tables and Views

There should be 17 tables and 17 corresponding views.

bq ls $PROJECT_ID:$DATASET_ID
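
To count them directly, you can grep the Type column of the bq ls output (a quick sketch):

bq ls --max_results=100 $PROJECT_ID:$DATASET_ID | grep -cw TABLE
bq ls --max_results=100 $PROJECT_ID:$DATASET_ID | grep -cw VIEW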

Query tables to count resources

bq query --nouse_legacy_sql "SELECT
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.AllergyIntolerance\`) as AllergyIntolerance,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Basic\`) as Basic,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.CarePlan\`) as CarePlan,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Claim\`) as Claim,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Condition\`) as Condition,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.DiagnosticReport\`) as DiagnosticReport" && \
bq query --nouse_legacy_sql "SELECT
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Encounter\`) as Encounter,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.ExplanationOfBenefit\`) as ExplanationOfBenefit,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Goal\`) as Goal,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.ImagingStudy\`) as ImagingStudy,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Immunization\`) as Immunization" && \
bq query --nouse_legacy_sql "SELECT
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.MedicationRequest\`) as MedicationRequest,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Observation\`) as Observation,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Organization\`) as Organization,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Patient\`) as Patient,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Practitioner\`) as Practitioner,
  (SELECT count(*) FROM \`$PROJECT_ID.$DATASET_ID.Procedure\`) as Procedure"

After ingestion has finished, the expected result is:

+--------------------+-------+----------+-------+-----------+------------------+
| AllergyIntolerance | Basic | CarePlan | Claim | Condition | DiagnosticReport |
+--------------------+-------+----------+-------+-----------+------------------+
|                  7 |    12 |       23 |   315 |        57 |               44 |
+--------------------+-------+----------+-------+-----------+------------------+
+-----------+----------------------+------+--------------+--------------+
| Encounter | ExplanationOfBenefit | Goal | ImagingStudy | Immunization |
+-----------+----------------------+------+--------------+--------------+
|       265 |                  265 |    5 |            2 |          178 |
+-----------+----------------------+------+--------------+--------------+
+-------------------+-------------+--------------+---------+--------------+-----------+
| MedicationRequest | Observation | Organization | Patient | Practitioner | Procedure |
+-------------------+-------------+--------------+---------+--------------+-----------+
|                50 |        1024 |           19 |      12 |           19 |       248 |
+-------------------+-------------+--------------+---------+--------------+-----------+

Apply the fhir.update and fhir.delete methods to update and delete data, then validate that the changes are reflected in the BQ table and view.

Query Patients and Set Variables

Execute the query to get the IDs of the two patients that you will update and delete:

bq query --nouse_legacy_sql "SELECT id, name.family, name.given, meta_tag.code
 FROM ((SELECT id, name, meta FROM \`$PROJECT_ID.$DATASET_ID.Patient\`) as P
 CROSS JOIN UNNEST(P.name) as name
 CROSS JOIN UNNEST(P.meta.tag) as meta_tag)  
  WHERE name.family in ('Bruen238', 'Funk324')
  ORDER BY name.family"

The expected output should look like the following, except that the ids are real values. Both codes are CREATE, which means the last operation applied to these patients was fhir.create:

+--------------------------------------+----------+-----------------+--------+
|                  id                  |  family  |      given      |  code  |
+--------------------------------------+----------+-----------------+--------+
|            <patient_1_id>            | Bruen238 | ["Svetlana462"] | CREATE |
|            <patient_2_id>            | Funk324  | ["Rocco842"]    | CREATE |
+--------------------------------------+----------+-----------------+--------+

Export the patient IDs to environment variables:

export PATIENT_1_ID=<patient_1_id>
export PATIENT_2_ID=<patient_2_id>
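
If you prefer to capture the IDs programmatically instead of copy-pasting, something like the following works (a sketch using bq's CSV output; the family names come from the query above):

export PATIENT_1_ID=$(bq query --nouse_legacy_sql --format=csv \
  "SELECT DISTINCT id FROM \`$PROJECT_ID.$DATASET_ID.Patient\`, UNNEST(name) AS n WHERE n.family = 'Bruen238'" | tail -n 1)
export PATIENT_2_ID=$(bq query --nouse_legacy_sql --format=csv \
  "SELECT DISTINCT id FROM \`$PROJECT_ID.$DATASET_ID.Patient\`, UNNEST(name) AS n WHERE n.family = 'Funk324'" | tail -n 1)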

Get first Patient and Change Data

Request the patient resource and save it to a file:

curl -X GET \
  "https://healthcare.googleapis.com/v1alpha2/projects/$PROJECT_ID/locations/us-central1/datasets/$DATASET_ID/fhirStores/$FHIR_STORE/fhir/Patient/$PATIENT_1_ID" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' > Patient_$PATIENT_1_ID.json

Update the name.family field to emulate the first patient getting married and taking their spouse's family name:

sed -i 's/Bruen238/Bruen239/g' Patient_$PATIENT_1_ID.json

Update first Patient

Submit changes to the FHIR Store:

curl -X PUT \
  "https://healthcare.googleapis.com/v1alpha2/projects/$PROJECT_ID/locations/us-central1/datasets/$DATASET_ID/fhirStores/$FHIR_STORE/fhir/Patient/$PATIENT_1" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d @Patient_$PATIENT_1_ID.json

Delete second Patient

Delete the second patient from the FHIR Store:

curl -X DELETE \
  "https://healthcare.googleapis.com/v1alpha2/projects/$PROJECT_ID/locations/us-central1/datasets/$DATASET_ID/fhirStores/$FHIR_STORE/fhir/Patient/$PATIENT_2_ID" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json'

Validate that all changes have been applied to the BQ tables and views as expected.

All Create, Update, and Delete operations applied to Cloud Healthcare API resources are appended by StreamConfig to the BQ tables, while the views always show the latest state of each resource.

Query Patient Table

Include the new family name in the WHERE condition:

bq query --nouse_legacy_sql "SELECT id, name.family, name.given, meta_tag.code
 FROM ((SELECT id, name, meta FROM \`$PROJECT_ID.$DATASET_ID.Patient\`) as P
 CROSS JOIN UNNEST(P.name) as name
 CROSS JOIN UNNEST(P.meta.tag) as meta_tag)  
  WHERE name.family in ('Bruen238', 'Bruen239', 'Funk324')
  ORDER BY name.family"

Validate that double entries appear for both patients, with the new entries carrying the codes UPDATE and DELETE. Also, the original patient "Svetlana462 Bruen238" has the new family name in the output:

+--------------------------------------+----------+-----------------+--------+
|                  id                  |  family  |      given      |  code  |
+--------------------------------------+----------+-----------------+--------+
|            <patient_1_id>            | Bruen238 | ["Svetlana462"] | CREATE |
|            <patient_1_id>            | Bruen239 | ["Svetlana462"] | UPDATE |
|            <patient_2_id>            | Funk324  | ["Rocco842"]    | DELETE |
|            <patient_2_id>            | Funk324  | ["Rocco842"]    | CREATE |
+--------------------------------------+----------+-----------------+--------+

Query PatientView

Run the same query, but against PatientView:

bq query --nouse_legacy_sql "SELECT id, name.family, name.given, meta_tag.code
 FROM ((SELECT id, name, meta FROM \`$PROJECT_ID.$DATASET_ID.PatientView\`) as P
 CROSS JOIN UNNEST(P.name) as name
 CROSS JOIN UNNEST(P.meta.tag) as meta_tag)  
  WHERE name.family in ('Bruen238', 'Bruen239', 'Funk324')
  ORDER BY name.family"

Validate that there is only one record for the original patient "Svetlana462 Bruen238", now with the new family name "Bruen239". Patient "Rocco842 Funk324" has been removed:

+--------------------------------------+----------+-----------------+--------+
|                  id                  |  family  |      given      |  code  |
+--------------------------------------+----------+-----------------+--------+
|             <patient_1_id>           | Bruen239 | ["Svetlana462"] | UPDATE |
+--------------------------------------+----------+-----------------+--------+

To ensure no issues appear during the sync to BQ, you can monitor the logs. To access them, open the Google Cloud Console for your project, and in the navigation menu (upper-left corner) click Logging under the Stackdriver section.

In the Logging menu, select BigQuery Dataset as the resource and the log level you are interested in. The UI shows the relevant logs interactively.
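
The same logs can also be read from the CLI (a sketch with gcloud; the exact resource type and filter may need adjusting for your project):

gcloud logging read 'resource.type="bigquery_resource" AND severity>=WARNING' \
    --limit=10 --format=json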

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, you can clean up the resources that you created on GCP so they won't take up your quota, and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.
    GO TO THE PROJECTS PAGE
  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the Cloud Healthcare API dataset

Follow the steps to delete the Healthcare API dataset using either the GCP Console or the gcloud CLI.

Quick CLI command:

gcloud alpha healthcare datasets delete $DATASET_ID --location=us-central1

Delete the BigQuery dataset

Follow the steps to delete the BigQuery dataset using the interface of your choice.

Quick CLI command:

bq rm -r -f $PROJECT_ID:$DATASET_ID

Congratulations, you've successfully completed the codelab to ingest healthcare data into BigQuery in near real time using the Cloud Healthcare API.

You've created a FHIR Store with streaming support.

You've imported FHIR STU3 data from Google Cloud Storage into the Cloud Healthcare FHIR API.

You now know the key steps required to start your Healthcare Data Analytics journey with BigQuery on Google Cloud Platform.