Ingest FHIR (Fast Healthcare Interoperability Resources) to BigQuery

1. Introduction

Last Updated: 2020-05-16

This codelab demonstrates a data ingestion pattern to ingest FHIR R4 formatted healthcare data (Regular Resources) into BigQuery using Cloud Healthcare FHIR APIs. Realistic healthcare test data has been generated and made available in the Google Cloud Storage bucket (gs://hcls_testing_data_fhir_10_patients/) for you.

In this codelab, you will learn:

  • How to import FHIR R4 resources from GCS into Cloud Healthcare FHIR Store.
  • How to export FHIR data from FHIR Store to a Dataset in BigQuery.

What do you need to run this demo?

  • You need access to a GCP Project.
  • You must have the Owner role on the GCP Project.
  • FHIR R4 resources in NDJSON format (content-structure=RESOURCE)

If you don't have a GCP Project, follow these steps to create a new GCP Project.

FHIR R4 resources in NDJSON format have been pre-loaded into a GCS bucket at the following location:

  • gs://hcls_testing_data_fhir_10_patients/fhir_r4_ndjson/ - Regular Resources

The files above use the newline-delimited JSON (NDJSON) file format with the following content structure:

  • Regular Resources in NDJSON format - each line in the file contains a core FHIR resource in JSON format (such as Patient or Observation). Each NDJSON file contains FHIR resources of the same resource type: for example, Patient.ndjson contains one or more FHIR resources of resourceType = 'Patient', and Observation.ndjson contains one or more FHIR resources of resourceType = 'Observation'. A sample line is shown below.
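
For illustration, a single line of Patient.ndjson might look like the following (a minimal, hypothetical resource; the actual test data contains many more fields):

{"resourceType": "Patient", "id": "example-patient-1", "gender": "female", "birthDate": "1970-01-01"}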

If you need a new dataset, you can always generate it using Synthea™. Then upload it to GCS instead of using the bucket provided in this codelab.

2. Project Setup

Follow these steps to enable the Healthcare API and grant the required permissions:

Initialize shell variables for your environment

To find the PROJECT_NUMBER and PROJECT_ID, refer to Identifying projects.
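
If you prefer the CLI, you can also look these up with gcloud (an optional convenience; it assumes gcloud is installed and authenticated):

# Print the active project ID
gcloud config get-value project

# Print the project number for a given project ID
gcloud projects describe <PROJECT_ID> --format="value(projectNumber)"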

# Replace the <...> placeholders with values for your environment
export PROJECT_ID=<PROJECT_ID>
export PROJECT_NUMBER=<PROJECT_NUMBER>
export SRC_BUCKET_NAME=hcls_testing_data_fhir_10_patients
export BUCKET_NAME=<BUCKET_NAME>
export DATASET_ID=<DATASET_ID>
export FHIR_STORE=<FHIR_STORE>
export BQ_DATASET=<BQ_DATASET>

Enable Healthcare API

The following steps enable the Healthcare API in your GCP project via the GCP Console and add the Healthcare API service account to the project. A gcloud alternative is shown after the list.

  1. Go to the GCP Console API Library.
  2. From the projects list, select your project.
  3. In the API Library, select the API you want to enable. If you need help finding the API, use the search field and the filters.
  4. On the API page, click ENABLE.
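
Alternatively, you can enable the Healthcare API from the CLI (equivalent to the Console steps above):

gcloud services enable healthcare.googleapis.com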

Get access to the synthetic dataset

  1. From the email address you use to log in to the Cloud Console, send an email to hcls-solutions-external+subscribe@google.com requesting to join.
  2. You will receive an email with instructions on how to confirm the action.
  3. Use the option to respond to the email to join the group.
  4. DO NOT click the Join group button. It doesn't work.
  5. Once you receive the confirmation email, you can proceed to the next step in the codelab.

Create a Google Cloud Storage bucket in your GCP Project

gsutil mb gs://$BUCKET_NAME
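
By default, gsutil mb creates the bucket in the US multi-region; you can optionally pin it to a specific location instead (shown here with us-central1, matching the Healthcare dataset created later):

gsutil mb -l us-central1 gs://$BUCKET_NAME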

Copy synthetic data to your GCP Project

gsutil -m cp gs://$SRC_BUCKET_NAME/fhir_r4_ndjson/**.ndjson \
gs://$BUCKET_NAME/fhir_r4_ndjson/
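
You can verify the copy succeeded by listing the files in your bucket (optional check):

gsutil ls gs://$BUCKET_NAME/fhir_r4_ndjson/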

Grant Permissions

Before importing FHIR resources from Cloud Storage and exporting to BigQuery, you must grant additional permissions to the Cloud Healthcare Service Agent service account. For more information, see FHIR store Cloud Storage and FHIR store BigQuery permissions.

Grant Storage Object Viewer Permission

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-healthcare.iam.gserviceaccount.com \
  --role=roles/storage.objectViewer

Grant BigQuery Admin Permissions

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-healthcare.iam.gserviceaccount.com \
  --role=roles/bigquery.admin
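
To confirm the bindings took effect, you can inspect the project's IAM policy for the service agent (an optional check using standard gcloud list/filter flags):

gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --format="table(bindings.role)" \
  --filter="bindings.members:service-$PROJECT_NUMBER@gcp-sa-healthcare.iam.gserviceaccount.com"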

3. Environment Setup

Follow these steps to ingest data from the NDJSON files into a Healthcare dataset, and then into BigQuery, using the Cloud Healthcare FHIR APIs:

Create Healthcare Dataset and FHIR Store

Create Healthcare dataset using Cloud Healthcare APIs

gcloud beta healthcare datasets create $DATASET_ID --location=us-central1

Create FHIR Store in dataset using Cloud Healthcare APIs

gcloud beta healthcare fhir-stores create $FHIR_STORE \
  --dataset=$DATASET_ID --location=us-central1 --version=r4
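
You can confirm that both resources were created (optional check):

gcloud beta healthcare datasets list --location=us-central1
gcloud beta healthcare fhir-stores list --dataset=$DATASET_ID --location=us-central1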

4. Import data into FHIR Store

Import test data from Google Cloud Storage to FHIR Store.

We will use the preloaded files from the GCS bucket. These files contain FHIR R4 regular resources in NDJSON format. Because the import runs asynchronously, the response contains an OPERATION_NUMBER, which can be used in the validation step.

Import Regular Resources from the GCS Bucket in your GCP Project

gcloud beta healthcare fhir-stores import gcs $FHIR_STORE \
  --dataset=$DATASET_ID --async \
  --gcs-uri=gs://$BUCKET_NAME/fhir_r4_ndjson/**.ndjson \
  --location=us-central1 --content-structure=RESOURCE

Validate

Validate that the operation finished successfully. The operation might take a few minutes to finish, so you might need to repeat this command a few times with some delay.

gcloud beta healthcare operations describe OPERATION_NUMBER \
  --dataset=$DATASET_ID --location=us-central1
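
If you no longer have the OPERATION_NUMBER from the import response, you can list recent long-running operations for the dataset and pick it up from the operation name:

gcloud beta healthcare operations list --dataset=$DATASET_ID --location=us-central1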

5. Export data from FHIR Store to BigQuery

Create a BigQuery Dataset

bq mk --location=us --dataset $PROJECT_ID:$BQ_DATASET
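
You can confirm the dataset was created (optional check):

bq show --format=prettyjson $PROJECT_ID:$BQ_DATASET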

Export healthcare data from FHIR Store to BigQuery Dataset

gcloud beta healthcare fhir-stores export bq $FHIR_STORE \
  --dataset=$DATASET_ID --location=us-central1 --async \
  --bq-dataset=bq://$PROJECT_ID.$BQ_DATASET \
  --schema-type=analytics

As with the import, the response contains an OPERATION_NUMBER, which can be used in the validation step.

Validate

Validate that the operation finished successfully:

gcloud beta healthcare operations describe OPERATION_NUMBER \
  --dataset=$DATASET_ID --location=us-central1

Validate that the BigQuery dataset contains all 16 tables (one per exported FHIR resource type):

bq ls $PROJECT_ID:$BQ_DATASET
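
Once the tables are listed, you can run a quick sanity query (an illustrative example; it assumes the export created a Patient table, which the analytics schema names after the resource type):

bq query --use_legacy_sql=false \
  "SELECT COUNT(*) AS patient_count FROM \`$PROJECT_ID.$BQ_DATASET.Patient\`"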

6. Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial, you can clean up the resources that you created on GCP so they won't take up your quota, and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.
  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
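
Alternatively, you can delete the project from the CLI:

gcloud projects delete $PROJECT_ID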

If you need to keep the project, you can delete the Cloud Healthcare dataset and the BigQuery dataset using the following instructions.

Delete the Cloud Healthcare API dataset

Follow the steps to delete the Healthcare API dataset using either the GCP Console or the gcloud CLI.

Quick CLI command:

gcloud beta healthcare datasets delete $DATASET_ID --location=us-central1

Delete the BigQuery dataset

Follow the steps to delete the BigQuery dataset using the interface of your choice.

Quick CLI command:

bq rm -r -f $PROJECT_ID:$BQ_DATASET

7. Congratulations

Congratulations, you've successfully completed the codelab to ingest healthcare data into BigQuery using the Cloud Healthcare APIs.

You imported FHIR R4-compliant synthetic data from Google Cloud Storage into a Cloud Healthcare FHIR store.

You exported data from the FHIR store to BigQuery.

You now know the key steps required to start your Healthcare Data Analytics journey with BigQuery on Google Cloud Platform.