Last Updated: 2019-10-07

This codelab implements a pattern for accessing and analyzing healthcare data aggregated in BigQuery, using the BigQuery UI and AI Platform Notebooks. It illustrates data exploration of large healthcare datasets using familiar tools like Pandas and Matplotlib in a HIPAA-compliant AI Platform Notebooks environment. The "trick" is to do the first part of your aggregation in BigQuery, pull the result back as a Pandas DataFrame, and then work with the smaller DataFrame locally. AI Platform Notebooks provides a managed Jupyter experience, so you don't need to run notebook servers yourself. It is well integrated with other GCP services like BigQuery and Cloud Storage, which makes it quick and simple to start your data analytics and ML journey on Google Cloud Platform.
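The aggregate-in-BigQuery, analyze-in-Pandas pattern can be sketched as follows. The helper function below only builds the SQL string; the commented-out client calls are illustrative and would require the google-cloud-bigquery package and application credentials to run:

```python
def build_avg_stay_query(dataset: str) -> str:
    """Build an aggregate query that runs server-side in BigQuery,
    so only a small per-class summary comes back to the notebook."""
    return f"""
    SELECT
      class.code AS encounter_class,
      ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP(period.end),
                               TIMESTAMP(period.start), MINUTE)), 1) AS avg_minutes
    FROM
      `{dataset}.Encounter`
    WHERE
      period.end >= period.start
    GROUP BY 1
    ORDER BY 2 DESC
    """

# To execute (requires google-cloud-bigquery and GCP credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# df = client.query(build_avg_stay_query("hcls-public-data.hcls_test_data")).to_dataframe()
# df.head()  # the small aggregated result, analyzed locally with Pandas
```

The heavy lifting (scanning the full Encounter table) stays in BigQuery; only a few rows per encounter class travel back to the notebook.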

In this codelab you will learn to:

What do you need to run this codelab?

If you don't have a GCP Project, follow these steps to create a new GCP Project.

For this codelab, we will use an existing dataset in BigQuery. This dataset is pre-populated with synthetic healthcare data.

  1. In GCP Console, select your project, then navigate to BigQuery.
  2. Click the +ADD DATA dropdown, and select "Pin a project".

  1. Enter the project name, "hcls-public-data", then click PIN. The BigQuery test dataset "hcls_test_data" is available for use.

BigQuery UI setting

  1. Navigate to the BigQuery console by selecting BigQuery from the top-left-corner ("hamburger") GCP menu.
  2. In the BigQuery console, click More → Query settings and ensure that the Legacy SQL menu is NOT checked (we will be using Standard SQL).

Build Queries

In the Query editor window, type the following query and click "Run" to run the query. Then, view the results in the Query results window.

  1. QUERY PATIENTS

#standardSQL - Query Patients
SELECT
  id AS patient_id,
  name[safe_offset(0)].given AS given_name,
  name[safe_offset(0)].family AS family,
  telecom[safe_offset(0)].value AS phone,
  birthDate AS birth_date,
  deceased.dateTime AS deceased_datetime,
  gender AS fhir_gender_code,
  address[safe_offset(0)].line AS address1_line_1,
  address[safe_offset(0)].city AS address1_city,
  address[safe_offset(0)].state AS address1_state,
  address[safe_offset(0)].postalCode AS address1_postalCode,
  address[safe_offset(0)].country AS address1_country
FROM 
  `BQ_DATASET_NAME.Patient` AS Patient 
LIMIT 10

  1. QUERY PRACTITIONERS

#standardSQL - Query Practitioners
SELECT 
  id AS practitioner_id, 
  name[safe_offset(0)].given AS given_name,
  name[safe_offset(0)].family AS family_name, 
  gender 
FROM 
  `BQ_DATASET_NAME.Practitioner` 
LIMIT 10

  1. QUERY ORGANIZATION

Change the organization id to match your dataset.

#standardSQL - Query Organization
SELECT
  id AS org_id,
  type[safe_offset(0)].text AS org_type,
  name AS org_name,
  address[safe_offset(0)].line AS org_addr,
  address[safe_offset(0)].city AS org_addr_city,
  address[safe_offset(0)].state AS org_addr_state,
  address[safe_offset(0)].postalCode AS org_addr_postalCode,
  address[safe_offset(0)].country AS org_addr_country
FROM 
  `BQ_DATASET_NAME.Organization` AS Organization
WHERE 
  id = "b81688f5-bd0e-3c99-963f-860d3e90ab5d"

  1. QUERY ENCOUNTERS BY PATIENT

#standardSQL - Query Encounters by Patient
SELECT
  id AS encounter_id,
  period.start AS encounter_start,
  period.end AS encounter_end,
  status AS encounter_status,
  class.code AS encounter_type,
  subject.patientId AS patient_id,
  participant[safe_offset(0)].individual.practitionerId AS practitioner_id,
  serviceProvider.organizationId AS encounter_location_id,
  type[safe_offset(0)].text AS encounter_reason
FROM
  `BQ_DATASET_NAME.Encounter` AS Encounter
WHERE
  subject.patientId = "295078f8-de60-4fed-bab7-c08642727aba"
ORDER BY
  encounter_end

  1. GET AVG LENGTH OF ENCOUNTERS BY ENCOUNTER TYPE

#standardSQL - Get Average length of Encounters by Encounter type 
SELECT
  class.code AS encounter_class,
  ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP(period.end), TIMESTAMP(period.start), MINUTE)), 1) AS avg_minutes
FROM
  `BQ_DATASET_NAME.Encounter` AS Encounter
WHERE
  period.end >= period.start
GROUP BY
  1
ORDER BY
  2 DESC

  1. GET ALL PATIENTS WHO HAVE A1C RATE >= 6.5

#standardSQL - Query Patients who have A1C rate >= 6.5
SELECT 
  id AS observation_id,
  subject.patientId AS patient_id,
  context.encounterId AS encounter_id,
  value.quantity.value,
  value.quantity.unit,
  code.coding[safe_offset(0)].code,
  code.coding[safe_offset(0)].display AS description
FROM
  `hcls-public-data.hcls_test_data.Observation` 
WHERE 
  code.text like '%A1c/Hemoglobin%' AND 
  value.quantity.value >= 6.5 AND 
  status = 'final'
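If you later run this query from Python, the 6.5 threshold is a natural candidate for a BigQuery query parameter rather than string concatenation. A sketch (the parameter name `a1c_min` is illustrative, and the commented-out client calls require google-cloud-bigquery and credentials):

```python
# Parameterized version of the A1C query; @a1c_min is bound at query time.
A1C_QUERY = """
SELECT
  id AS observation_id,
  subject.patientId AS patient_id,
  value.quantity.value,
  value.quantity.unit
FROM
  `hcls-public-data.hcls_test_data.Observation`
WHERE
  code.text LIKE '%A1c/Hemoglobin%'
  AND value.quantity.value >= @a1c_min
  AND status = 'final'
"""

# To execute (requires google-cloud-bigquery and credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# job_config = bigquery.QueryJobConfig(query_parameters=[
#     bigquery.ScalarQueryParameter("a1c_min", "FLOAT64", 6.5),
# ])
# df = client.query(A1C_QUERY, job_config=job_config).to_dataframe()
```

Parameters avoid SQL injection issues and make it easy to re-run the same query with a different threshold.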

Follow the instructions in this link to create a new AI Platform Notebooks (JupyterLab) instance.

Please make sure to enable the Compute Engine API.

You can choose "Create a new notebook with default options" or "Create a new notebook and specify your instance options".

Open AI Platform Notebooks instance

In this section we will author and code a new Jupyter notebook from scratch.

  1. Open a notebook instance by navigating to the AI Platform Notebooks page in the Google Cloud Platform Console.
    GO TO THE AI PLATFORM NOTEBOOKS PAGE
  2. Select Open JupyterLab for the instance that you want to open.

  1. AI Platform Notebooks directs you to a URL for your notebook instance.

Create a notebook

  1. In JupyterLab, go to File → New → Notebook and select the "Python 3" kernel in the popup, or select "Python 3" under the Notebook section in the Launcher window, to create an Untitled.ipynb notebook.

  1. Right-click Untitled.ipynb and rename the notebook to "fhir_data_from_bigquery.ipynb". Double-click to open it, build the queries, and save the notebook.
  2. You can download a notebook by right-clicking the *.ipynb file and selecting Download from the menu.

  1. You can also upload an existing notebook by clicking the "Up arrow" button.

Build and execute each code block in the notebook

Provide the Python code in each block and click the Run (Triangle) icon to execute the code.

  1. Get length of stay for encounters in hours

from google.cloud import bigquery

client = bigquery.Client()

lengthofstay="""
SELECT
    class.code as encounter_class,
    period.start as start_timestamp,
    period.end as end_timestamp, 
    TIMESTAMP_DIFF(TIMESTAMP(period.end), TIMESTAMP(period.start), HOUR) as length_of_stay_in_hours
FROM 
    fhir_dataset.Encounter
WHERE
    period.end >= period.start
ORDER BY
    4 DESC
LIMIT 10
"""
df = client.query(lengthofstay).to_dataframe()
df.head()

  1. Get Observations - Cholesterol values

observation="""
SELECT
  cc.code loinc_code,
  cc.display loinc_name,
  approx_quantiles(round(o.value.quantity.value,1),4) as quantiles,
  count(*) as num_obs
FROM
  fhir_dataset.Observation o, o.code.coding cc
WHERE
  cc.system like '%loinc%' and lower(cc.display) like '%cholesterol%'
GROUP BY 1,2
ORDER BY 4 desc
"""
df2 = client.query(observation).to_dataframe()
df2.head()

  1. Get Approximate encounter quantiles

encounters="""
SELECT
  encounter_class,
  APPROX_QUANTILES(num_encounters, 4) num_encounters_quantiles
FROM (
  SELECT
    class.code encounter_class,
    subject.reference patient_id,
    COUNT(DISTINCT id) AS num_encounters
  FROM
    fhir_dataset.Encounter
  GROUP BY
    1,2
  )
GROUP BY 1
ORDER BY 1
"""
df3 = client.query(encounters).to_dataframe()
df3.head()
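For intuition, APPROX_QUANTILES(num_encounters, 4) returns five values per encounter class: the (approximate) minimum, the three quartiles, and the maximum. A small local sketch with made-up per-patient counts shows the equivalent computation in pandas (pandas computes exact quantiles, whereas BigQuery's are approximate):

```python
import pandas as pd

# Hypothetical per-patient encounter counts for one encounter class
counts = pd.Series([1, 1, 2, 2, 3, 4, 4, 5, 8])

# Pandas equivalent of APPROX_QUANTILES(num_encounters, 4):
# min, 25th, 50th, 75th percentiles, and max
quantiles = counts.quantile([0.0, 0.25, 0.5, 0.75, 1.0]).tolist()
print(quantiles)  # [1.0, 2.0, 3.0, 4.0, 8.0]
```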

  1. Get Average length of encounters in minutes

avgstay="""
SELECT
  class.code encounter_class,
  ROUND(AVG(TIMESTAMP_DIFF(TIMESTAMP(period.end), TIMESTAMP(period.start), MINUTE)),1) as avg_minutes
FROM
  fhir_dataset.Encounter
WHERE
  period.end >= period.start
GROUP BY
  1
ORDER BY
  2 DESC
  """
df4 = client.query(avgstay).to_dataframe()
df4.head()

  1. Get Encounters per patient

patientencounters="""
SELECT
  id AS encounter_id,
  period.start AS encounter_start,
  period.end AS encounter_end,
  status AS encounter_status,
  class.code AS encounter_type,
  subject.patientId AS patient_id,
  participant[safe_offset(0)].individual.practitionerId AS practitioner_id,
  serviceProvider.organizationId AS encounter_location_id,
  type[safe_offset(0)].text AS encounter_reason
FROM
  `fhir_dataset.Encounter` AS Encounter
WHERE
  subject.patientId = "295078f8-de60-4fed-bab7-c08642727aba"
ORDER BY
  encounter_end
"""

df5 = client.query(patientencounters).to_dataframe()
df5.head()

  1. Get Organizations

orgs="""
SELECT
  id AS org_id,
  type[safe_offset(0)].text AS org_type,
  name AS org_name,
  address[safe_offset(0)].line AS org_addr,
  address[safe_offset(0)].city AS org_addr_city,
  address[safe_offset(0)].state AS org_addr_state,
  address[safe_offset(0)].postalCode AS org_addr_postalCode,
  address[safe_offset(0)].country AS org_addr_country
FROM 
  `fhir_dataset.Organization` AS Organization
WHERE 
  id = "b81688f5-bd0e-3c99-963f-860d3e90ab5d"
"""

df6 = client.query(orgs).to_dataframe()
df6.head()

  1. Get Patients

patients="""
SELECT
  id AS patient_id,
  name[safe_offset(0)].given AS given_name,
  name[safe_offset(0)].family AS family,
  telecom[safe_offset(0)].value AS phone,
  birthDate AS birth_date,
  deceased.dateTime AS deceased_datetime,
  gender AS fhir_gender_code,
  address[safe_offset(0)].line AS address1_line_1,
  address[safe_offset(0)].city AS address1_city,
  address[safe_offset(0)].state AS address1_state,
  address[safe_offset(0)].postalCode AS address1_postalCode,
  address[safe_offset(0)].country AS address1_country
FROM 
  `fhir_dataset.Patient` AS Patient 
LIMIT 10
"""

df7 = client.query(patients).to_dataframe()
df7.head()

Execute the code cells in the notebook "fhir_data_from_bigquery.ipynb" to draw a bar graph.

For example, get the average length of Encounters in minutes.
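One way to draw that bar graph in a new notebook cell is sketched below. The numbers here are placeholder values standing in for the df4 result from the earlier cell; in the notebook you would call plot.bar on the real query result instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside JupyterLab
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values shaped like the df4 result (average encounter length);
# in the notebook, use the df4 DataFrame returned by the query instead.
df4 = pd.DataFrame({
    "encounter_class": ["IMP", "EMER", "AMB"],
    "avg_minutes": [3120.5, 180.2, 35.7],
})

ax = df4.plot.bar(x="encounter_class", y="avg_minutes", legend=False)
ax.set_xlabel("Encounter class")
ax.set_ylabel("Average length (minutes)")
plt.tight_layout()
plt.savefig("avg_encounter_length.png")
```

Inside JupyterLab the chart renders inline below the cell; saving to a PNG is only needed if you want the image as a file.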

  1. In the GCP Console, navigate to Source Repositories. If this is your first time using it, click Get started, then Create repository.

  1. On subsequent visits, navigate to GCP → Cloud Source Repositories and click +Add repository to create a new repository.

  1. Select "Create a new Repository", then click Continue.
  2. Provide Repository name and Project name, then click Create.

  1. Select "Clone your repository to a local Git repository", then select Manually generated credentials.
  2. Follow the instructions in step 1, "Generate and store Git credentials" (see below). Copy and paste all commands from your "Configure Git" window into the Jupyter terminal.

  1. Clone the repository: copy the repository clone path from Cloud Source Repositories in the GCP Console (step 2 in the screenshot below).

  1. In JupyterLab, select Git → Open terminal.
  2. In the terminal, paste this command:
git clone https://source.developers.google.com/p/hn-project-249821/r/my-ai-notebooks
  1. The "my-ai-notebooks" folder is created in JupyterLab.

  1. Create an untitled.txt file and rename it to ".gitignore".
  2. Move your notebook (fhir_data_from_bigquery.ipynb) into the folder "my-ai-notebooks".
  3. In the Jupyter terminal, change into the repository directory: "cd my-ai-notebooks".
  4. Stage your changes using the Jupyter terminal or the Jupyter UI: right-click the files in the Untracked area and select Stage to move them into the Tracked area (and vice versa); the Changed area contains modified files.
git remote add my-ai-notebooks https://source.developers.google.com/p/hn-project-249821/r/my-ai-notebooks

  1. Commit your changes using the Jupyter terminal or the Jupyter UI (type in the message, then click the check mark button).
git commit -m "message goes here"

  1. Push your changes to the remote repository using the Jupyter terminal or the Jupyter UI (click the "push committed changes" icon).
git push --all my-ai-notebooks
  1. In the GCP console, navigate to Source Repositories. Click on my-ai-notebooks. Notice that "fhir_data_from_bigquery.ipynb" is now saved in the GCP Source Repository.
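The stage/commit flow above can be rehearsed end to end in a terminal. The sketch below uses a throwaway local repository and a placeholder identity instead of the real Cloud Source Repositories remote:

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.email "you@example.com"   # placeholder identity
git config user.name "Your Name"

touch .gitignore fhir_data_from_bigquery.ipynb
git add .gitignore fhir_data_from_bigquery.ipynb   # stage
git commit -q -m "Add FHIR analysis notebook"      # commit

# Pushing to the real Cloud Source Repositories remote would then be:
# git push --all my-ai-notebooks
```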

To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab, clean up the resources you created on GCP after you've finished the tutorial, so they won't take up your quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Deleting the BigQuery dataset

Follow these instructions to delete the BigQuery dataset you created as part of this tutorial. Alternatively, if you used the test dataset hcls_test_data, navigate to the BigQuery console and unpin the project hcls-public-data.

Shutting down AI Platform Notebooks instance

Follow the instructions at https://cloud.google.com/ml-engine/docs/notebooks/shut-down to shut down an AI Platform Notebooks instance.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.
    GO TO THE PROJECTS PAGE
  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, then click Shut down to delete the project.

Congratulations, you've successfully completed the codelab to access, query, and analyze FHIR-formatted healthcare data using BigQuery and AI Platform Notebooks.

You accessed a public BigQuery dataset in GCP.

You developed and tested SQL queries using BigQuery UI.

You created and launched an AI Platform Notebooks instance.

You executed SQL queries in JupyterLab and stored the query results in Pandas DataFrames.

You created charts and graphs using Matplotlib.

You committed and pushed your notebook to a Cloud Source Repository in GCP.

You now know the key steps required to start your Healthcare Data Analytics journey with BigQuery and AI Platform Notebooks on the Google Cloud Platform.

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.