In this lab, you load a VCF file into BigQuery and analyze variants with BigQuery. The VCF file that you use is the output of the previous lab of this quest, in which variant calling from BAM records was carried out using DeepVariant.

What you learn

In this lab, you learn how to:

  1. Launch a genomics pipeline that loads a VCF file into a BigQuery table
  2. Monitor the pipeline job from Cloud Shell
  3. Analyze genomic variants using BigQuery queries

Step 1: Enable Genomics API

In the search bar to the right of your project name, type Genomics and click on Genomics API. If the API is not already enabled, click on Enable. This will take a couple of minutes to enable.
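
(Optional) Once Cloud Shell is open (Step 3 below), you can also enable the API from the command line:

gcloud services enable genomics.googleapis.com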

Step 2: Pick a zone

Click the navigation menu on the top-left and select IAM & admin | Quotas. Find a zone in which you have sufficient Compute Engine quota (at least 8 CPUs).

Step 3: Start CloudShell

Activate Google Cloud Shell

From the GCP Console click the Cloud Shell icon on the top right toolbar:

Then click "Start Cloud Shell":

It should only take a few moments to provision and connect to the environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on the Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this lab can be done with simply a browser or your Google Chromebook.

Once connected to the cloud shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

Run the following command in the cloud shell to confirm that you are authenticated:

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

You can list the project ID with this command:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set correctly, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Step 1: Clone repository

In CloudShell, type:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

Verify that the input VCF file exists:

gsutil ls gs://cloud-training-demos/genomics/rice_*.vcf

gsutil is the command-line tool for Google Cloud Storage; the ls command lists the objects (blobs) that match the pattern.
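
Optionally, you can peek at the first few header lines of the matching VCF file(s) to see what the raw format looks like:

gsutil cat gs://cloud-training-demos/genomics/rice_*.vcf | head -20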

Step 3

Verify that the output BigQuery dataset/table does NOT exist:

bq ls -d rice3k

bq is the command-line tool for BigQuery; here you are checking that a dataset named rice3k does not already exist.
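
You can also run bq ls with no arguments to list every dataset in the current project; rice3k should not be among them yet:

bq ls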

Step 4

Examine the script you are about to run:

cd training-data-analyst/quests/genomics/vcf-to-bigquery
cat run.sh

What parameter does the script expect? What parameters does the script hardcode? What does the script do?

______________________________

______________________________

______________________________

______________________________

Answer:

The script expects the zone in which to carry out the compute job as its only parameter. It hardcodes the input VCF file and the output BigQuery dataset and table. The script launches a genomics pipeline, passing in a YAML file (vcf_bq.yaml) and a set of inputs.
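
As a heavily simplified sketch (this is NOT the actual run.sh; the exact flag names and the INPUT_PATTERN/OUTPUT_TABLE input keys shown here are illustrative assumptions), the structure is roughly:

#!/bin/bash
# Illustrative sketch only; refer to the run.sh you printed above.
# The zone is the single expected parameter.
ZONE=$1

# The input VCF pattern and the output BigQuery table are hardcoded;
# the pipeline steps themselves are described in vcf_bq.yaml.
gcloud alpha genomics pipelines run \
  --pipeline-file vcf_bq.yaml \
  --zones "$ZONE" \
  --inputs "INPUT_PATTERN=gs://cloud-training-demos/genomics/rice_*.vcf,OUTPUT_TABLE=rice3k.ERS467753"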

Step 5

Examine the YAML file that specifies the actual genomics pipeline job:

cat vcf_bq.yaml

What technology is used to extract the VCF file, transform it, and load it into BigQuery?

___________________________________

Answer:

Note the --runner argument: the pipeline is executed on Cloud Dataflow, which runs Apache Beam pipelines. Cloud Dataflow is an excellent choice for autoscaled Extract-Transform-Load (ETL) jobs.
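
Once you launch the pipeline in Step 6, you should be able to see the Dataflow job it creates, either in the Dataflow section of the GCP console or from Cloud Shell (on older gcloud versions this command may live under gcloud beta):

gcloud dataflow jobs list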

Step 6

Execute the script that launches the pipeline to load the VCF file into BigQuery:

bash ./run.sh us-central1-c

If successful, you will see a line such as:

Running [operations/EJfz0Oq1LBjUmsP2hdaT7ugBIJzIj7SnEyoPcHJvZHVjdGlvblF1ZXVl].

Note down the operation id: __________________________________________________________

Step 1

Check the status of the job by typing:

gcloud alpha genomics operations describe <operation-id>

replacing <operation-id> in the command above with the operations/some_long_string value that you noted in the previous step.

The "events" part of the output will provide a list of tasks that the pipeline has executed so far.

Step 2

You can wait for the task to be done by setting the CMD variable to:

CMD="gcloud --format='value(done)' alpha genomics operations describe <operation-id>"

As before, make sure to replace <operation-id> in the command above by your actual operation id. Then, run the above command in a loop by executing:

while [[ $(eval ${CMD}) != "True" ]]; do echo "Sleeping for 30 seconds"; sleep 30; done

This will take 10-20 minutes to complete.

Step 3

When the job is complete, navigate to the BigQuery section of the GCP web console and examine the tables in a new dataset called rice3k that has been created.
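
If you prefer the command line, you can also confirm the new dataset and table from Cloud Shell:

bq ls rice3k
bq show rice3k.ERS467753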

Step 1

From the BigQuery section of the GCP console, click on the ERS467753 table. Click on the Schema, Details and Preview tabs and answer these questions:

  1. How many rows are in the table? __________________
  2. Each row in the table represents a string of reference bases. How is this string represented? _____________________
  3. How are the alternate bases represented? _______________________
  4. Give an example of a reference base in the dataset. ______________
  5. Give an example of an alternate base in the dataset _______________

Answers:

  1. From the Number of Rows in the "Details" tab: 3,554,345
  2. From the Schema tab: start_position and end_position represent the first position and one past the end of the string of reference bases, and reference_bases holds the actual characters.
  3. From the Schema tab: there is a list of alternate bases (note that it is a REPEATED field) associated with each row.
  4. Look in the preview tab. I saw letters like A, C, T and G.
  5. Look in the preview tab. I saw strings like AAT, ACC, AT, etc. What you see will vary because the preview displays a random shard of the data.
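
Before moving on, you can check the REPEATED nature of alternate_bases with a quick query that shows how many alternate bases each row has; this also previews the ARRAY_LENGTH function used in the next steps:

#standardsql
SELECT
  ARRAY_LENGTH(alternate_bases) AS num_alternates,
  COUNT(*) AS num_rows
FROM
  rice3k.ERS467753
GROUP BY num_alternates
ORDER BY num_alternates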

Step 2

In the BigQuery console, type this query:

#standardsql
SELECT
  start_position,
  reference_name,
  reference_bases AS original
FROM
  rice3k.ERS467753 AS v
LIMIT 10

What does the above query do? ________________________________

Answer:

It shows you 10 rows of reference bases.

Step 3

Add a WHERE clause to the above query to filter on rows that have exactly one alternate base and add this base to the SELECT. Hint: alternate_bases is an array, so you will need ARRAY_LENGTH.

Answer:

#standardsql
SELECT
  start_position,
  reference_name,
  reference_bases AS original,
  alternate_bases[ORDINAL(1)].alt AS changed
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
LIMIT 10

The result set might look like this:

Step 4

Extend the WHERE clause so that the "changed" column contains only a single letter that is A, C, G, or T. Hint: use the IN operator to compare the alternate base against a list of letters.

Answer:

#standardsql
SELECT
  start_position,
  reference_name,
  reference_bases AS original,
  alternate_bases[ORDINAL(1)].alt AS changed
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
  AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
LIMIT 10

Step 5

Modify the SELECT part of the statement so that you get a result set that looks like this:

Hint: Use the CONCAT string function.

Answer:

#standardsql
SELECT
  start_position,
  reference_name,
  CONCAT( reference_bases, '->', alternate_bases[ORDINAL(1)].alt) AS mutation
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
  AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
LIMIT 10

Step 6

Find the five most frequent mutations. You want a result set that looks like this:

Hint: You will need a subquery or WITH statement, and you will need to group by MUTATION.

Answer:

#standardsql
WITH mutations AS (
SELECT
  start_position,
  reference_name,
  CONCAT( reference_bases, '->', alternate_bases[ORDINAL(1)].alt) AS mutation
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
  AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
)

SELECT
  mutation,
  COUNT(mutation) AS num_mutations
FROM
  mutations
GROUP BY mutation
ORDER BY num_mutations DESC
LIMIT 5

Step 1

In the BigQuery console, type this query:

#standardsql
SELECT sample, ref, bin, SUM(n) AS numvariants
FROM (
SELECT
call.name AS sample, reference_name AS ref, FLOOR(start_position/1000000) AS bin, 1 AS n
FROM `rice3k.ERS467753` 
JOIN UNNEST(call) AS call
JOIN UNNEST(alternate_bases) AS alt
WHERE alt.alt != '<*>'
) AS x
GROUP BY sample, ref, bin
ORDER BY numvariants DESC
LIMIT 10

What does the above query do? ________________________________

Answer:

It counts variants per 1-megabase bin, per sample and per chromosome (reference_name), and ranks the bins in descending order of the number of variants. The top entries in the result are variant "hotspots".

Step 2

We'd like to see if there are regions in the genome that are more likely than others to acquire variants (perhaps because those regions are under high natural selective pressure). In order to do that, we need more than just one sample; we need a lot of them. We went through the process described in this lab for 3,000 rice genomes and saved the results into a public BigQuery dataset. Run the same query as above, but change the table being queried to:

rice-3k.deepvariant_vcf_load8.Os_Nipponbare_Reference_IRGSP_1_0


In the above, rice-3k is the project name, deepvariant_vcf_load8 is the dataset name, and Os_Nipponbare_Reference_IRGSP_1_0 is the table name.
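
For reference, here is the query from Step 1 with only the table changed:

#standardsql
SELECT sample, ref, bin, SUM(n) AS numvariants
FROM (
SELECT
call.name AS sample, reference_name AS ref, FLOOR(start_position/1000000) AS bin, 1 AS n
FROM `rice-3k.deepvariant_vcf_load8.Os_Nipponbare_Reference_IRGSP_1_0`
JOIN UNNEST(call) AS call
JOIN UNNEST(alternate_bases) AS alt
WHERE alt.alt != '<*>'
) AS x
GROUP BY sample, ref, bin
ORDER BY numvariants DESC
LIMIT 10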

Which megabase bin is under the greatest natural selective pressure? ______________________________

Step 3

Let's modify the query above so that we find samples where the genomic variation is unusual -- in other words, if all 3,000 rice genomes tend to have a hotspot in some megabase, we are not as interested as when the hotspot exists in only one individual. A common measure of how unusual a value is relative to the rest of the data is the z-score, z = (x - mean) / standard deviation. Assuming that the distribution is normal, a z-score with a high magnitude corresponds to a value in the tails of the distribution:

High magnitudes of the z-score are associated with samples in the tails of the distribution. Image from Wikipedia.

Here's a query that finds such samples:

#standardsql
WITH ind AS (
  -- count variants for each sample/ref/bin
  SELECT
  call.name AS sample, reference_name AS ref, FLOOR(start_position/1000000) AS bin, COUNT(call.name) AS n
  FROM `rice-3k.deepvariant_vcf_load8.Os_Nipponbare_Reference_IRGSP_1_0` 
  JOIN UNNEST(call) AS call
  JOIN UNNEST(alternate_bases) AS alt
  WHERE alt.alt != '<*>' AND call.name LIKE '%ERS467%'
  GROUP BY sample, ref, bin
),

pop AS (
  -- mean and standard deviation over all samples in each ref/bin
  SELECT ref, bin, AVG(n) AS pop_mu, STDDEV(n) AS pop_sigma
  FROM ind
  GROUP BY ref, bin
),

zscore AS (
  SELECT 
    ind.sample, 
    ind.n AS ind_n,
    (ind.n-pop.pop_mu)/pop.pop_sigma AS z, 
    pop.ref, 
    pop.bin, 
    pop.pop_mu, 
    pop.pop_sigma
  FROM pop, ind
  WHERE ind.ref = pop.ref AND ind.bin = pop.bin
)

SELECT * from zscore
ORDER BY ABS(Z) DESC

If you are not continuing to the next lab in the quest, navigate to the GCP Console and delete the following resources: