In this lab, you write a pipeline to call variants from BAM records using DeepVariant. DeepVariant is a variant caller that uses a deep neural network to identify genetic variants in germline genomes. Originally developed by Google Brain and Verily Life Sciences, DeepVariant won the award for Highest SNP Performance in the 2016 PrecisionFDA Truth Challenge. Since then, the Google Brain team has reduced its error rate by more than 50%.

The output of the variant caller pipeline is a VCF file that can be loaded into BigQuery for further analysis. This is covered in the next lab of this quest.

What you learn

In this lab, you learn how to:

  1. Identify the inputs and outputs of a variant caller.
  2. Launch a DeepVariant pipeline on sequencing data stored in Google Cloud Storage.
  3. Monitor the variant-calling job and locate the output VCF file.

The DNA sequences of two individuals in a population will be different because of crossover between parents' genes and because of random mutations. For a wide variety of applications, it has become important to know an individual's genetic composition (or genotype). But do we need to store the whole sequence?

It turns out that, instead of storing the whole sequence, we can simply store the difference between an individual's DNA sequence and a reference individual, as illustrated in this image from the European Bioinformatics Institute:

In the image, the gray lines are "reads" from the sequencing instrument. The red letters indicate the places where the reads differ from the reference DNA sequence (depicted in color). These differences are called variants, and they are stored in a tab-delimited text file called a Variant Call Format (VCF) file.
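To make this concrete, here is a minimal sketch of what the records in a VCF file look like; the chromosome names, positions, alleles, and quality scores below are made up for illustration:

##fileformat=VCFv4.2
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO
chr1    10177  .   A    AC   40    PASS    .
chr1    13116  .   T    G    52    PASS    .

Each data row describes one variant: the first row records an insertion (the individual's reads show AC where the reference has A), and the second records a single-letter substitution (G instead of T). In a real file, the columns are separated by tabs.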

Identifying variants requires a sophisticated algorithm such as GATK or DeepVariant. For one thing, the reads from the instrument have to be aligned to the reference DNA sequence, and the reads may not exactly match any part of the reference precisely because of variants. Even though the image above shows differences of just one letter (a single nucleotide polymorphism, or SNP), variants can be more complex -- entire subsequences might be deleted, inserted, inverted, duplicated, or duplicated repeatedly:

Image from European Bioinformatics Institute.

Variants are significant, not just because they allow more efficient storage, but because genetic differences between individuals can lead to differences in an individual's phenotype (such as their height or risk of developing a disease).

Variant calling is the process of going from sequences (typically stored in a BAM file) to identified variants between an individual and a reference genome (typically stored in a VCF file). Once we have the variants, it is possible to analyze them to look for specific genetic markers.

Test your understanding:

  1. DeepVariant is an algorithm to do _________________
  2. There are _______ inputs to a variant caller.
  3. The inputs to a variant caller are _______________ and _________________.
  4. The output of a variant caller is __________________

Answers to "Test your understanding":

  1. Variant calling.
  2. Two. (see below)
  3. A reference genome and sequence reads from an individual's tissue. The reference could belong to the species or to the same individual organism (such as from the organism a few years prior, or from a part of the body unaffected by disease).
  4. A VCF file containing the variants between the reference genome and the individual's genotype.

Step 1: Enable Genomics API

In the search bar to the right of your project name, type Genomics and click on Genomics API. If the API is not already enabled, click Enable; enabling takes a couple of minutes.
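Alternatively, you can enable the API from the command line once Cloud Shell is running (this assumes your account has permission to enable services on the project):

gcloud services enable genomics.googleapis.com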

Step 2: Pick a zone

Select a compute zone that offers NVIDIA K80 GPUs from the list at https://cloud.google.com/compute/docs/gpus/.
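You can also list the zones that offer K80 GPUs directly from the command line:

gcloud compute accelerator-types list --filter="name:nvidia-tesla-k80"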

Step 3: Verify quotas

Click the navigation menu on the top-left and select IAM & admin | Quotas. Change the filter to include only the Compute Engine API and your zone. Then, ensure that you have quota for at least 16 CPUs and 16 K80 GPUs.
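If you prefer the command line, the same quotas appear in the region's description; for example, for the us-central1 region (substitute the region that contains your chosen zone):

gcloud compute regions describe us-central1 | grep -B1 -A1 -E "CPUS|K80"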

Step 4: Start Cloud Shell

Activate Google Cloud Shell

From the GCP Console click the Cloud Shell icon on the top right toolbar:

Then click "Start Cloud Shell":

It should only take a few moments to provision and connect to the environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on the Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this lab can be done with simply a browser or your Google Chromebook.

Once connected to the cloud shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

Run the following command in the cloud shell to confirm that you are authenticated:

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
Run the following command to confirm that the project is set correctly:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Step 1: Clone repository

In Cloud Shell, type:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

Verify that the reference genome file we will use is present on Google Cloud Storage:

gsutil ls -l gs://rice-3k/reference/Os-Nipponbare-Reference-IRGSP-1.0/Os-Nipponbare-Reference-IRGSP-1.0.fa

Os-Nipponbare-Reference-IRGSP-1.0 is the current version of the standard rice reference genome.
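If you are curious, you can stream just the beginning of the FASTA file instead of downloading the whole thing; you should see a header line starting with > followed by sequence data:

gsutil cat -r 0-300 gs://rice-3k/reference/Os-Nipponbare-Reference-IRGSP-1.0/Os-Nipponbare-Reference-IRGSP-1.0.fa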

Step 3

Verify that the sequence BAM file we will use is present on Google Cloud Storage:

gsutil ls -l gs://rice-3k/PRJEB6180/aligned-Os-Nipponbare-Reference-IRGSP-1.0/ERS467753.bam

This sequence, from the Beijing Genomics Institute, is from the same strain of rice as the reference genome.
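To see fuller metadata for the object, such as its update time and checksums, add the -L flag:

gsutil ls -L gs://rice-3k/PRJEB6180/aligned-Os-Nipponbare-Reference-IRGSP-1.0/ERS467753.bam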

Step 4

Move to the directory containing the scripts you will need for this lab:

cd training-data-analyst/courses/data_analysis/genomics/deepvariant-rice

Step 5

Examine the script that launches the program to do variant calling:

cat run_dv.sh

What parameters does the script expect?

______________________________

______________________________

______________________________

______________________________

______________________________

Why do you think the script needs these parameters? Where will the output file be located?

Answer:

The script asks for a zone, a YAML configuration file, a name for the output (in Step 6 this is ERS467753, the sample's accession), the reference genome, and the BAM file. The zone is where to carry out the computation to find variants. The reference genome and BAM file are the inputs to the variant caller. The YAML file contains parameters for the variant caller, such as the number of workers (CPUs/GPUs) to use in each stage of the process.

The bucket is for storing the output VCF file and intermediate logs. The script creates a bucket name by concatenating the project name, the YAML prefix, and a timestamp.
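For reference, a simplified sketch of the pattern such a launcher typically follows is shown below. This is illustrative only -- the variable names and pipeline flags here are not copied from run_dv.sh, so consult the actual script for the details:

#!/bin/bash
ZONE=$1; YAML=$2; NAME=$3; REF=$4; BAM=$5
PROJECT=$(gcloud config get-value project)
# Bucket name built from the project, the YAML prefix, and a timestamp.
BUCKET=gs://${PROJECT}-$(basename ${YAML} .yaml)-$(date +%s)
gsutil mb -l ${ZONE%-*} ${BUCKET}   # create the output bucket in the zone's region
# Launch the pipeline described by the YAML file.
gcloud alpha genomics pipelines run \
    --pipeline-file ${YAML} \
    --zones ${ZONE} \
    --logging ${BUCKET}/logs \
    --inputs NAME=${NAME},REF=${REF},BAM=${BAM} \
    --outputs OUTPUT=${BUCKET}/output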

Step 6

Execute the script that launches the program to do variant calling:

bash ./run_dv.sh \
       us-central1-c  \
       ./dv_rice.yaml \
       ERS467753 \
       gs://rice-3k/reference/Os-Nipponbare-Reference-IRGSP-1.0/Os-Nipponbare-Reference-IRGSP-1.0.fa \
       gs://rice-3k/tmp/ERS467753.bam

If successful, you will see a line such as:

Running [operations/EJfz0Oq1LBjUmsP2hdaT7ugBIJzIj7SnEyoPcHJvZHVjdGlvblF1ZXVl].

Note down the operation id: __________________________________________________________
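To avoid retyping the id in the following steps, you can store it in a shell variable, substituting your own value:

OPERATION_ID=operations/EJfz0Oq1LBjUmsP2hdaT7ugBIJzIj7SnEyoPcHJvZHVjdGlvblF1ZXVl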

Step 1

Check the status of the job by typing:

gcloud alpha genomics operations describe <operation-id>

replacing <operation-id> in the command above with the operations/some_long_string value that you noted in the previous step.

The "events" part of the output will provide a list of tasks that the pipeline has executed so far.

Step 2

You can wait for the task to be done by setting the CMD variable to:

CMD="gcloud --format='value(done)' alpha genomics operations describe <operation-id>"

As before, make sure to replace <operation-id> in the command above with your actual operation id. Then, run the command in a loop by executing:

while [[ $(eval ${CMD}) != "True" ]]; do echo "Sleeping for 30 seconds"; sleep 30; done

This will take 20-30 minutes to complete.

Step 3

In the Google Cloud Platform web console, navigate to the Storage browser and examine the files in your bucket. When the job is complete, you should see a .vcf file there. The VCF file can be loaded into BigQuery for further analysis. This is covered in the next lab of this quest.
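You can also locate the VCF file from Cloud Shell by listing the bucket recursively; replace the placeholder below with the name of the bucket that the launch script created:

gsutil ls -r gs://<your-bucket> | grep -i vcf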

If you are not continuing to the next lab in the quest, navigate to the GCP Console and delete the Cloud Storage bucket that the launch script created; it contains the output VCF file and the intermediate logs.
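From Cloud Shell, the equivalent cleanup command is shown below. This permanently deletes the bucket and everything in it, so double-check the bucket name (the placeholder is hypothetical) before running it:

gsutil rm -r gs://<your-bucket>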