In this lab, you load a VCF file into BigQuery and analyze the genetic variants it contains. The VCF file you use is the output of the previous lab in this quest, in which variant calling from BAM records was carried out using DeepVariant.
In this lab, you learn how to:
In the search bar to the right of your project name, type Genomics and click on Genomics API. If the API is not already enabled, click on Enable. This will take a couple of minutes to enable.
Click the navigation menu on the top-left and select IAM & admin | Quotas. Find a zone in which you have sufficient Compute Engine quota (at least 8 CPUs).
From the GCP Console click the Cloud Shell icon on the top right toolbar:
Then click "Start Cloud Shell":
It should only take a few moments to provision and connect to the environment:
This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on the Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this lab can be done with simply a browser or your Google Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.
Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
Run the following command to confirm that the project is set to your project ID:
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
In Cloud Shell, type:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
Verify that the input VCF file exists:
gsutil ls gs://cloud-training-demos/genomics/rice_*.vcf
gsutil is the command-line tool for Google Cloud Storage; here it lists the objects (blobs) matching the wildcard pattern.
Verify that the output BigQuery dataset/table does NOT exist:
bq ls -d rice3k
bq is the command-line tool for BigQuery; the -d flag lists datasets, and the rice3k dataset should not be found yet.
Examine the script you are about to run:
cd training-data-analyst/quests/genomics/vcf-to-bigquery
cat run.sh
What parameter does the script expect? What parameters does the script hardcode? What does the script do?
______________________________
______________________________
______________________________
______________________________
Answer:
The script expects a single parameter: the zone in which to carry out the compute job. It hardcodes the input VCF file and the output BigQuery dataset and table. The script launches a genomics pipeline, passing in a YAML file (vcf_bq.yaml) and a set of inputs.
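As a rough sketch only (the variable names, input names, and flags below are assumptions, not the verbatim contents of run.sh), a script matching that description might look like this:

#!/bin/bash
# Hypothetical sketch -- not the actual run.sh from the repository.
ZONE=$1   # the single expected parameter: the Compute Engine zone to run in

# Launch the pipeline described in vcf_bq.yaml, with the input VCF and the
# output BigQuery dataset/table hardcoded. The flags shown are approximate
# Genomics Pipelines API options; <YOUR-BUCKET> is a placeholder.
gcloud alpha genomics pipelines run \
  --pipeline-file vcf_bq.yaml \
  --zones "$ZONE" \
  --logging gs://<YOUR-BUCKET>/logs \
  --inputs "INPUT_VCF=gs://cloud-training-demos/genomics/rice_*.vcf,OUTPUT_TABLE=rice3k.ERS467753"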
Examine the YAML file that specifies the actual genomics pipeline job:
cat vcf_bq.yaml
What technology is used to extract the VCF file, transform it, and load it into BigQuery?
___________________________________
Answer:
Note the --runner argument. Cloud Dataflow is an executor for Apache Beam pipelines. Cloud Dataflow is an excellent choice for autoscaled Extract-Transform-Load (ETL) jobs.
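For illustration only (the module name and the input/output flag names below are hypothetical; the actual command is defined inside vcf_bq.yaml), selecting the Dataflow runner for a Beam pipeline typically looks like this on the command line:

# Hypothetical example of launching a Beam pipeline on Cloud Dataflow.
# --runner, --project, and --temp_location are standard Beam pipeline options;
# the module name and the --input_pattern/--output_table flags are assumptions.
python -m my_vcf_to_bigquery_pipeline \
  --input_pattern gs://cloud-training-demos/genomics/rice_*.vcf \
  --output_table <PROJECT_ID>:rice3k.ERS467753 \
  --runner DataflowRunner \
  --project <PROJECT_ID> \
  --temp_location gs://<YOUR-BUCKET>/tmp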
Execute the script that launches the pipeline to load the VCF file into BigQuery:
bash ./run.sh us-central1-c
If successful, you will see a line such as:
Running [operations/EJfz0Oq1LBjUmsP2hdaT7ugBIJzIj7SnEyoPcHJvZHVjdGlvblF1ZXVl]
Note down the operation id: __________________________________________________________
Check the status of the job by typing:
gcloud alpha genomics operations describe <operation-id>
replacing <operation-id> in the command above with the operations/some_long_string value that you noted in the previous step.
The "events" part of the output will provide a list of tasks that the pipeline has executed so far.
You can wait for the task to be done by setting the CMD variable to:
CMD="gcloud --format='value(done)' alpha genomics operations describe <operation-id>"
As before, make sure to replace <operation-id> in the command above by your actual operation id. Then, run the above command in a loop by executing:
while [[ $(eval ${CMD}) != "True" ]]; do echo "Sleeping for 30 seconds"; sleep 30; done
This will take 10-20 minutes to complete.
When the job is complete, navigate to the BigQuery section of the GCP web console and examine the tables in the new dataset called rice3k that has been created.
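(Optional, not part of the original lab steps) You can also inspect the new dataset directly from Cloud Shell with the bq tool:

bq ls rice3k                        # list the tables in the dataset
bq show --schema rice3k.ERS467753   # print the table schema
bq show rice3k.ERS467753            # table details, including row count and size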
From the BigQuery section of the GCP console, click on the ERS467753 table. Click on the Schema, Details and Preview tabs and answer these questions:
Answers:
In the BigQuery console, type this query:
#standardsql
SELECT
  start_position,
  reference_name,
  reference_bases AS original
FROM
  rice3k.ERS467753 AS v
LIMIT 10
What does the above query do? ________________________________
Answer:
It retrieves 10 variant records, showing the start position, the chromosome (reference_name), and the reference base(s) at that position.
Add a WHERE clause to the above query to filter on rows that have exactly one alternate base and add this base to the SELECT. Hint: alternate_bases is an array, so you will need ARRAY_LENGTH.
Answer:
#standardsql
SELECT
  start_position,
  reference_name,
  reference_bases AS original,
  alternate_bases[ORDINAL(1)].alt AS changed
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
LIMIT 10
The result set might look like this:
Add a WHERE clause so that the "changed" column only contains a single letter that is A, C, G or T. Hint: use the IN operator to test the alternate base against a list of values.
Answer:
#standardsql
SELECT
  start_position,
  reference_name,
  reference_bases AS original,
  alternate_bases[ORDINAL(1)].alt AS changed
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
  AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
LIMIT 10
Modify the SELECT part of the statement so that you get a result set that looks like this:
Hint: Use the CONCAT string function.
Answer:
#standardsql
SELECT
  start_position,
  reference_name,
  CONCAT(reference_bases, '->', alternate_bases[ORDINAL(1)].alt) AS mutation
FROM
  rice3k.ERS467753 AS v
WHERE
  ARRAY_LENGTH(alternate_bases) = 1
  AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
LIMIT 10
Find the five most frequent mutations. You want a result set that looks like this:
Hint: You will need a subquery or WITH statement, and you will need to group by MUTATION.
Answer:
#standardsql
WITH mutations AS (
  SELECT
    start_position,
    reference_name,
    CONCAT(reference_bases, '->', alternate_bases[ORDINAL(1)].alt) AS mutation
  FROM
    rice3k.ERS467753 AS v
  WHERE
    ARRAY_LENGTH(alternate_bases) = 1
    AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
)
SELECT
  mutation,
  COUNT(mutation) AS num_mutations
FROM mutations
GROUP BY mutation
ORDER BY num_mutations DESC
LIMIT 5
In the BigQuery console, type this query:
#standardsql
SELECT
  sample,
  ref,
  bin,
  SUM(n) AS numvariants
FROM (
  SELECT
    call.name AS sample,
    reference_name AS ref,
    FLOOR(start_position/1000000) AS bin,
    1 AS n
  FROM
    `rice3k.ERS467753`
  JOIN UNNEST(call) AS call
  JOIN UNNEST(alternate_bases) AS alt
  WHERE alt.alt != '<*>'
) AS x
GROUP BY sample, ref, bin
ORDER BY numvariants DESC
LIMIT 10
What does the above query do? ________________________________
Answer:
It counts variants per 1-megabase bin, per sample, per chromosome, and ranks the bins in descending order of the number of variants. The top entries in the result are variant "hotspots".
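To see how the binning works, note that FLOOR(start_position/1000000) maps every position to the 1-megabase window it falls in. A quick illustration (the positions below are made up for the example):

#standardsql
SELECT
  FLOOR(24350123/1000000) AS bin_a,  -- 24.0: positions 24,000,000-24,999,999 all share bin 24
  FLOOR(  999999/1000000) AS bin_b   -- 0.0: the first megabase of a chromosome is bin 0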
We'd like to see whether there are regions in the genome that are more likely than others to acquire variants (perhaps because those regions are under high natural selective pressure). To do that, we need more than just one sample; we need a lot of them. We ran the process described in this lab on 3,000 rice genomes and saved the results into a public BigQuery dataset. Run the same query as above, but change the table being queried to:
rice-3k.deepvariant_vcf_load8.Os_Nipponbare_Reference_IRGSP_1_0
In the above, rice-3k is the project name, deepvariant_vcf_load8 is the dataset name, and Os_Nipponbare_Reference_IRGSP_1_0 is the table name.
Which megabase bin is under the greatest natural selective pressure? ______________________________
Let's modify the query above so that we find samples where the genomic variation is unusual -- in other words, if all 3,000 rice genomes tend to have a hotspot in some megabase bin, we are not as interested as when the hotspot exists only in one individual. The z-score is one measure of how unusual a value is -- assuming that the distribution is normal, a z-score with a high magnitude corresponds to a value in the tails of the distribution:
High magnitudes of the z-score are associated with samples in the tails of the distribution. Image from Wikipedia.
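For reference (this formula is what the query below computes rather than something stated explicitly in the lab), the z-score of a sample's variant count n in a given ref/bin is:

z = (n - pop_mu) / pop_sigma

where pop_mu and pop_sigma are the mean and standard deviation of the variant counts across all samples in that same ref/bin.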
Here's a query that finds such samples:
#standardsql
WITH ind AS (
  -- count variants for each sample/ref/bin
  SELECT
    call.name AS sample,
    reference_name AS ref,
    FLOOR(start_position/1000000) AS bin,
    COUNT(call.name) AS n
  FROM
    `rice-3k.deepvariant_vcf_load8.Os_Nipponbare_Reference_IRGSP_1_0`
  JOIN UNNEST(call) AS call
  JOIN UNNEST(alternate_bases) AS alt
  WHERE
    alt.alt != '<*>'
    AND call.name LIKE '%ERS467%'
  GROUP BY sample, ref, bin
),
pop AS (
  -- population statistics across all samples in each ref/bin
  SELECT
    ref,
    bin,
    AVG(n) AS pop_mu,
    STDDEV(n) AS pop_sigma
  FROM ind
  GROUP BY ref, bin
),
zscore AS (
  SELECT
    ind.sample,
    ind.n AS ind_n,
    (ind.n - pop.pop_mu)/pop.pop_sigma AS z,
    pop.ref,
    pop.bin,
    pop.pop_mu,
    pop.pop_sigma
  FROM pop, ind
  WHERE
    ind.ref = pop.ref
    AND ind.bin = pop.bin
)
SELECT * FROM zscore
ORDER BY ABS(z) DESC
If you are not continuing to the next lab in the quest, navigate to the GCP Console and delete the following resources:
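The full list of resources is not reproduced here, but one resource created in this lab is the rice3k BigQuery dataset, which can also be removed from Cloud Shell:

bq rm -r -f rice3k   # -r removes the tables the dataset contains, -f skips the confirmation prompt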