Using BigQuery with C#

1. Overview

BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. BigQuery is NoOps: there is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of our pay-as-you-go model.

In this codelab, you will use Google Cloud Client Libraries for .NET to query BigQuery public datasets with C#.

What you'll learn

How to use the Cloud Shell
How to enable the BigQuery API
How to authenticate API requests
How to install the Google Cloud client library for C#
How to query the works of Shakespeare
How to query the GitHub dataset
How to adjust caching and display statistics

What you'll need

A Google Cloud Platform Project
A browser, such as Chrome or Firefox
Familiarity using C#

2. Setup and Requirements

Self-paced environment setup

Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
The Project ID is unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one. Alternatively, you can try your own and see if it's available. Once this step is done, the ID stays fixed for the lifetime of the project.
For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.

Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

From the Cloud Console, click Activate Cloud Shell.

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you see that screen, click Continue.

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Enable the BigQuery API

The BigQuery API should be enabled by default in all Google Cloud projects. You can check whether this is true with the following command in Cloud Shell:

gcloud services list

You should see BigQuery listed:

NAME                              TITLE
bigquery-json.googleapis.com      BigQuery API
...

If the BigQuery API is not enabled, you can use the following command in Cloud Shell to enable it:

gcloud services enable bigquery-json.googleapis.com

4. Install the BigQuery client library for C#

First, create a simple C# console application that you will use to run BigQuery API samples.

dotnet new console -n BigQueryDemo

You should see the application created and dependencies resolved:

The template "Console Application" was created successfully.
Processing post-creation actions...
...
Restore succeeded.

Next, navigate to the BigQueryDemo folder:

cd BigQueryDemo

Then add the Google.Cloud.BigQuery.V2 NuGet package to the project:

dotnet add package Google.Cloud.BigQuery.V2

info : Adding PackageReference for package 'Google.Cloud.BigQuery.V2' into project '/home/atameldev/BigQueryDemo/BigQueryDemo.csproj'.
log  : Restoring packages for /home/atameldev/BigQueryDemo/BigQueryDemo.csproj...
...
info : PackageReference for package 'Google.Cloud.BigQuery.V2' version '1.2.0' added to file '/home/atameldev/BigQueryDemo/BigQueryDemo.csproj'.

Now you're ready to use the BigQuery API!

5. Query the works of Shakespeare

A public dataset is any dataset that is stored in BigQuery and made available to the general public. There are many other public datasets available for you to query, some of which are also hosted by Google, but many more that are hosted by third parties. You can read more on the Public Datasets page.

In addition to the public datasets, BigQuery provides a limited number of sample tables that you can query. These tables are contained in the bigquery-public-data:samples dataset. One of those tables is called shakespeare. It contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.

In this step, you will query the Shakespeare table.

First, open the code editor from the top-right side of Cloud Shell.

Navigate to the Program.cs file inside the BigQueryDemo folder and replace the code with the following. Make sure you replace projectId with your actual project ID:

using System;
using Google.Cloud.BigQuery.V2;

namespace BigQueryDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            var client = BigQueryClient.Create("projectId");
            var table = client.GetTable("bigquery-public-data", "samples", "shakespeare");
            var sql = $"SELECT corpus AS title, COUNT(word) AS unique_words FROM {table} GROUP BY title ORDER BY unique_words DESC LIMIT 10";

            var results = client.ExecuteQuery(sql, parameters: null);

            foreach (var row in results)
            {
                Console.WriteLine($"{row["title"]}: {row["unique_words"]}");
            }
        }
    }
}

Take a minute or two to study the code and see how the table is being queried.

Back in Cloud Shell, run the app:

dotnet run

You should see a list of works and the number of distinct words in each:

hamlet: 5318
kinghenryv: 5104
cymbeline: 4875
troilusandcressida: 4795
kinglear: 4784
kingrichardiii: 4713
2kinghenryvi: 4683
...
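
If you want to filter the Shakespeare table by a value supplied at run time, a parameterized query avoids building SQL by string concatenation. The following is a minimal sketch, not part of the original codelab: the BuildParameters helper, the parameter names @corpus and @min_count, and the values "hamlet" and 100 are illustrative assumptions. It relies on the word and word_count columns of the shakespeare table.

```csharp
using System;
using Google.Cloud.BigQuery.V2;

namespace BigQueryDemo
{
    public class Program
    {
        // Hypothetical helper: builds the named parameters that the SQL
        // below refers to as @corpus and @min_count.
        public static BigQueryParameter[] BuildParameters(string corpus, long minCount)
        {
            return new[]
            {
                new BigQueryParameter("corpus", BigQueryDbType.String, corpus),
                new BigQueryParameter("min_count", BigQueryDbType.Int64, minCount)
            };
        }

        static void Main(string[] args)
        {
            // Replace projectId with your actual project ID.
            var client = BigQueryClient.Create("projectId");
            var table = client.GetTable("bigquery-public-data", "samples", "shakespeare");

            // Values are passed separately from the SQL text.
            var sql = $"SELECT word, word_count FROM {table} " +
                      "WHERE corpus = @corpus AND word_count >= @min_count " +
                      "ORDER BY word_count DESC LIMIT 10";

            // Illustrative values only.
            var results = client.ExecuteQuery(sql, BuildParameters("hamlet", 100));

            foreach (var row in results)
            {
                Console.WriteLine($"{row["word"]}: {row["word_count"]}");
            }
        }
    }
}
```

Keeping the values out of the SQL string means the query text stays the same from run to run while the inputs change.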

6. Query the GitHub dataset

To get more familiar with BigQuery, you'll now issue a query against the GitHub public dataset. You will find the most common commit messages on GitHub. You'll also use BigQuery's web console to preview and run ad-hoc queries.

To see what the data looks like, open the GitHub dataset in the BigQuery web UI:

https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=github_repos&t=commits&page=table

To get a quick preview of the data, use the Preview button.

Navigate to the Program.cs file inside the BigQueryDemo folder and replace the code with the following. Make sure you replace projectId with your actual project ID:

using System;
using Google.Cloud.BigQuery.V2;

namespace BigQueryDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            var client = BigQueryClient.Create("projectId");
            var table = client.GetTable("bigquery-public-data", "github_repos", "commits");
            
            var sql = $"SELECT subject AS subject, COUNT(*) AS num_duplicates FROM {table} GROUP BY subject ORDER BY num_duplicates DESC LIMIT 10";

            var results = client.ExecuteQuery(sql, parameters: null);

            foreach (var row in results)
            {
                Console.WriteLine($"{row["subject"]}: {row["num_duplicates"]}");
            }
        }
    }
}

Take a minute or two to study the code and see how the table is being queried for the most common commit messages.

Back in Cloud Shell, run the app:

dotnet run

You should see a list of commit messages and their occurrences:

Update README.md: 2509242
: 1971725
Initial commit: 1942149
Mirroring from Micro.blog.: 838586
update: 575188
Update data.json: 548651
Update data.js: 548339
Add files via upload: 379941
*** empty log message ***: 358528
Can't you see I'm updating the time?: 286863
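
The samples in this codelab hard-code the project ID as a string. As an alternative, the following sketch reads it from an environment variable and falls back to a placeholder. It assumes that GOOGLE_CLOUD_PROJECT is set in your shell, which is typically the case in Cloud Shell but is not guaranteed elsewhere. The ProjectIdResolver class is a hypothetical helper, not part of the client library.

```csharp
using System;

namespace BigQueryDemo
{
    // Hypothetical helper, not part of Google.Cloud.BigQuery.V2.
    public static class ProjectIdResolver
    {
        // Returns the environment value when it is set and non-blank,
        // otherwise the supplied fallback.
        public static string Resolve(string fromEnvironment, string fallback)
        {
            return string.IsNullOrWhiteSpace(fromEnvironment)
                ? fallback
                : fromEnvironment.Trim();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Assumption: GOOGLE_CLOUD_PROJECT holds the active project ID.
            var fromEnv = Environment.GetEnvironmentVariable("GOOGLE_CLOUD_PROJECT");
            var projectId = ProjectIdResolver.Resolve(fromEnv, "projectId");

            Console.WriteLine($"Using project: {projectId}");
            // Pass projectId to BigQueryClient.Create(projectId) as in the samples.
        }
    }
}
```

If the variable is not set, the program prints the placeholder, which is a hint that you still need to supply a real project ID.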

7. Caching and statistics

After the initial query, BigQuery caches the results, so subsequent runs of the same query take much less time. You can disable caching with query options. BigQuery also keeps track of statistics about each query, such as creation time, end time, and total bytes processed.

In this step, you will disable caching and also display some stats about the queries.

Navigate to the Program.cs file inside the BigQueryDemo folder and replace the code with the following. Make sure you replace projectId with your actual project ID:

using System;
using Google.Cloud.BigQuery.V2;

namespace BigQueryDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            var client = BigQueryClient.Create("projectId");
            var table = client.GetTable("bigquery-public-data", "github_repos", "commits");
            
            var sql = $"SELECT subject AS subject, COUNT(*) AS num_duplicates FROM {table} GROUP BY subject ORDER BY num_duplicates DESC LIMIT 10";
            var queryOptions = new QueryOptions {
                UseQueryCache = false
            };

            var results = client.ExecuteQuery(sql, parameters: null, queryOptions: queryOptions);

            foreach (var row in results)
            {
                Console.WriteLine($"{row["subject"]}: {row["num_duplicates"]}");
            }

            var job = client.GetJob(results.JobReference);
            var stats = job.Statistics;
            Console.WriteLine("----------");
            Console.WriteLine($"Creation time: {stats.CreationTime}");
            Console.WriteLine($"End time: {stats.EndTime}");
            Console.WriteLine($"Total bytes processed: {stats.TotalBytesProcessed}");
        }
    }
}

A couple of things to note about the code. First, caching is disabled by introducing query options and setting UseQueryCache to false. Second, you accessed the statistics about the query from the job object.

Back in Cloud Shell, run the app:

dotnet run

As before, you should see a list of commit messages and their occurrences. In addition, you should see some statistics about the query at the end:

Update README.md: 2509242
: 1971725
Initial commit: 1942149
Mirroring from Micro.blog.: 838586
update: 575188
Update data.json: 548651
Update data.js: 548339
Add files via upload: 379941
*** empty log message ***: 358528
Can't you see I'm updating the time?: 286863
----------
Creation time: 1533052057398
End time: 1533052066961
Total bytes processed: 9944197093
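
The raw statistics are hard to read at a glance. The following sketch turns them into friendlier values, assuming (as the magnitudes in the output above suggest) that the creation and end times are milliseconds since the Unix epoch. The StatsFormatter class is a hypothetical helper, and the numbers passed to it are the ones from the sample output, not live values.

```csharp
using System;
using System.Globalization;

namespace BigQueryDemo
{
    // Hypothetical helper for presenting job statistics.
    public static class StatsFormatter
    {
        // Converts milliseconds since the Unix epoch to a UTC timestamp.
        public static DateTimeOffset FromEpochMillis(long millis)
        {
            return DateTimeOffset.FromUnixTimeMilliseconds(millis);
        }

        // Elapsed time between two epoch-millisecond values, in seconds.
        public static string ElapsedSeconds(long startMillis, long endMillis)
        {
            var elapsed = FromEpochMillis(endMillis) - FromEpochMillis(startMillis);
            return elapsed.TotalSeconds.ToString("F3", CultureInfo.InvariantCulture);
        }

        // Byte count expressed in GiB (1024^3 bytes).
        public static string ToGiB(long bytes)
        {
            var gib = bytes / (1024.0 * 1024.0 * 1024.0);
            return gib.ToString("F2", CultureInfo.InvariantCulture);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Values copied from the sample output above.
            long creationTime = 1533052057398;
            long endTime = 1533052066961;
            long totalBytes = 9944197093;

            Console.WriteLine($"Created (UTC): {StatsFormatter.FromEpochMillis(creationTime):u}");
            Console.WriteLine($"Elapsed: {StatsFormatter.ElapsedSeconds(creationTime, endTime)} s"); // Elapsed: 9.563 s
            Console.WriteLine($"Processed: {StatsFormatter.ToGiB(totalBytes)} GiB");                 // Processed: 9.26 GiB
        }
    }
}
```

In your own program you would pass the values from job.Statistics instead of the literals, after checking that they are not null.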

8. Loading data into BigQuery

If you want to query your own data, you first need to load it into BigQuery. BigQuery supports loading data from many sources, such as Google Cloud Storage, other Google services, or a readable source. You can even stream your data using streaming inserts. You can read more on the Loading Data into BigQuery page.

In this step, you will load a JSON file stored on Google Cloud Storage into a BigQuery table. The JSON file is located at gs://cloud-samples-data/bigquery/us-states/us-states.json

If you're curious about the contents of the JSON file, you can use the gsutil command-line tool to download it in Cloud Shell:

gsutil cp gs://cloud-samples-data/bigquery/us-states/us-states.json .

Copying gs://cloud-samples-data/bigquery/us-states/us-states.json...
/ [1 files][  2.0 KiB/  2.0 KiB]                                                
Operation completed over 1 objects/2.0 KiB.

You can see that it contains the list of US states and each state is a JSON document on a separate line:

less us-states.json

{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr":  "AK"}
...
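
To make the newline-delimited layout concrete, the following sketch parses lines of this shape one at a time with System.Text.Json. It is only an illustration of the file format and is not needed for loading the file into BigQuery. The NdjsonReader class is a hypothetical helper, and the two input lines are the ones shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

namespace BigQueryDemo
{
    // Hypothetical helper that reads newline-delimited JSON records.
    public static class NdjsonReader
    {
        // Each non-blank line is parsed as an independent JSON document.
        public static List<(string Name, string PostAbbr)> Parse(IEnumerable<string> lines)
        {
            var states = new List<(string Name, string PostAbbr)>();
            foreach (var line in lines)
            {
                if (string.IsNullOrWhiteSpace(line))
                {
                    continue;
                }

                using (var doc = JsonDocument.Parse(line))
                {
                    var root = doc.RootElement;
                    states.Add((
                        root.GetProperty("name").GetString(),
                        root.GetProperty("post_abbr").GetString()));
                }
            }
            return states;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // The first two lines of us-states.json, as shown above.
            var sample = new[]
            {
                "{\"name\": \"Alabama\", \"post_abbr\": \"AL\"}",
                "{\"name\": \"Alaska\", \"post_abbr\":  \"AK\"}"
            };

            foreach (var (name, abbr) in NdjsonReader.Parse(sample))
            {
                Console.WriteLine($"{abbr}: {name}");
            }
            // Prints:
            // AL: Alabama
            // AK: Alaska
        }
    }
}
```

The two property names, name and post_abbr, are the same ones used as column names in the table schema in the next listing.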

To load this JSON file into BigQuery, navigate to the Program.cs file inside the BigQueryDemo folder and replace the code with the following. Make sure you replace projectId with your actual project ID:

using System;
using Google.Cloud.BigQuery.V2;

namespace BigQueryDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            var gcsUri = "gs://cloud-samples-data/bigquery/us-states/us-states.json";
            var client = BigQueryClient.Create("projectId");
            var dataset = client.GetOrCreateDataset("us_states_dataset");

            var schema = new TableSchemaBuilder 
            {
                { "name", BigQueryDbType.String },
                { "post_abbr", BigQueryDbType.String }
            }.Build();

            var jobOptions = new CreateLoadJobOptions
            {
                SourceFormat = FileFormat.NewlineDelimitedJson
            };

            var table = dataset.GetTableReference("us_states_table");
            var loadJob = client.CreateLoadJob(gcsUri, table, schema, jobOptions);

            // PollUntilCompleted returns the completed job, so check that job for errors.
            loadJob = loadJob.PollUntilCompleted().ThrowOnAnyError();
            Console.WriteLine("Json file loaded to BigQuery");
        }
    }
}

Take a minute or two to study how the code loads the JSON file and creates a table with a schema under a dataset.

Back in Cloud Shell, run the app:

dotnet run

A dataset and a table are created in BigQuery:

Json file loaded to BigQuery

To verify that the dataset was actually created, you can go to the BigQuery console. You should see a new dataset and a table. If you switch to the Preview tab of the table, you can see the actual data.
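
You can also check the loaded rows from C# rather than the console. The following is a minimal sketch that queries the new table, assuming the load job finished successfully and that the dataset and table are named us_states_dataset and us_states_table as in the listing above. The ORDER BY and LIMIT 5 are arbitrary choices for a short listing.

```csharp
using System;
using Google.Cloud.BigQuery.V2;

namespace BigQueryDemo
{
    public class Program
    {
        // Names used by the load step in this codelab.
        public const string DatasetId = "us_states_dataset";
        public const string TableId = "us_states_table";

        static void Main(string[] args)
        {
            // Replace projectId with your actual project ID.
            var client = BigQueryClient.Create("projectId");

            // Look up the table in your own project.
            var table = client.GetTable(DatasetId, TableId);

            var sql = $"SELECT name, post_abbr FROM {table} ORDER BY name LIMIT 5";
            var results = client.ExecuteQuery(sql, parameters: null);

            foreach (var row in results)
            {
                Console.WriteLine($"{row["post_abbr"]}: {row["name"]}");
            }
        }
    }
}
```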

9. Congratulations!

You learned how to use BigQuery using C#!

Clean up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab:

Go to the Cloud Platform Console.
Select the project you want to shut down, then click Delete at the top. This schedules the project for deletion.

Learn More

Google BigQuery: https://cloud.google.com/bigquery/docs/
C#/.NET on Google Cloud Platform: https://cloud.google.com/dotnet/
Google Cloud .NET client: https://googlecloudplatform.github.io/google-cloud-dotnet/

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.