The flexibility to pay for what you use is one of many benefits for having your data warehouse in the cloud. Whether you are moving existing queries to BigQuery or writing new ones, it's beneficial to know the costs associated with your usage. The purpose of this codelab is to provide a step-by-step guide for collecting cost and performance data from Stackdriver and then to visualize it in a Data Studio dashboard.
BigQuery is Google's fully managed, petabyte scale, analytics data warehouse. BigQuery is NoOps—there is no infrastructure to manage and you don't need a database administrator—so you can focus on analyzing data to find meaningful insights.
BigQuery has two pricing models: on-demand and flat rate. In the on-demand pricing model, the amount you pay is based solely on usage, specifically, the number of bytes your query scans. The flat rate pricing model is a flat fee irrespective of how many bytes your query scans.
In both pricing models, it is beneficial to understand how your queries are performing, and how many resources they're consuming.
The below steps are documented as you would complete them using the console, but we want to be able to automate this portion of the setup.
Create a new dataset to contain exported audit logs. We will also be using these variables later, so be sure to use the same shell session for the following steps, or set the variables as needed. Be sure to replace <PROJECT_ID> with your project's ID.
export PROJECT=<PROJECT_ID> export DATASET=audit bq --project $PROJECT mk $DATASET
Create a new logging export sink as follows:
gcloud --project $PROJECT logging sinks create bigquery-audit bigquery.googleapis.com/projects/$PROJECT/datasets/audit --log-filter resource.type="bigquery_resource"
Once done, you need to allow the log writer service account to write to the sink.
Create an environment variable to use in the next step. The variable SA_EMAIL will contain the service account email address that we will grant WRITER access to our audit dataset.
Note that the following depends on the jq utility to parse the output JSON, so you may need to install that using the package manager for your operating system. If using Cloud Shell you can install it with apt-get install jq.
export SA_EMAIL=$(gcloud --project $PROJECT logging sinks describe bigquery-audit --format json | jq -r '.["writerIdentity"]' | sed -e 's/serviceAccount://g')
Assign the BigQuery dataset writer role to the service account. First, create a script to update the access entries:
cat <<EOF>update_for_audit.py #!/usr/bin/env python import os from google.cloud import bigquery SA_EMAIL=os.environ["SA_EMAIL"] PROJECT=os.environ["PROJECT"] DATASET=os.environ["DATASET"] client = bigquery.Client(project=PROJECT) dataset = client.get_dataset(client.dataset(DATASET)) entry = bigquery.AccessEntry(role="WRITER", entity_type="userByEmail", entity_id=SA_EMAIL) entries = list(dataset.access_entries) entries.append(entry) dataset.access_entries = entries dataset = client.update_dataset(dataset, ["access_entries"]) EOF
Run the script to add the service account with WRITER permission to the audit dataset.
You should now see logs flowing into a newly created table in the dataset above.
Note: If you are using the Beta UX for BigQuery, you may have to refresh the page to see the new audit dataset appear.
Now that we've set up logging export, let's submit a simple query before we create views over the exported audit logs.
bq query 'SELECT 42 as the_answer'
You should see something like this:
$ bq query 'SELECT 42 as the_answer' Waiting on bqjob_r2bd8609c7ffcf931_00000168e47fd1c9_1 ... (0s) Current status: DONE +------------+ | the_answer | +------------+ | 42 | +------------+
Next, create a view over the raw audit table. The view query is large, and won't display well here, so we put it in a script. You can review it on github, in this gist.
You can execute the script from the gist as follows, replacing <PROJECT_ID> with your project id:
export PROJECT=<PROJECT_ID> export DATASET=audit curl -sL https://gist.githubusercontent.com/geoffmc/bb4d52566045e904b1b907777044c1b6/raw/160d92749bb16d3f71f7b45321bfc48a903d1174/create_bq_query_audit.sh | bash
If this worked, you should see a single line of output.
View 'your-project-id:audit.bq_query_audit' successfully created.
Do the same for the load audit view.
curl -sL https://gist.githubusercontent.com/geoffmc/bb4d52566045e904b1b907777044c1b6/raw/160d92749bb16d3f71f7b45321bfc48a903d1174/create_bq_load_audit.sh | bash
Now that we have a good way to query the audit data, we should produce some visualizations.
We'll start with a sample dashboard, create a copy and update the data sources.
Open the sample dashboard in a web browser.
Click the "Make a copy of this report" button on the top right.
The "Create new report" menu should appear which allows you to select your data source(s). Replace the data sources as follows.
For the first data source, click the "Select a datasource" dropdown and choose "Create New Data Source".
Choose the connector for "BigQuery".
Navigate the project and dataset menus to select the view you created in the previous section titled "bq_query_audit" as the table. Click Connect and then "Add to report".
Repeat the steps above for the "Original Data Source" of "bq_load_audit" and select the view you created earlier titled "bq_load_audit".
Once the data sources have been updated, select "Create Report".
The Data Studio visualizations allow you to identify costly queries so that you can optimize for cost and query performance. It will also provide insight into the usage patterns and resource utilization associated with your workload. The three visualization tabs are as follows.
The dashboard shows high level metrics around the total number of queries executed, total estimated cost of all queries for the specified time period, volume of data processed and the average query runtime.
Data can be broken down for a specific time period using the date selector in the top left corner, and can optionally be filtered by project and / or user.
The lower table on the dashboard provides detailed information for individual queries and uses a heat map to identify the queries using the most resources (Avg Slots) and taking the longest amount of time to complete (Avg Runtime); these are good candidates to investigate whether or not they can be optimized.
Allows you to see information about loading your data into BigQuery. The charts show the number and size of loads, broken down by table and project.
The lower table provides detailed statistics on each load job; this allows you to see how much data was loaded and how long it took for each load job.
Provides insight into the query usage pattern, broken down by day of the week and hour of the day, and compared to the previous 7 days.