In this codelab, you'll take data from the US Federal Election Commission, clean it up, and load it into BigQuery. You'll also get a chance to ask some interesting questions of that dataset.

While this codelab doesn't assume any prior experience with BigQuery, some understanding of SQL will help you get more out of it.

What you'll learn

What you'll need

Step 1

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign-in to Google Cloud Platform console (console.cloud.google.com) and create a new project:



Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID.

Step 2

Next, you'll need to enable billing in the Developers Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see "cleanup" section at the end of this document). BigQuery pricing is documented here.

New users of Google Cloud Platform are eligible for a $300 free trial.

Step 3

While Google Cloud and Big Query can be operated remotely from your laptop, in this codelab we will be using Google Cloud Shell, a command line environment running in the Cloud.

This Debian-based virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on the Google Cloud, greatly enhancing network performance and authentication. This means that all you will need for this codelab is a browser (yes, it works on a Chromebook).

To activate Google Cloud Shell, from the developer console simply click the button on the top right-hand side (it should only take a few moments to provision and connect to the environment):

Step 4

Once connected to the cloud shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID :

gcloud auth list

Command output

Credentialed accounts: 
- <myaccount>@<mydomain>.com (active)

Step 5

Enter the following command:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If for some reason the project is not set, simply issue the following command :

gcloud config set project <PROJECT_ID>

Looking for your PROJECT_ID? Check out what ID you used in the setup steps or look it up in the console dashboard:

Step 6

Finally, set the default zone and project configuration:

gcloud config set compute/zone us-central1-f

You can choose a variety of different zones. Learn more in the Regions & Zones documentation.

Step 1

In order to run the BigQuery queries in this codelab, you'll need your own dataset. Pick a name for it, such as campaign_funding. Run the following commands in CloudShell:

DATASET=campaign_funding; \
bq mk -d ${DATASET}

After your dataset has been created, you should be ready to go. Running this command should also help to verify that you've got the bq command line client setup correctly, authentication is working, and you have write access to the cloud project you're operating under. If you have more than one project, you will be prompted to select the one you're interested in from a list.

The US Federal Election Commission campaign finance dataset has been decompressed and copied to the GCS bucket gs://campaign-funding/.

Step 1

Let's download one of the source files locally so that we can see what it looks like. Run the following commands from a command window:

gsutil cp gs://campaign-funding/indiv16.txt . ;\
tail indiv16.txt

This should display the contents of the individual contributions file. There are three types of files we'll be looking at for this codelab: individual contributions (indiv*.txt), candidates (cn*.txt), and committees (cm*.txt). If you're interested, use the same mechanism to check out what is in those other files.

The dataset you'll be using is described on the federal election website here. The schemas for the tables we'll be looking at are:

Step 1

Run the following command to import the individual donor data into a new table in BigQuery:

bq load \
-q \
--max_bad_records=10000 \
--source_format=CSV \
--noallow_jagged_rows \
--field_delimiter="|" \
--noallow_quoted_newlines \
--skip_leading_rows=0 \
--ignore_unknown_values \
--encoding=UTF-8 \
--quote=\" \
${DATASET}.transactions \
gs://campaign-funding/indiv*.txt \
CMTE_ID,AMNDT_IND,RPT_TP,TRANSACTION_PGI,IMAGE_NUM,TRANSACTION_TP,ENTITY_TP,NAME,CITY,STATE,ZIP_CODE,EMPLOYER,OCCUPATION,TRANSACTION_DT,TRANSACTION_AMT:FLOAT,OTHER_ID,TRAN_ID,FILE_NUM,MEMO_CD,MEMO_TEXT,SUB_ID

Step 2

Run the following command to import the candidate data into a new table in BigQuery:

bq load \
-q \
--source_format=CSV \
--noallow_jagged_rows \
--field_delimiter="|" \
--noallow_quoted_newlines \
--skip_leading_rows=0 \
--ignore_unknown_values \
--encoding=UTF-8 \
${DATASET}.candidates \
gs://campaign-funding/cn*.txt \
CAND_ID,CAND_NAME,CAND_PTY_AFFILIATION,CAND_ELECTION_YR,CAND_OFFICE_ST,CAND_OFFICE,CAND_OFFICE_DISTRICT,CAND_ICI,CAND_STATUS,CAND_PCC,CAND_ST1,CAND_ST2,CAND_CITY,CAND_ST,CAND_ZIP

Step 3

Run the following command to import the committee data into a new table in BigQuery:

bq load \
-q \
--max_bad_records=10000 \
--source_format=CSV \
--allow_jagged_rows \
--field_delimiter="|" \
--noallow_quoted_newlines \
--skip_leading_rows=0 \
--ignore_unknown_values \
--encoding=UTF-8 ${DATASET}.committees \
gs://campaign-funding/cm*.txt \
CMTE_ID,CMTE_NM,TRES_NM,CMTE_ST1,CMTE_ST2,CMTE_CITY,CMTE_ST,CMTE_ZIP,CMTE_DSGN,CMTE_TP,CMTE_PTY_AFFILIATION,CMTE_FILING_FREQ,ORG_TP,CONNECTED_ORG_NM,CAND_ID

Step 1

Next, let's actually try running a couple of queries. Open the BigQuery Web UI.

Step 2

Find your dataset in the left navigation pane (you might have to change the project dropdown in the top left corner), click the big red COMPOSE QUERY button, and enter the following query in the box:

SELECT * FROM [campaign_funding.transactions] 
WHERE EMPLOYER contains "GOOGLE" 
ORDER BY TRANSACTION_DT DESC
LIMIT 100

This will find the most recent 100 campaign donations by employees of Google. If you'd like, try playing around and finding campaign donations from residents of your zip code or find the largest donations in your city.

The query and the results will look something like this:

One thing you might notice, however, is that you can't really tell who the recipient was of these donations. We need to come up with some fancier queries to get that information.

Step 1

Click on the transactions table in the left pane, and click on the schema tab. It should look like the screenshot below:

We can see a list of fields that match the table definition we specified previously. You may notice there is no recipient field, or any way to figure out what candidate the donation supported. However, there is a field called CMTE_ID. This will let us link the committee that was the recipient of the donation to the donation. This still isn't all that useful.

Step 2

Next, click on the committees table to check out its schema. We've got a CMET_ID, which can join us to the transactions table. Another field is CAND_ID; this can be joined with a CAND_ID table in the candidates table. Finally, we have a link between transactions and candidates by going through the committees table.

Let's get a sample of the data by running a simple SELECT * query on the candidates table.

SELECT * FROM [campaign_funding.candidates]
LIMIT 100

The result should look something like this:

One thing you might notice, is that the candidate names are ALL CAPS and are presented in "lastname, firstname" order. This is a little bit annoying, since this isn't really how we tend to think of the candidates; we'd rather see "Barack Obama" than "OBAMA, BARACK". Moreover, the transaction dates (TRANSACTION_DT) in the transactions table are a bit awkward as well. They are string values in the format YYYYMMDD. We'll address these quirks in the next section.

Step 1

Now that we have an understanding of how transactions relate to candidates, let's run a query to figure out who is giving money to whom. Cut and paste the following query into the compose box:

SELECT affiliation, SUM(amount) AS amount
FROM (
  SELECT *
  FROM (
    SELECT
      t.amt AS amount,
      t.occupation AS occupation,
      c.affiliation AS affiliation,
    FROM (
      SELECT
        trans.TRANSACTION_AMT AS amt,
        trans.OCCUPATION AS occupation,
        cmte.CAND_ID AS CAND_ID
      FROM [campaign_funding.transactions] trans
      RIGHT OUTER JOIN EACH (
        SELECT
          CMTE_ID,
          FIRST(CAND_ID) AS CAND_ID
        FROM [campaign_funding.committees]
        GROUP EACH BY CMTE_ID ) cmte
      ON trans.CMTE_ID = cmte.CMTE_ID) AS t
    RIGHT OUTER JOIN EACH (
      SELECT
        CAND_ID,
        FIRST(CAND_PTY_AFFILIATION) AS affiliation,
      FROM [campaign_funding.candidates]
      GROUP EACH BY CAND_ID) c
    ON t.CAND_ID = c.CAND_ID )
  WHERE occupation CONTAINS "ENGINEER")
GROUP BY affiliation
ORDER BY amount DESC

This query joins the transactions table to the committees table and then to the candidates table. It only looks at transactions from people with the word "ENGINEER" in their occupation title. The query aggregates results by party affiliation; this lets us see the distribution of giving to various political parties amongst engineers.

We can see that engineers are a pretty balanced bunch, giving more or less evenly to democrats and republicans. But what is the ‘DFL' party? Wouldn't it be nice to actually get full names, rather than just a three letter code?

The party codes are defined on the FEC website. There is a table that matches the party code to the full name (it turns out that ‘DFL' is ‘Democratic-Farmer-Labor'). While we could manually perform the translations in our query, that seems like a lot of work, and difficult to keep in sync.

What if we could parse the HTML as part of the query? Right click anywhere on that page and look at "view page source". There is a lot of header / boilerplate information in the source, but find the <table> tag. Each mapping row is in an HTML <tr> element, the name and the code are both wrapped in <td> elements. Each row will look something like this:

The HTML looks something like this:

<tr bgcolor="#F5F0FF">
    <td scope="row"><div align="left">ACE</div></td>
    <td scope="row">Ace Party</td>
    <td scope="row"></td>
</tr>

Note that BigQuery can't read the file directly from the web; this because bigquery is capable of hitting a source from thousands of workers simultaneously. If this were allowed to run against random web pages, it would essentially be a distributed denial of service attack (DDoS). The html file from the FEC web page is stored in the gs://campaign-funding bucket.

We'll need to make a table based on the campaign funding data. This will be similar to the other GCS-backed tables we created. The difference here is that we don't actually have a schema; we'll just use a single field per row and call it ‘data'. We'll pretend that it is a CSV file, with but instead of comma-delimiting, we'll use a bogus delimiter (`) and no quote character.

Step 1

To create the party lookup table, run the following command from the command line:

echo '{"csvOptions": {"allowJaggedRows": false, "skipLeadingRows": 0, "quote": "", "encoding": "UTF-8", "fieldDelimiter": "`", "allowQuotedNewlines": false}, "ignoreUnknownValues": true, "sourceFormat": "CSV", "sourceUris": ["gs://campaign-funding/party_codes.shtml"], "schema": {"fields": [{"type": "STRING", "name": "data"}]}}' > party_raw_def.json; \
bq mk --external_table_definition=party_raw_def.json -t ${DATASET}.raw_party_codes 

Step 2

We will now use javascript to parse the file. In the top right of the BigQuery Query Editor should be a button labeled UDF Editor. Click on it to switch to editing a javascript UDF. The UDF editor will be populated with some commented out boilerplate.

Go ahead and delete the code it contains, and enter the following code:

function tableParserFun(row, emitFn) {
  if (row.data != null && row.data.match(/<tr.*<\/tr>/) !== null) {
    var txt = row.data
    var re = />\s*(\w[^\t<]*)\t*<.*>\s*(\w[^\t<]*)\t*</;
    matches = txt.match(re);
    if (matches !== null && matches.length > 2) {
        var result = {code: matches[1], name: matches[2]};
        emitFn(result);
    } else {
        var result = { code: 'ERROR', name: matches};
        emitFn(result);
    }
  }
}

bigquery.defineFunction(
  'tableParser',               // Name of the function exported to SQL
  ['data'],                    // Names of input columns
  [{'name': 'code', 'type': 'string'},  // Output schema
   {'name': 'name', 'type': 'string'}],
  tableParserFun // Reference to JavaScript UDF
);

The javascript here is divided into two pieces; the first is a function that takes a row of input emits a parsed output. The other is a definition that registers that function as a User Defined Function (UDF) with the name tableParser, and indicates that it takes an input column called ‘data' and outputs two columns, code and name. The code column will be the three-letter code, the name column is the full name of the party.

Step 3

Switch back to the Query Editor tab, and enter the following query:

SELECT code, name FROM tableParser([campaign_funding.raw_party_codes])
ORDER BY code

Click on Run Query. Running this query will parse the raw HTML file and output the field values in structured format. Pretty slick, eh? See if you can figure out what ‘DFL' stands for.

Step 1

Now that we can translate party codes to names, let's try another query that uses this to find out something interesting. Run the following query:

SELECT
  candidate,
  election_year,
  FIRST(candidate_affiliation) AS affiliation,
  SUM(amount) AS amount
FROM (
  SELECT 
    CONCAT(REGEXP_EXTRACT(c.candidate_name,r'\w+,[ ]+([\w ]+)'), ' ',
      REGEXP_EXTRACT(c.candidate_name,r'(\w+),')) AS candidate,
    pty.candidate_affiliation_name AS candidate_affiliation,
    c.election_year AS election_year,
    t.amt AS amount,
  FROM (
    SELECT
      trans.TRANSACTION_AMT AS amt,
      cmte.committee_candidate_id AS committee_candidate_id
    FROM [campaign_funding.transactions] trans
    RIGHT OUTER JOIN EACH (
      SELECT
        CMTE_ID,
        FIRST(CAND_ID) AS committee_candidate_id
      FROM [campaign_funding.committees]
      GROUP BY CMTE_ID ) cmte
    ON trans.CMTE_ID = cmte.CMTE_ID) AS t
  RIGHT OUTER JOIN EACH (
    SELECT
      CAND_ID AS candidate_id,
      FIRST(CAND_NAME) AS candidate_name,
      FIRST(CAND_PTY_AFFILIATION) AS affiliation,
      FIRST(CAND_ELECTION_YR) AS election_year,
    FROM [campaign_funding.candidates]
    GROUP BY candidate_id) c
  ON t.committee_candidate_id = c.candidate_id
  JOIN (
    SELECT
      code,
      name AS candidate_affiliation_name
    FROM (tableParser([campaign_funding.raw_party_codes]))) pty
  ON pty.code = c.affiliation )
GROUP BY candidate, election_year
ORDER BY amount DESC
LIMIT 100

This query will show which candidates got the largest campaign donations, and will spell out their party affiliations.

You've now cleaned and imported data from the FEC website into BigQuery!

What we've covered

Next Steps

Learn More

Give us your feedback