You might think that aggregate statistics don't leak any information about the individuals whose data the statistics are composed of. However, there are many ways an attacker can learn sensitive information about individuals in a dataset from an aggregate statistic.

To protect individuals' privacy, you will learn how to produce private statistics using differentially private aggregations from Privacy on Beam. Privacy on Beam is a differential privacy framework that works with Apache Beam.

What do we mean by "private"?

When using the word "private" throughout this codelab, we mean that the output is produced in a way that does not leak any private information about the individuals in the data. We can do this using differential privacy, a strong notion of anonymization. Anonymization is the process of aggregating data across multiple users to protect user privacy. All anonymization methods use aggregation, but not all aggregation methods achieve anonymization. Differential privacy, on the other hand, provides measurable guarantees regarding information leakage and privacy.

To better understand differential privacy, let us look at a simple example.

This bar chart shows the busyness of a small restaurant on one particular evening. Lots of guests come at 7pm, and the restaurant is completely empty at 1am:

This looks useful!

There's a catch. When a new guest arrives, this fact is immediately revealed by the bar chart. Look at the chart: it's clear that there's a new guest, and that this guest arrived at roughly 1am:

This isn't great from a privacy perspective. A truly anonymized statistic shouldn't reveal individual contributions. Putting those two charts side by side makes it even more apparent: the orange bar chart has one extra guest who arrived at ~1am:

Again, that's not great. What do we do?

We'll make bar charts a bit less accurate by adding random noise!

Look at the two bar charts below. While not entirely accurate, they're still useful, and they don't reveal individual contributions. Nice!

Differential privacy is adding the right amount of random noise to mask individual contributions.
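
To make the idea concrete, here is a minimal, self-contained sketch in plain Go (not Privacy on Beam) of adding Laplace noise to a single count. The epsilon value and the sensitivity of 1 are assumptions chosen purely for illustration:

package main

import (
    "fmt"
    "math"
    "math/rand"
)

// laplaceNoise draws one sample from a Laplace distribution centered at 0.
// For a count, each guest changes the result by at most 1, so the scale
// is sensitivity/epsilon = 1/epsilon.
func laplaceNoise(scale float64) float64 {
    u := rand.Float64() - 0.5 // uniform in [-0.5, 0.5)
    if u < 0 {
        return scale * math.Log(1+2*u)
    }
    return -scale * math.Log(1-2*u)
}

func main() {
    rawCount := 12.0 // e.g. the number of guests at 7pm
    epsilon := 0.5   // hypothetical privacy parameter
    noisyCount := rawCount + laplaceNoise(1.0/epsilon)
    fmt.Printf("raw: %.0f, noisy: %.2f\n", rawCount, noisyCount)
}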

Our analysis was somewhat oversimplified. Implementing differential privacy properly is more involved and comes with a number of unexpected subtleties. As with cryptography, creating your own implementation of differential privacy is not a great idea. You can use Privacy on Beam instead of implementing your own solution. Don't roll your own differential privacy!

In this codelab, we'll show how to perform differentially private analysis using Privacy on Beam.

First, download Privacy on Beam:

https://github.com/google/differential-privacy/archive/master.zip

Or you can clone the GitHub repository:

git clone https://github.com/google/differential-privacy.git

Privacy on Beam is in the top-level privacy-on-beam/ directory.

The code and the dataset for this codelab are in the privacy-on-beam/codelab/ directory.

You also need to have Bazel installed on your computer. Find the installation instructions for your operating system on the Bazel website.

Imagine you are a restaurant owner and would like to share some statistics about your restaurant, such as popular visit times. Thankfully, you know about differential privacy and anonymization, so you want to do this in a way that does not leak information about any individual visitor.

The code for this example is in codelab/count.go.

Let's start by loading a mock dataset containing visits to your restaurant on a particular Monday. The loading code isn't the focus of this codelab, but you can find it in codelab/main.go, codelab/utils.go and codelab/visit.go. The first few rows look like this:

Visitor ID | Time entered | Time spent (mins) | Money spent (euros)
-----------|--------------|-------------------|--------------------
1          | 9:30:00 AM   | 26                | 24
2          | 11:54:00 AM  | 53                | 17
3          | 1:05:00 PM   | 81                | 33
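
For reference, here is roughly what the Visit struct looks like, reconstructed from the fields the code samples below use (the exact field types are assumptions; see codelab/visit.go for the actual definition):

import "time"

// Visit represents a single visit to the restaurant.
type Visit struct {
    VisitorID    string    // Unique identifier of the visitor.
    TimeEntered  time.Time // When the visitor entered the restaurant.
    MinutesSpent int       // Length of the stay, in minutes.
    MoneySpent   int       // Money spent during the visit, in euros.
}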

You'll first produce a non-private bar chart of visit times to your restaurant using Beam in the code sample below. Scope is a representation of the pipeline, and each new operation we perform on the data gets added to the Scope. CountVisitsPerHour takes a Scope and a collection of visits, which is represented as a PCollection in Beam. It extracts the hour of each visit by applying the extractVisitHour function to the collection, then counts the occurrences of each hour and returns the result.

func CountVisitsPerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("CountVisitsPerHour")
    // Extract the hour each visitor entered.
    visitHours := beam.ParDo(s, extractVisitHour, col)
    // Count how many visits fall into each hour.
    visitsPerHour := stats.Count(s, visitHours)
    return visitsPerHour
}

func extractVisitHour(v Visit) int {
    return v.TimeEntered.Hour()
}

This produces a nice bar chart (by running bazel run codelab -- --example="count" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/count.csv --output_chart_file=$(pwd)/count.png):

The next step is to convert your pipeline, and with it your bar chart, into a private one. We do this as follows.

First, call MakePrivateFromStruct on a PCollection<V> to get a PrivatePCollection<V>. The input PCollection needs to be a collection of structs. We pass a PrivacySpec and an idFieldPath to MakePrivateFromStruct.

spec := pbeam.NewPrivacySpec(epsilon, delta)
pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

PrivacySpec is a struct that holds the differential privacy parameters (epsilon and delta) we want to use to anonymize the data. (You don't need to worry about them for now; there is an optional section later if you would like to learn more about them.)

idFieldPath is the path of the user identifier field within the struct (Visit in our case). Here, the user identifier of the visitors is the VisitorID field of Visit.

Then, we call pbeam.Count() instead of stats.Count(). pbeam.Count() takes as input a CountParams struct, which holds parameters such as MaxValue that affect the accuracy of the output.

visitsPerHour := pbeam.Count(s, visitHours, pbeam.CountParams{
    MaxPartitionsContributed: 1, // Visitors can visit the restaurant once (one hour) a day
    MaxValue:                 1, // Visitors can visit the restaurant once within an hour
})

MaxPartitionsContributed bounds how many different visit hours a user can contribute to. We expect visitors to visit the restaurant at most once a day (or we don't care if they visit it multiple times over the course of the day), so we set it to 1.

Similarly, MaxValue bounds how many times a single user can contribute to the values we are counting. In this particular case, the values we are counting are visit hours, and we expect a visitor to enter the restaurant at most once per hour (or we don't care if they enter it multiple times within the same hour), so we set this parameter to 1 as well. We'll talk about these parameters in more detail in an optional section.

In the end, your code will look like this:

func PrivateCountVisitsPerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateCountVisitsPerHour")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, delta)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    visitHours := pbeam.ParDo(s, extractVisitHour, pCol)
    visitsPerHour := pbeam.Count(s, visitHours, pbeam.CountParams{
        MaxPartitionsContributed: 1, // Visitors can visit the restaurant once (one hour) a day
        MaxValue:                 1, // Visitors can visit the restaurant once within an hour
    })
    return visitsPerHour
}

When we run the code again (with bazel run codelab -- --example="count" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/count.csv --output_chart_file=$(pwd)/count.png), we see a similar bar chart:

Congratulations! You calculated your first differentially private statistic!

The bar chart you get when you run the code might be different from this one. That's OK. Because of the noise in differential privacy, you'll get a different bar chart each time you run the code, but you can see that they are more or less similar to the original non-private bar chart we had.

Please note that, for the privacy guarantees to hold, it is very important not to re-run the pipeline multiple times (for example, in order to get a better-looking bar chart). The reason why you shouldn't re-run your pipelines is explained in the "Computing Multiple Statistics" section.

Now that we know how to count things in a differentially private way, let us look into calculating means. More specifically, we will now compute the average length of stay of visitors.

The code for this example is in codelab/mean.go.

Normally, to calculate a non-private mean of stay durations, we would use stats.MeanPerKey() with a pre-processing step that converts the incoming PCollection of visits to a PCollection<K,V> where K is the visit hour and V is the time the visitor spent in the restaurant.

func MeanTimeSpent(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("MeanTimeSpent")
    hourToTimeSpent := beam.ParDo(s, extractVisitHourAndTimeSpentFn, col)
    meanTimeSpent := stats.MeanPerKey(s, hourToTimeSpent)
    return meanTimeSpent
}

func extractVisitHourAndTimeSpentFn(v Visit) (int, int) {
    return v.TimeEntered.Hour(), v.MinutesSpent
}

And the output we get (by running bazel run codelab -- --example="mean" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/mean.csv --output_chart_file=$(pwd)/mean.png) is:

To make this differentially private, we again convert our PCollection to a PrivatePCollection and replace stats.MeanPerKey() with pbeam.MeanPerKey(). Similar to Count, we have MeanParams that hold some parameters such as MinValue and MaxValue that affect the accuracy. MinValue and MaxValue represent the bounds we have for each user's contribution to each key.

meanTimeSpent := pbeam.MeanPerKey(s, hourToTimeSpent, pbeam.MeanParams{
    MaxPartitionsContributed:     1,  // Visitors can visit the restaurant once (one hour) a day
    MaxContributionsPerPartition: 1,  // Visitors can visit the restaurant once within an hour
    MinValue:                     0,  // Minimum time spent per user (in mins)
    MaxValue:                     60, // Maximum time spent per user (in mins)
})

In this case, each key represents an hour and values are the time visitors spent. We set MinValue to 0 because we don't expect visitors to spend less than 0 minutes in the restaurant. We set MaxValue to 60, which means if a visitor spends more than 60 minutes, we act as if that user spent 60 minutes.
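
Conceptually, this per-value clamping behaves like the tiny helper below (an illustrative sketch only; pbeam performs the bounding internally):

// clamp bounds a contributed value to [minValue, maxValue].
func clamp(value, minValue, maxValue float64) float64 {
    if value < minValue {
        return minValue
    }
    if value > maxValue {
        return maxValue
    }
    return value
}

For example, with MinValue=0 and MaxValue=60, a visitor who stayed 81 minutes contributes clamp(81, 0, 60) = 60 minutes to the mean.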

In the end, your code will look like this:

func PrivateMeanTimeSpent(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateMeanTimeSpent")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, delta)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    hourToTimeSpent := pbeam.ParDo(s, extractVisitHourAndTimeSpentFn, pCol)
    meanTimeSpent := pbeam.MeanPerKey(s, hourToTimeSpent, pbeam.MeanParams{
        MaxPartitionsContributed:     1,  // Visitors can visit the restaurant once (one hour) a day
        MaxContributionsPerPartition: 1,  // Visitors can visit the restaurant once within an hour
        MinValue:                     0,  // Minimum time spent per user (in mins)
        MaxValue:                     60, // Maximum time spent per user (in mins)
    })
    return meanTimeSpent
}

When we run the differentially private pipeline (with bazel run codelab -- --example="mean" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/mean.csv --output_chart_file=$(pwd)/mean.png), we get:

Again, similar to count, since this is a differentially private operation, we'll get different results each time we run it. But you can see the differentially private lengths of stay are not far off from the actual result.

Another interesting statistic we could look at is revenue per hour over the course of the day.

The code for this example is in codelab/sum.go.

Again, we'll start with the non-private version. With some pre-processing on our mock dataset, we can create a PCollection<K,V> where K is the visit hour and V is the money the visitor spent in the restaurant. To calculate the non-private revenue per hour, we can simply sum all the money visitors spent by calling stats.SumPerKey():

func RevenuePerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("RevenuePerHour")
    hourToMoneySpent := beam.ParDo(s, extractVisitHourAndMoneySpent, col)
    revenues := stats.SumPerKey(s, hourToMoneySpent)
    return revenues
}

func extractVisitHourAndMoneySpent(v Visit) (int, int) {
    return v.TimeEntered.Hour(), v.MoneySpent
}

And the output we get (by running bazel run codelab -- --example="sum" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/sum.csv --output_chart_file=$(pwd)/sum.png) is:

To make this differentially private, we again convert our PCollection to a PrivatePCollection and replace stats.SumPerKey() with pbeam.SumPerKey(). Similar to Count and MeanPerKey, we have SumParams that hold some parameters such as MinValue and MaxValue that affect the accuracy.

revenues := pbeam.SumPerKey(s, hourToMoneySpent, pbeam.SumParams{
    MaxPartitionsContributed: 1,  // Visitors can visit the restaurant once (one hour) a day
    MinValue:                 0,  // Minimum money spent per user (in euros)
    MaxValue:                 40, // Maximum money spent per user (in euros)
})

In this case, MinValue and MaxValue represent the bounds we have for the money each visitor spends. We set MinValue to 0 because we don't expect visitors to spend less than 0 euros in the restaurant. We set MaxValue to 40, which means if a visitor spends more than 40 euros, we act as if that user spent 40 euros.

In the end, the code will look like this:

func PrivateRevenuePerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateRevenuePerHour")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, delta)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    hourToMoneySpent := pbeam.ParDo(s, extractVisitHourAndMoneySpent, pCol)
    revenues := pbeam.SumPerKey(s, hourToMoneySpent, pbeam.SumParams{
        MaxPartitionsContributed: 1,  // Visitors can visit the restaurant once (one hour) a day
        MinValue:                 0,  // Minimum money spent per user (in euros)
        MaxValue:                 40, // Maximum money spent per user (in euros)
    })
    return revenues
}

When we run the differentially private pipeline (with bazel run codelab -- --example="sum" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/sum.csv --output_chart_file=$(pwd)/sum.png), we get:

Again, similar to count and mean, since this is a differentially private operation, we'll get different results each time we run it. But you can see the differentially private result is very close to the actual revenues per hour.

Often, you'll want to compute multiple statistics over the same underlying data, as you have just done with count, mean and sum. It is usually cleaner and easier to do this in a single Beam pipeline and in a single binary. You can do this with Privacy on Beam as well: write a single pipeline to run your transformations and computations, and use a single PrivacySpec for the whole pipeline.

Not only is it more convenient to use a single PrivacySpec, it is also better in terms of privacy. Remember the epsilon and delta parameters we supply to the PrivacySpec? They represent something called a privacy budget: a measure of how much of the privacy of the users in the underlying data you are leaking.

An important thing to remember about the privacy budget is that it is additive: if you run a pipeline with a particular epsilon ε and delta δ a single time, you spend an (ε,δ) budget. If you run it a second time, you'll have spent a total budget of (2ε, 2δ). Similarly, if you compute two statistics, each with its own PrivacySpec (and consequently its own privacy budget) of (ε,δ), you'll have spent a total budget of (2ε, 2δ). This means that you are degrading the privacy guarantees.
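
This additivity is the basic sequential composition theorem of differential privacy: if one computation is (ε₁, δ₁)-differentially private and another is (ε₂, δ₂)-differentially private, then releasing the results of both over the same data is (ε₁+ε₂, δ₁+δ₂)-differentially private.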

To avoid this, when you want to compute multiple statistics over the same underlying data, use a single PrivacySpec with the total budget you want to spend, and specify the epsilon and delta each aggregation is allowed to use. You end up with the same overall privacy guarantee; but the higher epsilon and delta a particular aggregation has, the higher its accuracy will be.
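
The split doesn't have to be equal. As a hypothetical sketch reusing the aggregations from this codelab, if accurate counts matter most to you, you could give Count half of the total budget and give mean and sum a quarter each:

// Hypothetical unequal split: the per-aggregation epsilons and deltas
// must add up to at most the totals given to the PrivacySpec.
visitsPerHour = pbeam.Count(s, visitHours, pbeam.CountParams{
    Epsilon:                  epsilon / 2, // half the budget: more accurate counts
    Delta:                    delta / 2,
    MaxPartitionsContributed: 1,
    MaxValue:                 1,
})
// ...and epsilon/4, delta/4 each for pbeam.MeanPerKey and pbeam.SumPerKey.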

To see this in action, we can compute the three statistics (count, mean and sum) we computed separately before in a single pipeline.

The code for this example is in codelab/multiple.go. Notice how we are splitting the total (ε,δ) budget equally between the three aggregations:

func ComputeCountMeanSum(s beam.Scope, col beam.PCollection) (visitsPerHour, meanTimeSpent, revenues beam.PCollection) {
    s = s.Scope("ComputeCountMeanSum")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, delta) // Shared by count, mean and sum.
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    visitHours := pbeam.ParDo(s, extractVisitHour, pCol)
    visitsPerHour = pbeam.Count(s, visitHours, pbeam.CountParams{
        Epsilon:                  epsilon / 3,
        Delta:                    delta / 3,
        MaxPartitionsContributed: 1, // Visitors can visit the restaurant once (one hour) a day
        MaxValue:                 1, // Visitors can visit the restaurant once within an hour
    })

    hourToTimeSpent := pbeam.ParDo(s, extractVisitHourAndTimeSpentFn, pCol)
    meanTimeSpent = pbeam.MeanPerKey(s, hourToTimeSpent, pbeam.MeanParams{
        Epsilon:                      epsilon / 3,
        Delta:                        delta / 3,
        MaxPartitionsContributed:     1,  // Visitors can visit the restaurant once (one hour) a day
        MaxContributionsPerPartition: 1,  // Visitors can visit the restaurant once within an hour
        MinValue:                     0,  // Minimum time spent per user (in mins)
        MaxValue:                     60, // Maximum time spent per user (in mins)
    })

    hourToMoneySpent := pbeam.ParDo(s, extractVisitHourAndMoneySpent, pCol)
    revenues = pbeam.SumPerKey(s, hourToMoneySpent, pbeam.SumParams{
        Epsilon:                  epsilon / 3,
        Delta:                    delta / 3,
        MaxPartitionsContributed: 1,  // Visitors can visit the restaurant once (one hour) a day
        MinValue:                 0,  // Minimum money spent per user (in euros)
        MaxValue:                 40, // Maximum money spent per user (in euros)
    })

    return visitsPerHour, meanTimeSpent, revenues
}

You have seen quite a few parameters mentioned in this codelab: epsilon, delta, maxPartitionsContributed, etc. We can roughly divide them into two categories: Privacy Parameters and Utility Parameters.

Privacy Parameters

Epsilon and delta are the parameters that quantify the privacy we are providing by using differential privacy. More precisely, epsilon and delta are a measure of how much information a potential attacker gains about the underlying data by looking at the anonymized output. The higher epsilon and delta are, the more information the attacker gains about the underlying data, which is a privacy risk.
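
For reference, here is the formal definition: a randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single user's data, and for any set S of possible outputs,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ

In other words, smaller ε and δ make the outputs on D and D′ harder to tell apart, so the output reveals less about any single user.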

On the other hand, the lower epsilon and delta are, the more noise you need to add to the output to anonymize it, and the more unique users you need in each partition to keep that partition in the anonymized output. So, there is a tradeoff between utility and privacy here.

In Privacy on Beam, you only need to think about the privacy guarantees you want in your anonymized output when you specify the total privacy budget in the PrivacySpec. The caveat is that if you want your privacy guarantees to hold, you need to follow the advice in this codelab: don't overuse your budget by creating a separate PrivacySpec for each aggregation or by running the pipeline multiple times.

For more information about Differential Privacy and what the privacy parameters mean, you can take a look at the literature.

Utility Parameters

These are parameters that don't affect the privacy guarantees (as long as advice on how to use Privacy on Beam is properly followed) but affect the accuracy, and consequently the utility of the output. They are provided in the Params structs of each aggregation, e.g. CountParams, SumParams, etc. These parameters are used to scale the noise being added.

A utility parameter provided in Params and applicable to all aggregations is MaxPartitionsContributed. A partition corresponds to a key of the PCollection output by a Privacy on Beam aggregation operation, e.g. Count, SumPerKey, etc. So, MaxPartitionsContributed bounds how many distinct keys a user can contribute to in the output. If a user contributes to more than MaxPartitionsContributed keys in the underlying data, some of her contributions will be dropped so that she contributes to exactly MaxPartitionsContributed keys.
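
As an illustrative sketch (not pbeam's actual implementation), partition bounding amounts to something like this:

import "math/rand"

// boundPartitions keeps at most maxPartitions of a user's contributed
// partition keys and drops the rest. Shuffling first avoids biasing the
// kept partitions by input order.
func boundPartitions(keys []int, maxPartitions int) []int {
    if len(keys) <= maxPartitions {
        return keys
    }
    rand.Shuffle(len(keys), func(i, j int) { keys[i], keys[j] = keys[j], keys[i] })
    return keys[:maxPartitions]
}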

Similar to MaxPartitionsContributed, most aggregations have a MaxContributionsPerPartition parameter. It is provided in the Params structs, and each aggregation can have a separate value for it. As opposed to MaxPartitionsContributed, which bounds contributions across keys, MaxContributionsPerPartition bounds a user's contributions to each key: a user can contribute at most MaxContributionsPerPartition values per key.

The noise added to the output is scaled by MaxPartitionsContributed and MaxContributionsPerPartition, so there is a tradeoff here: larger values of MaxPartitionsContributed and MaxContributionsPerPartition both mean you keep more data, but you'll end up with a noisier result.

Some aggregations require MinValue and MaxValue. These specify the bounds for each user's contributions. If a user contributes a value lower than MinValue, that value is clamped up to MinValue; similarly, if a user contributes a value larger than MaxValue, that value is clamped down to MaxValue. This means that in order to keep more of the original values, you have to specify larger bounds. As with MaxPartitionsContributed and MaxContributionsPerPartition, the noise is scaled by the size of the bounds, so larger bounds mean you keep more data, but you'll end up with a noisier result.

The last parameter we'll talk about is NoiseKind. Privacy on Beam supports two different noise mechanisms: LaplaceNoise and GaussianNoise. Both have their advantages and disadvantages, but the Laplace distribution gives better utility with low contribution bounds, which is why Privacy on Beam uses it by default. However, if you wish to use Gaussian noise instead, you can set the NoiseKind field of the Params struct to a pbeam.GaussianNoise{} value.
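
For example, to switch the hourly count from earlier in this codelab to Gaussian noise, you would set NoiseKind in CountParams and keep the other parameters as before:

visitsPerHour := pbeam.Count(s, visitHours, pbeam.CountParams{
    NoiseKind:                pbeam.GaussianNoise{},
    MaxPartitionsContributed: 1,
    MaxValue:                 1,
})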

Great job! You finished the Privacy on Beam codelab and learned a lot about differential privacy and Privacy on Beam.

But, there are many more aggregations you can do with Privacy on Beam! You can learn more about them on the GitHub repository or the godoc.

If you have the time, please give us feedback about the codelab by filling out the survey.