Computing Private Statistics with Privacy on Beam

1. Introduction

You might think that aggregate statistics don't leak any information about the individuals whose data the statistics are composed of. However, there are many ways an attacker can learn sensitive information about individuals in a dataset from an aggregate statistic.

To protect individuals' privacy, you will learn how to produce private statistics using differentially private aggregations from Privacy on Beam. Privacy on Beam is a differential privacy framework that works with Apache Beam.

What do we mean by "private"?

When we use the word "private" throughout this codelab, we mean that the output is produced in a way that does not leak any private information about the individuals in the data. We can do this using differential privacy, a strong notion of anonymization. Anonymization is the process of aggregating data across multiple users to protect user privacy. All anonymization methods use aggregation, but not all aggregation methods achieve anonymization. Differential privacy, on the other hand, provides measurable guarantees regarding information leakage and privacy.

2. Differential Privacy overview

To better understand differential privacy, let us look at a simple example.

This bar chart shows the busyness of a small restaurant on one particular evening. Lots of guests come at 7pm, and the restaurant is completely empty at 1am:

a43dbf3e2c6de596.png

This looks useful!

There's a catch. When a new guest arrives, this fact is immediately revealed by the bar chart. Look on the chart: it's clear that there's a new guest, and that this guest has arrived at roughly 1am:

bda96729e700a9dd.png

This isn't great from a privacy perspective. Truly anonymized statistics shouldn't reveal individual contributions. Putting those two charts side by side makes it even more apparent: the orange bar chart has one extra guest who arrived at ~1am:

d562ddf799288894.png

Again, that's not great. What do we do?

We'll make bar charts a bit less accurate by adding random noise!

Look at the two bar charts below. While not entirely accurate, they're still useful, and they don't reveal individual contributions. Nice!

838a0293cd4fcfe3.gif

Differential privacy is adding the right amount of random noise to mask individual contributions.

Our analysis was somewhat oversimplified. Implementing differential privacy properly is more involved and has a number of quite unexpected implementation subtleties. Similar to cryptography, creating your own implementation of differential privacy might not be a great idea. You can use Privacy on Beam instead of implementing your own solution. Don't roll your own differential privacy!
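For intuition only, the idea of masking individual contributions with noise can be sketched in a few lines of Go. This is not how Privacy on Beam is implemented and it is not safe for real use (naive floating-point sampling is one of the known pitfalls). The function names, the seed and the guest counts are made up for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newLaplaceSampler returns a function that draws Laplace(0, scale) samples.
// It relies on the fact that the difference of two independent exponential
// samples follows a Laplace distribution.
func newLaplaceSampler(seed int64, scale float64) func() float64 {
	rng := rand.New(rand.NewSource(seed))
	return func() float64 {
		return scale * (rng.ExpFloat64() - rng.ExpFloat64())
	}
}

// addNoise returns a copy of counts with one noise sample added to each bar.
func addNoise(counts []int, noise func() float64) []float64 {
	out := make([]float64, len(counts))
	for i, c := range counts {
		out[i] = float64(c) + noise()
	}
	return out
}

func main() {
	// Made-up guest counts for a few consecutive hours.
	guestsPerHour := []int{0, 3, 12, 25, 14, 6, 1}
	// Assumption: each guest changes one bar by at most 1, so the noise
	// scale is 1/epsilon.
	epsilon := 1.0
	noise := newLaplaceSampler(42, 1/epsilon)
	fmt.Println(addNoise(guestsPerHour, noise))
}
```

A smaller epsilon gives a larger noise scale, so the bars hide individual guests better but drift further from the true counts.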

In this codelab, we'll show how to perform differentially private analysis using Privacy on Beam.

3. Downloading Privacy on Beam

You don't need to download Privacy on Beam to be able to follow the codelab because all the relevant code and the graphs can be found in this document. However, if you wish to download in order to play with the code, run it yourself or use Privacy on Beam later on, feel free to do so by following the steps below.

Note that this codelab is for version 1.1.0 of the library.

First, download Privacy on Beam:

https://github.com/google/differential-privacy/archive/refs/tags/v1.1.0.tar.gz

Or you can clone the GitHub repository:

git clone --branch v1.1.0 https://github.com/google/differential-privacy.git

Privacy on Beam is in the top level privacy-on-beam/ directory.

The code for this codelab and the dataset is in the privacy-on-beam/codelab/ directory.

You also need to have Bazel installed on your computer. Find the installation instructions for your operating system on the Bazel website.

4. Computing visits per hour

Imagine you are a restaurant owner and would like to share some statistics about your restaurant, such as disclosing popular visit times. Thankfully, you know about Differential Privacy and Anonymization, so you want to do this in a way that does not leak information about any individual visitor.

The code for this example is in codelab/count.go.

Let's start by loading a mock dataset containing visits to your restaurant on a particular Monday. The loading code is not interesting for the purposes of this codelab, but you can find it in codelab/main.go, codelab/utils.go and codelab/visit.go.

Visitor ID    Time entered    Time spent (mins)    Money spent (euros)
1             9:30:00 AM      26                   24
2             11:54:00 AM     53                   17
3             1:05:00 PM      81                   33

You'll first produce a non-private bar chart of visit times to your restaurant using Beam in the code sample below. Scope is a representation of the pipeline, and each new operation we do on the data gets added to the Scope. CountVisitsPerHour takes a Scope and a collection of visits, which is represented as a PCollection in Beam. It extracts the hour of each visit by applying the extractVisitHourFn function to the collection. Then it counts the occurrences of each hour and returns the result.

func CountVisitsPerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("CountVisitsPerHour")
    visitHours := beam.ParDo(s, extractVisitHourFn, col)
    visitsPerHour := stats.Count(s, visitHours)
    return visitsPerHour
}

func extractVisitHourFn(v Visit) int {
    return v.TimeEntered.Hour()
}

This produces a nice bar chart (by running bazel run codelab -- --example="count" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/count.csv --output_chart_file=$(pwd)/count.png) in the current directory as count.png:

a179766795d4e64a.png

The next step is to convert your pipeline and your bar chart into a private one. We do this as follows.

First, call MakePrivateFromStruct on a PCollection<V> to get a PrivatePCollection<V>. The input PCollection needs to be a collection of structs. MakePrivateFromStruct takes a PrivacySpec and an idFieldPath as input.

spec := pbeam.NewPrivacySpec(epsilon, delta)
pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

PrivacySpec is a struct that holds the differential privacy parameters (epsilon and delta) we want to use to anonymize the data. (You don't need to worry about them for now; an optional section later explains them in more detail.)

idFieldPath is the path of the user identifier field within the struct (Visit in our case). Here, the user identifier of the visitors is the VisitorID field of Visit.
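To make the field path concrete, here is a minimal sketch of what a Visit struct might look like and of how a string such as "VisitorID" can name one of its fields. The field names come from the snippets in this codelab, but the field types, the fieldByPath helper and the sample date are assumptions. The reflection lookup only illustrates the idea and is not a description of how Privacy on Beam resolves the path.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Visit mirrors the columns of the mock dataset. The field names appear in
// the codelab snippets; the field types are assumptions.
type Visit struct {
	VisitorID    string
	TimeEntered  time.Time
	MinutesSpent int
	MoneySpent   int
}

// fieldByPath (hypothetical helper) looks up a top-level field by name and
// returns its value formatted as a string, or false if no such field exists.
func fieldByPath(v Visit, path string) (string, bool) {
	f := reflect.ValueOf(v).FieldByName(path)
	if !f.IsValid() {
		return "", false
	}
	return fmt.Sprint(f.Interface()), true
}

func main() {
	v := Visit{
		VisitorID:    "1",
		TimeEntered:  time.Date(2021, time.March, 1, 9, 30, 0, 0, time.UTC), // arbitrary date
		MinutesSpent: 26,
		MoneySpent:   24,
	}
	id, ok := fieldByPath(v, "VisitorID")
	fmt.Println(id, ok) // prints: 1 true
}
```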

Then, we call pbeam.Count() instead of stats.Count(). pbeam.Count() takes as input a CountParams struct, which holds parameters such as MaxValue that affect the accuracy of the output.

visitsPerHour := pbeam.Count(s, visitHours, pbeam.CountParams{
    // Visitors can visit the restaurant once (one hour) a day
    MaxPartitionsContributed: 1,
    // Visitors can visit the restaurant once within an hour
    MaxValue:                 1,
})

MaxValue bounds how many times a single user can contribute to the values we are counting. In this particular case, the values we are counting are visit hours, and we expect a user to visit the restaurant only once (or we don't care if they visit it multiple times per hour), so we set this parameter to 1.

Similarly, MaxPartitionsContributed bounds how many different visit hours a user can contribute to. We expect them to visit the restaurant at most once a day (or we don't care if they visit it multiple times over the course of the day), so we set it to 1 as well. We'll talk about these parameters in more detail in an optional section.
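The effect of these two bounds can be pictured as a filtering step that runs before anything is counted. The sketch below is a simplified stand-in under stated assumptions: the visit type and the boundContributions function are hypothetical, and it keeps contributions in input order, whereas the library may choose which contributions to keep differently (for example at random).

```go
package main

import "fmt"

// visit is a simplified record: who came, and during which hour.
type visit struct {
	visitorID string
	hour      int
}

// boundContributions keeps, per visitor, visits to at most maxPartitions
// distinct hours and at most maxPerPartition visits within each kept hour.
// Visits are considered in input order; extra ones are dropped.
func boundContributions(visits []visit, maxPartitions, maxPerPartition int) []visit {
	perUser := map[string]map[int]int{} // visitor -> hour -> visits kept
	var kept []visit
	for _, v := range visits {
		hours := perUser[v.visitorID]
		if hours == nil {
			hours = map[int]int{}
			perUser[v.visitorID] = hours
		}
		n, seen := hours[v.hour]
		if !seen && len(hours) >= maxPartitions {
			continue // a new hour would exceed the partition bound
		}
		if n >= maxPerPartition {
			continue // this visitor already reached the per-hour bound
		}
		hours[v.hour] = n + 1
		kept = append(kept, v)
	}
	return kept
}

func main() {
	visits := []visit{
		{"alice", 9}, {"alice", 9}, {"alice", 13}, {"bob", 13},
	}
	// With both bounds set to 1, only alice's first visit and bob's visit remain.
	fmt.Println(boundContributions(visits, 1, 1))
}
```

Because every visitor can now change at most one bar by at most 1, a small amount of noise is enough to hide any single visitor.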

In the end, your code will look like this:

func PrivateCountVisitsPerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateCountVisitsPerHour")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, delta)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    visitHours := pbeam.ParDo(s, extractVisitHourFn, pCol)
    visitsPerHour := pbeam.Count(s, visitHours, pbeam.CountParams{
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed: 1,
        // Visitors can visit the restaurant once within an hour
        MaxValue:                 1,
    })
    return visitsPerHour
}

We see a similar bar chart (count_dp.png) for the differentially private statistic (the previous command runs both the non-private and the private pipelines):

d6a0ace1acd3c760.png

Congratulations! You calculated your first differentially private statistic!

The bar chart you get when you run the code might be different from this one. That's OK. Because of the noise in differential privacy, you'll get a different bar chart each time you run the code, but you can see that they are more or less similar to the original non-private bar chart we had.

Please note that, to preserve the privacy guarantees, it is very important not to re-run the pipeline multiple times (for example, in order to get a better-looking bar chart). The reason why you shouldn't re-run your pipelines is explained in the "Computing multiple statistics" section.

5. Using Public Partitions

In the previous section, you might have noticed that we dropped all visits (data) for some partitions, i.e. hours.

d7fbc5d86d91e54a.png

This is due to partition selection/thresholding, an important step to ensure differential privacy guarantees when the existence of output partitions depends on the user data itself. When this is the case, the mere existence of a partition in the output can leak the existence of an individual user in the data (see this blog post for an explanation of why this violates privacy). To prevent this, Privacy on Beam only keeps partitions that have a sufficient number of users in them.
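As a rough mental model, thresholding can be sketched as "add noise to the number of distinct users in a partition and keep the partition only if the result is large enough". The selectPartitions function, the threshold of 5 and the noise scale below are made up for illustration, and the selection strategy the library actually uses may differ.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// selectPartitions returns, in ascending order, the partitions whose noisy
// number of distinct users reaches the threshold.
func selectPartitions(usersPerPartition map[int]int, threshold float64, noise func() float64) []int {
	var kept []int
	for p, users := range usersPerPartition {
		if float64(users)+noise() >= threshold {
			kept = append(kept, p)
		}
	}
	sort.Ints(kept)
	return kept
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Laplace noise with scale 1, built from two exponential samples.
	noise := func() float64 { return rng.ExpFloat64() - rng.ExpFloat64() }
	// Made-up numbers of distinct visitors per hour.
	usersPerHour := map[int]int{9: 1, 12: 30, 16: 2, 19: 25}
	fmt.Println(selectPartitions(usersPerHour, 5, noise))
}
```

Hours with only one or two visitors are very likely to fall below the threshold, which is why sparsely visited hours can disappear from the private chart.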

When the list of output partitions does not depend on private user data, i.e. they are public information, we don't need this partition selection step. This is actually the case for our restaurant example: we know the restaurant's work hours (9.00 to 21.00).

The code for this example is in codelab/public_partitions.go.

We'll simply create a PCollection of the hours from 9 up to, but not including, 21 and pass it in the PublicPartitions field of CountParams:

func PrivateCountVisitsPerHourWithPublicPartitions(s beam.Scope,
    col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateCountVisitsPerHourWithPublicPartitions")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, /* delta */ 0)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    // Create a PCollection of output partitions, i.e. restaurant's work hours
    // (from 9 am till 9pm (exclusive)).
    hours := beam.CreateList(s, [12]int{9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20})

    visitHours := pbeam.ParDo(s, extractVisitHourFn, pCol)
    visitsPerHour := pbeam.Count(s, visitHours, pbeam.CountParams{
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed: 1,
        // Visitors can visit the restaurant once within an hour
        MaxValue:                 1,
        // Visitors only visit during work hours
        PublicPartitions:         hours,
    })
    return visitsPerHour
}

Note that it is possible to set delta to 0 if you are using public partitions and Laplace Noise (default), as is the case above.

When we run the pipeline with public partitions (with bazel run codelab -- --example="public_partitions" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/public_partitions.csv --output_chart_file=$(pwd)/public_partitions.png), we get (public_partitions_dp.png):

7c950fbe99fec60a.png

As you can see, we now keep the partitions 9, 10 and 16 we previously dropped without public partitions.

Using public partitions not only lets you keep more partitions, it also adds roughly half as much noise to each partition, because no privacy budget (i.e. epsilon and delta) is spent on partition selection. That is why the difference between raw and private counts is slightly smaller than in the previous run.

There are two important things to keep in mind when using public partitions:

  1. Be careful when deriving the list of partitions from raw data: if you don't do this in a differentially private way, e.g. by simply reading the list of all partitions in the user data, your pipeline no longer provides differential privacy guarantees. See the advanced section below on how to do this in a differentially private way.
  2. If there is no data (e.g. visits) for some of the public partitions, noise will be applied to those partitions to preserve differential privacy. For example, if we used hours between 0 and 24 (instead of 9 and 21), all of the hours would be noised and might show some visits when there are none.
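The second point can be illustrated with a small sketch: the output contains exactly the public partitions, and a partition without data starts from a raw count of 0 before noise is added. The countWithPublicPartitions function and the numbers are hypothetical, and dropping data that falls outside the public list is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// countWithPublicPartitions emits exactly one noisy count per public
// partition. Partitions without data start from 0, and data outside the
// public list is ignored.
func countWithPublicPartitions(raw map[int]int, public []int, noise func() float64) map[int]float64 {
	out := make(map[int]float64, len(public))
	for _, p := range public {
		// raw[p] is 0 when there is no data for p.
		out[p] = float64(raw[p]) + noise()
	}
	return out
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Laplace noise with scale 1, built from two exponential samples.
	noise := func() float64 { return rng.ExpFloat64() - rng.ExpFloat64() }
	// Made-up raw counts; hour 23 is outside the public list below.
	raw := map[int]int{10: 5, 12: 30, 23: 2}
	public := []int{9, 10, 11, 12}
	fmt.Println(countWithPublicPartitions(raw, public, noise))
}
```

Hours 9 and 11 have no visits, yet they still appear in the output with a noisy value, which may be positive or negative.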

(Advanced) Deriving Partitions from Data

If you are running multiple aggregations with the same list of non-public output partitions in the same pipeline, you can derive the list of partitions once using SelectPartitions() and supply the partitions to each aggregation as the PublicPartitions input. Not only is this safe from a privacy perspective, it also lets you add less noise, because privacy budget is spent on partition selection only once for the entire pipeline.

6. Computing the average length of stay

Now that we know how to count stuff in a differentially private way, let us look into calculating means. More specifically, we will now compute the average length of stay of visitors.

The code for this example is in codelab/mean.go.

Normally, to calculate a non-private mean of stay durations, we would use stats.MeanPerKey() with a pre-processing step that converts the incoming PCollection of visits to a PCollection<K,V> where K is the visit hour and V is the time the visitor spent in the restaurant.

func MeanTimeSpent(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("MeanTimeSpent")
    hourToTimeSpent := beam.ParDo(s, extractVisitHourAndTimeSpentFn, col)
    meanTimeSpent := stats.MeanPerKey(s, hourToTimeSpent)
    return meanTimeSpent
}

func extractVisitHourAndTimeSpentFn(v Visit) (int, int) {
    return v.TimeEntered.Hour(), v.MinutesSpent
}

This produces a nice bar chart (by running bazel run codelab -- --example="mean" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/mean.csv --output_chart_file=$(pwd)/mean.png) in the current directory as mean.png:

bc2df28bf94b3721.png

To make this differentially private, we again convert our PCollection to a PrivatePCollection and replace stats.MeanPerKey() with pbeam.MeanPerKey(). Similar to Count, we have MeanParams that hold some parameters such as MinValue and MaxValue that affect the accuracy. MinValue and MaxValue represent the bounds we have for each user's contribution to each key.

meanTimeSpent := pbeam.MeanPerKey(s, hourToTimeSpent, pbeam.MeanParams{
    // Visitors can visit the restaurant once (one hour) a day
    MaxPartitionsContributed:     1,
    // Visitors can visit the restaurant once within an hour
    MaxContributionsPerPartition: 1,
    // Minimum time spent per user (in mins)
    MinValue:                     0,
    // Maximum time spent per user (in mins)
    MaxValue:                     60,
})

In this case, each key represents an hour and values are the time visitors spent. We set MinValue to 0 because we don't expect visitors to spend less than 0 minutes in the restaurant. We set MaxValue to 60, which means if a visitor spends more than 60 minutes, we act as if that user spent 60 minutes.
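The clamping described above can be sketched without any noise. The clamp and boundedMean helpers are hypothetical, and the three durations are the sample rows from the table in section 4, treated as if they fell in the same hour purely for illustration.

```go
package main

import "fmt"

// clamp forces v into the interval [lo, hi].
func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// boundedMean clamps every value into [lo, hi] and averages the results.
// It returns 0 for an empty input.
func boundedMean(values []float64, lo, hi float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += clamp(v, lo, hi)
	}
	return sum / float64(len(values))
}

func main() {
	minutes := []float64{26, 53, 81}
	// Bounds wide enough that nothing is clamped.
	fmt.Printf("raw mean: %.2f\n", boundedMean(minutes, 0, 1000)) // raw mean: 53.33
	// With MaxValue 60, the 81-minute stay counts as 60 minutes.
	fmt.Printf("bounded mean: %.2f\n", boundedMean(minutes, 0, 60)) // bounded mean: 46.33
}
```

Tighter bounds distort long stays more, but they also limit how much any single visitor can move the mean, so less noise is needed.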

In the end, your code will look like this:

func PrivateMeanTimeSpent(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateMeanTimeSpent")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, /* delta */ 0)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    // Create a PCollection of output partitions, i.e. restaurant's work hours
    // (from 9 am till 9pm (exclusive)).
    hours := beam.CreateList(s, [12]int{9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20})

    hourToTimeSpent := pbeam.ParDo(s, extractVisitHourAndTimeSpentFn, pCol)
    meanTimeSpent := pbeam.MeanPerKey(s, hourToTimeSpent, pbeam.MeanParams{
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed:     1,
        // Visitors can visit the restaurant once within an hour
        MaxContributionsPerPartition: 1,
        // Minimum time spent per user (in mins)
        MinValue:                     0,
        // Maximum time spent per user (in mins)
        MaxValue:                     60,
        // Visitors only visit during work hours
        PublicPartitions:             hours,
    })
    return meanTimeSpent
}

We see a similar bar chart (mean_dp.png) for the differentially private statistic (the previous command runs both the non-private and the private pipelines):

e8ac6a9bf9792287.png

Again, similar to count, since this is a differentially private operation, we'll get different results each time we run it. But you can see the differentially private lengths of stay are not far off from the actual result.

7. Computing revenue per hour

Another interesting statistic we could look at is revenue per hour over the course of the day.

The code for this example is in codelab/sum.go.

Again, we'll start with the non-private version. With some pre-processing on our mock dataset, we can create a PCollection<K,V> where K is the visit hour and V is the money the visitor spent in the restaurant. To calculate a non-private revenue per hour, we can simply sum all the money visitors spent by calling stats.SumPerKey():

func RevenuePerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("RevenuePerHour")
    hourToMoneySpent := beam.ParDo(s, extractVisitHourAndMoneySpentFn, col)
    revenues := stats.SumPerKey(s, hourToMoneySpent)
    return revenues
}

func extractVisitHourAndMoneySpentFn(v Visit) (int, int) {
    return v.TimeEntered.Hour(), v.MoneySpent
}

This produces a nice bar chart (by running bazel run codelab -- --example="sum" --input_file=$(pwd)/day_data.csv --output_stats_file=$(pwd)/sum.csv --output_chart_file=$(pwd)/sum.png) in the current directory as sum.png:

548619173fad0c9a.png

To make this differentially private, we again convert our PCollection to a PrivatePCollection and replace stats.SumPerKey() with pbeam.SumPerKey(). Similar to Count and MeanPerKey, we have SumParams that hold some parameters such as MinValue and MaxValue that affect the accuracy.

revenues := pbeam.SumPerKey(s, hourToMoneySpent, pbeam.SumParams{
    // Visitors can visit the restaurant once (one hour) a day
    MaxPartitionsContributed: 1,
    // Minimum money spent per user (in euros)
    MinValue:                 0,
    // Maximum money spent per user (in euros)
    MaxValue:                 40,
})

In this case, MinValue and MaxValue represent the bounds we have for the money each visitor spends. We set MinValue to 0 because we don't expect visitors to spend less than 0 euros in the restaurant. We set MaxValue to 40, which means if a visitor spends more than 40 euros, we act as if that user spent 40 euros.

In the end, the code will look like this:

func PrivateRevenuePerHour(s beam.Scope, col beam.PCollection) beam.PCollection {
    s = s.Scope("PrivateRevenuePerHour")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    spec := pbeam.NewPrivacySpec(epsilon, /* delta */ 0)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    // Create a PCollection of output partitions, i.e. restaurant's work hours
    // (from 9 am till 9pm (exclusive)).
    hours := beam.CreateList(s, [12]int{9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20})

    hourToMoneySpent := pbeam.ParDo(s, extractVisitHourAndMoneySpentFn, pCol)
    revenues := pbeam.SumPerKey(s, hourToMoneySpent, pbeam.SumParams{
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed: 1,
        // Minimum money spent per user (in euros)
        MinValue:                 0,
        // Maximum money spent per user (in euros)
        MaxValue:                 40,
        // Visitors only visit during work hours
        PublicPartitions:         hours,
    })
    return revenues
}

We see a similar bar chart (sum_dp.png) for the differentially private statistic (the previous command runs both the non-private and the private pipelines):

46c375e874f3e7c4.png

Again, similar to count and mean, since this is a differentially private operation, we'll get different results each time we run it. But you can see the differentially private result is very close to the actual revenues per hour.

8. Computing multiple statistics

Most of the time, you might be interested in computing multiple statistics over the same underlying data, similar to what you have done with count, mean and sum. It is usually cleaner and easier to do this in a single Beam pipeline and in a single binary. You can do this with Privacy on Beam as well. You can write a single pipeline to run your transformations and computations and use a single PrivacySpec for the whole pipeline.

Not only is it more convenient to do this with a single PrivacySpec, it is also better in terms of privacy. Recall the epsilon and delta parameters we supply to the PrivacySpec: they represent something called a privacy budget, which is a measure of how much of the privacy of the users in the underlying data you are leaking.

An important thing to remember about the privacy budget is that it is additive: if you run a pipeline with a particular epsilon ε and delta δ a single time, you are spending an (ε,δ) budget. If you run it a second time, you'll have spent a total budget of (2ε, 2δ). Similarly, if you compute two statistics, each with its own PrivacySpec (and consequently its own privacy budget) of (ε,δ), you'll have spent a total budget of (2ε, 2δ). This means that you are degrading the privacy guarantees.

To avoid this, when you want to compute multiple statistics over the same underlying data, you should use a single PrivacySpec with the total budget you want to use. You then need to specify the epsilon and delta you want to use up for each aggregation. In the end, you will end up with the same overall privacy guarantee, but the higher the epsilon and delta of a particular aggregation, the higher its accuracy.
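The bookkeeping behind a shared budget can be sketched as a simple counter. The budget type and its consume method are hypothetical and track only epsilon to keep the example short. They illustrate the additive accounting described above, not the library's internals.

```go
package main

import (
	"errors"
	"fmt"
)

// budget tracks how much epsilon is still available for aggregations.
type budget struct {
	remaining float64
}

var errBudgetExceeded = errors.New("privacy budget exceeded")

// consume reserves eps for one aggregation, or fails if too little is left.
func (b *budget) consume(eps float64) error {
	const tolerance = 1e-9 // absorbs floating-point rounding when splitting
	if eps > b.remaining+tolerance {
		return errBudgetExceeded
	}
	b.remaining -= eps
	return nil
}

func main() {
	total := 1.0
	b := &budget{remaining: total}
	// Split the total budget equally between three aggregations.
	for _, name := range []string{"count", "mean", "sum"} {
		fmt.Println(name, b.consume(total/3))
	}
	// A fourth aggregation no longer fits into the budget.
	fmt.Println("extra", b.consume(total/3))
}
```

The split does not have to be equal: giving one aggregation a larger share makes it more accurate at the expense of the others, while the total stays the same.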

To see this in action, we can compute the three statistics (count, mean and sum) we computed separately before in a single pipeline.

The code for this example is in codelab/multiple.go. Notice how we are splitting the total (ε,δ) budget equally between the three aggregations:

func ComputeCountMeanSum(s beam.Scope, col beam.PCollection) (visitsPerHour, meanTimeSpent, revenues beam.PCollection) {
    s = s.Scope("ComputeCountMeanSum")
    // Create a Privacy Spec and convert col into a PrivatePCollection
    // Budget is shared by count, mean and sum.
    spec := pbeam.NewPrivacySpec(epsilon, /* delta */ 0)
    pCol := pbeam.MakePrivateFromStruct(s, col, spec, "VisitorID")

    // Create a PCollection of output partitions, i.e. restaurant's work hours
    // (from 9 am till 9pm (exclusive)).
    hours := beam.CreateList(s, [12]int{9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20})

    visitHours := pbeam.ParDo(s, extractVisitHourFn, pCol)
    visitsPerHour = pbeam.Count(s, visitHours, pbeam.CountParams{
        Epsilon:                  epsilon / 3,
        Delta:                    0,
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed: 1,
        // Visitors can visit the restaurant once within an hour
        MaxValue:                 1,
        // Visitors only visit during work hours
        PublicPartitions:         hours,
    })

    hourToTimeSpent := pbeam.ParDo(s, extractVisitHourAndTimeSpentFn, pCol)
    meanTimeSpent = pbeam.MeanPerKey(s, hourToTimeSpent, pbeam.MeanParams{
        Epsilon:                      epsilon / 3,
        Delta:                        0,
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed:     1,
        // Visitors can visit the restaurant once within an hour
        MaxContributionsPerPartition: 1,
        // Minimum time spent per user (in mins)
        MinValue:                     0,
        // Maximum time spent per user (in mins)
        MaxValue:                     60,
        // Visitors only visit during work hours
        PublicPartitions:             hours,
    })

    hourToMoneySpent := pbeam.ParDo(s, extractVisitHourAndMoneySpentFn, pCol)
    revenues = pbeam.SumPerKey(s, hourToMoneySpent, pbeam.SumParams{
        Epsilon:                  epsilon / 3,
        Delta:                    0,
        // Visitors can visit the restaurant once (one hour) a day
        MaxPartitionsContributed: 1,
        // Minimum money spent per user (in euros)
        MinValue:                 0,
        // Maximum money spent per user (in euros)
        MaxValue:                 40,
        // Visitors only visit during work hours
        PublicPartitions:         hours,
    })

    return visitsPerHour, meanTimeSpent, revenues
}

9. (Optional) Tweaking the Differential Privacy parameters

You have seen quite a few parameters mentioned in this codelab: epsilon, delta, MaxPartitionsContributed, etc. We can roughly divide them into two categories: Privacy Parameters and Utility Parameters.

Privacy Parameters

Epsilon and delta are the parameters that quantify the privacy we are providing by using differential privacy. More precisely, epsilon and delta are a measure of how much information a potential attacker gains about the underlying data by looking at the anonymized output. The higher epsilon and delta are, the more information the attacker gains about the underlying data, which is a privacy risk.

On the other hand, the lower epsilon and delta are, the more noise you need to add to the output to make it anonymous, and the higher the number of unique users you need to have in each partition to keep that partition in the anonymized output. So, there is a tradeoff between utility and privacy here.

In Privacy on Beam, you decide on the privacy guarantees you want for your anonymized output when you specify the total privacy budget in the PrivacySpec. The caveat is that, for these guarantees to hold, you need to follow the advice in this codelab: do not overspend your budget by using a separate PrivacySpec for each aggregation or by running the pipeline multiple times.

For more information about Differential Privacy and what the privacy parameters mean, you can take a look at the literature.

Utility Parameters

These are parameters that don't affect the privacy guarantees (as long as advice on how to use Privacy on Beam is properly followed) but affect the accuracy, and consequently the utility of the output. They are provided in the Params structs of each aggregation, e.g. CountParams, SumParams, etc. These parameters are used to scale the noise being added.

A utility parameter provided in Params and applicable to all aggregations is MaxPartitionsContributed. A partition corresponds to a key of the PCollection output by a Privacy on Beam aggregation operation, i.e. Count, SumPerKey, etc. So, MaxPartitionsContributed bounds how many distinct key values a user can contribute to in the output. If a user contributes to more than MaxPartitionsContributed keys in the underlying data, some of their contributions will be dropped so that they contribute to exactly MaxPartitionsContributed keys.

Similar to MaxPartitionsContributed, most aggregations have a MaxContributionsPerPartition parameter. It is provided in the Params structs, and each aggregation can have a separate value for it. As opposed to MaxPartitionsContributed, MaxContributionsPerPartition bounds a user's contribution for each key. In other words, a user can contribute only MaxContributionsPerPartition values for each key.

The noise added to the output is scaled by MaxPartitionsContributed and MaxContributionsPerPartition, so there is a tradeoff here: larger values of MaxPartitionsContributed and MaxContributionsPerPartition both mean you keep more data, but you'll end up with a noisier result.

Some aggregations require MinValue and MaxValue. These specify the bounds for contributions of each user. If a user contributes a value lower than MinValue, that value is going to be clamped up to MinValue. Similarly, if a user contributes a value larger than MaxValue, that value is going to be clamped down to MaxValue. This means that in order to keep more of the original values, you have to specify larger bounds. Similar to MaxPartitionsContributed and MaxContributionsPerPartition, noise is scaled by the size of the bounds, so larger bounds mean you keep more data, but you'll end up with a noisier result.
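To see how the bounds translate into noise, the sketch below uses the textbook Laplace mechanism, where the noise scale is the largest total change one user can cause divided by epsilon. The countScale and sumScale functions are hypothetical, and the exact calibration inside Privacy on Beam may differ from this simplified formula.

```go
package main

import (
	"fmt"
	"math"
)

// countScale returns the Laplace scale for a count in which one user can
// touch maxPartitions keys and add at most maxPerPartition to each of them.
func countScale(maxPartitions, maxPerPartition int, epsilon float64) float64 {
	return float64(maxPartitions*maxPerPartition) / epsilon
}

// sumScale returns the Laplace scale for a sum in which one user can touch
// maxPartitions keys and shift each of them by a value in [minValue, maxValue].
func sumScale(maxPartitions int, minValue, maxValue, epsilon float64) float64 {
	bound := math.Max(math.Abs(minValue), math.Abs(maxValue))
	return float64(maxPartitions) * bound / epsilon
}

func main() {
	epsilon := 1.0 // example value, not a recommendation
	fmt.Println("count, bounds 1 and 1:", countScale(1, 1, epsilon))
	fmt.Println("count, bounds 3 and 2:", countScale(3, 2, epsilon))
	fmt.Println("sum, values in [0, 40]:", sumScale(1, 0, 40, epsilon))
	fmt.Println("sum, values in [0, 80]:", sumScale(1, 0, 80, epsilon))
}
```

Doubling a bound doubles the scale, and so does halving epsilon, which is the tradeoff between keeping more data and adding more noise.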

The last parameter we'll talk about is NoiseKind. Privacy on Beam supports two noise mechanisms: GaussianNoise and LaplaceNoise. Both have their advantages and disadvantages, but the Laplace distribution gives better utility with low contribution bounds, which is why Privacy on Beam uses it by default. However, if you wish to use Gaussian noise, you can set NoiseKind in the Params struct to pbeam.GaussianNoise{}.
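One way to see why low contribution bounds favor Laplace noise is to compare how the two textbook mechanisms grow with the number of partitions a user may touch: Laplace noise follows the L1 sensitivity, while Gaussian noise follows the L2 sensitivity. The helper names are hypothetical. Gaussian noise also carries a factor that depends on epsilon and delta, so this sketch compares growth only, not absolute noise levels.

```go
package main

import (
	"fmt"
	"math"
)

// l1Sensitivity is the largest total change (sum of absolute changes) one
// user can cause across maxPartitions keys, each bounded by perPartitionBound.
func l1Sensitivity(maxPartitions int, perPartitionBound float64) float64 {
	return float64(maxPartitions) * perPartitionBound
}

// l2Sensitivity is the largest Euclidean change for the same user.
func l2Sensitivity(maxPartitions int, perPartitionBound float64) float64 {
	return math.Sqrt(float64(maxPartitions)) * perPartitionBound
}

func main() {
	for _, k := range []int{1, 4, 100} {
		fmt.Printf("k=%d  L1=%.0f  L2=%.0f\n", k, l1Sensitivity(k, 1), l2Sensitivity(k, 1))
	}
}
```

With a single partition per user the two sensitivities coincide, so the extra factor of Gaussian noise buys nothing. With many partitions per user the slower square-root growth can outweigh that factor.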

10. Summary

Great job! You finished the Privacy on Beam codelab. You learned a lot about differential privacy and Privacy on Beam:

  • Turning your PCollection into a PrivatePCollection by calling MakePrivateFromStruct.
  • Using Count to compute differentially private counts.
  • Using MeanPerKey to compute differentially private means.
  • Using SumPerKey to compute differentially private sums.
  • Computing multiple statistics with a single PrivacySpec in a single pipeline.
  • (Optional) Customizing the PrivacySpec and aggregation parameters (CountParams, MeanParams, SumParams).

But there are many more aggregations (e.g. quantiles, counting distinct values) you can do with Privacy on Beam! You can learn more about them on the GitHub repository or the godoc.

If you have the time, please give us feedback about the codelab by filling out a survey.