
Interoperable Private Attribution (IPA) #9

Open
eriktaubeneck opened this issue Jan 6, 2022 · 64 comments

@eriktaubeneck
Collaborator

@benjaminsavage, @martinthomson, and I have been working on a proposal, "Interoperable Private Attribution (IPA)" that addresses the aggregate attribution measurement use case, similar to those listed in #8.

We'd love to have this considered and discussed at the January PATCG meeting, for consideration in maturing it further through collaboration among this community group.

@tgeoghegan

Section 4.1 states that the "aim is for IPA to be compatible with the Privacy Preserving Measurement (PPM) specification". Does that mean you intend to express IPA as a VDAF?

@eriktaubeneck
Collaborator Author

eriktaubeneck commented Jan 6, 2022

Does that mean you intend to express IPA as a VDAF?

Yes, our intention is to work towards that. There are a few major components, some of which likely require expression as a new VDAF, but we hope to leverage the existing work with prio3 and/or poplar1 where possible.

@alextcone

Consider me a strong +1 in support of this getting on the agenda.

@ekr

ekr commented Jan 25, 2022

In case it helps, I spent a bunch of time working through the math of IPA in some detail at: https://educatedguesswork.org/posts/vaccine-tracking/. I was interested in another application, but if you found the ElGamal blinding and shuffling a bit hard, this might help.

@ansuz

ansuz commented Feb 11, 2022

@eriktaubeneck It seems like permissions are required to view the draft on google docs. Is a publicly accessible version of the draft available anywhere else?

@eriktaubeneck
Collaborator Author

@ansuz document is back up, though it is now read-only.

@gsnedders

While the proposal mentions:

We would also like to call out the work happening in the WICG on the Attribution Reporting API and Privacy Preserving Ads, which was highly influential to this work.

It would be nice for the proposal to directly compare itself with such prior work (and also the Privacy CG's Private Click Measurement). As it stands, it is not immediately clear what the motivations are for this proposal rather than furthering work on the other proposals.

@ShivanKaul

Is there a reason all the questions on the doc were removed + ability to comment revoked?

@Lexicality

Presumably because the internet is currently extremely angry about this?

@benjaminsavage

Is there a reason all the questions on the doc were removed + ability to comment revoked?

The document was completely defaced, fully deleted with "suggestions" and replaced with vulgarities. As such, the document is now "read-only" access.

@ShivanKaul

Ah, I see, sorry to hear that. Is the plan to move it to GitHub? Last time I read it there were a few undefined terms and the flow was not entirely clear to me, would be good to get clarifying answers.

@bmayd

bmayd commented Feb 13, 2022

The document was completely defaced, fully deleted with "suggestions" and replaced with vulgarities. As such, the document is now "read-only" access.

@benjaminsavage Does this lead you to consider docs unsuitable as a collaboration tool for this work or do you think it can be avoided going forward? I don't want to continue advocating for their use if the latter is not the case.

@Med1cinal

Med1cinal commented Feb 13, 2022

Does this lead you to consider docs unsuitable as a collaboration tool for this work or do you think it can be avoided going forward? I don't want to continue advocating for their use if the latter is not the case.

this might be a solution

@santirely

Having looked at only the non-technical presentation, I have a couple of comments / questions.

  1. Where would the match keys be stored in the device / browser / OS?
  2. Could we adapt this solution so that it isn't so dependent on companies with a "large footprint"? Although as proposed this does appear to be the most logical solution, it also poses a huge limitation in my view. What happens if suddenly Facebook or Google decide they'll only share their match keys with companies that play well with them?
@eriktaubeneck
Collaborator Author

@medicinalcocaine3434 the document was using suggested edits and comments, it was with suggested edits that the document was defaced.

@santirely if you take a look at the technical proposal, you'll find details on your question. Briefly:

Where would the match keys be stored in the device / browser / OS?

We are proposing a new read-only API, which the browser/OS would expose.

Could we adapt this solution so that it isn't so dependent on companies with a "large footprint"?

Any website/app is able to write a match key, so it's not dependent on any set of companies. However, the more cross-device coverage a given company's match key has, the more accurate attribution that uses that match key will be.

What happens if suddenly Facebook or Google decide they'll only share their match keys with companies that play well with them?

We are proposing that any site/app be able to reference any match key. Match keys are not shared, and the ability to reference them is not controlled by the companies that set them.

@santirely

Ok, that's interesting. A couple follow ups:

Any website/app is able to write a match key, so it's not dependent on any set of companies. However, the more cross-device coverage a given company's match key has, the more accurate attribution that uses that match key will be.

This is true, but a significant portion of the value created by the proposal is cross-device tracking, and these companies adopting the solution would be important for that to actually work. I was aiming at something like: what if other smaller apps / publishers could pool their match keys in a way that benefits everyone? For example, a gaming studio like, say, Epic will have tons of mobile and CTV match keys but almost no web-based ones. The opposite is true for someone like The New York Times. Could they create a sort of co-op there?

What happens if suddenly Facebook or Google decide they'll only share their match keys with companies that play well with them?

This seems ideal, but isn't that a potential drawback for someone that can provide cross-device attribution by itself? What's Meta's or Google's incentive to be the world's match key providers there?

@Lexicality

The proposal also seems to assume that the user is logged in to at least one match key provider. What happens if the user is not? Does the browser make up a device identifier? Does it encrypt null? (Presumably collisions could abound there.) Or does it return an error to the calling script?
If a user doesn't want to sign in to Facebook (or uses Facebook Container) does that mean they will never be able to be attributed to a conversion?

@Lexicality

To give a specific example, say I'm a new user. I install Firefox for the first time, see a full screen advert for Pocket, go "wow, this looks great", and decide to sign up for the premium service. Since my browser is in a completely fresh state I won't have any match keys set up at all. How is Mozilla going to know if buying Pocket was a good idea or not?

@eriktaubeneck
Collaborator Author

This seems ideal, but isn't that a potential drawback for someone that can provide cross-device attribution by itself? What's Meta's or Google's incentive to be the world's match key providers there?

In the absence of 3rd party cookies, cross-site (including cross-device) attribution won't be possible "by itself". It will require a new purpose-constrained API. All companies' incentive to participate will be to power their own attribution (with the side effect of enabling all attribution).

@eriktaubeneck
Collaborator Author

The proposal also seems to assume that the user is logged in to at least one match key provider. What happens if the user is not? Does the browser make up a device identifier?

This is yet to be determined. One idea is that the device could generate a random match key, which would at least default to "same device attribution".

If a user doesn't want to sign in to Facebook (or uses Facebook Container) does that mean they will never be able to be attributed to a conversion?

This would entirely depend on which match key providers the source sites and trigger sites are using. If those sites are using match key providers that the user is not logged into, then the attribution would likely be missed. 100% coverage has never been possible, but this API is designed to create as much coverage as possible, without enabling user level tracking.
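
To make the fallback idea above concrete, here is a minimal Python sketch of how a browser might resolve a match key. The function names are hypothetical and it assumes a 64-bit key space; the real API would only ever release an encrypted value, never the raw key.

import secrets

_match_keys: dict[str, int] = {}  # provider origin -> match key
_device_fallback = None

def set_match_key(provider: str, key: int) -> None:
    # Any website/app may write a match key for its own origin.
    _match_keys[provider] = key

def resolve_match_key(provider: str) -> int:
    # If the user is unknown to the requested provider, fall back to a
    # random per-device key, which degrades gracefully to
    # same-device-only attribution.
    global _device_fallback
    if provider in _match_keys:
        return _match_keys[provider]
    if _device_fallback is None:
        _device_fallback = secrets.randbits(64)
    return _device_fallback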

@santirely

In the absence of 3rd party cookies, cross-site (including cross-device) attribution won't be possible "by itself". It will require a new purpose-constrained API. All companies' incentive to participate will be to power their own attribution (with the side effect of enabling all attribution).

So what you're saying is that Meta for example would be OK sharing their "match keys" with say Twitter (since Meta's reach is significantly higher, why would Twitter use its own?) because that way the advertiser can also use the match key (and Meta wouldn't be able to match its conversions with the advertiser otherwise). And since they can't decide to share only with the advertiser, they'd be fine with sharing with everyone else.

Seems pretty far-fetched to be honest. Also, isn't Meta's or Google's reach enough that they can match against advertiser's first party logged in data and get even better results?

@Lexicality

Lexicality commented Feb 14, 2022

So what you're saying is that Meta for example would be OK sharing their "match keys" with say Twitter (since Meta's reach is significantly higher, why would Twitter use its own?)

If Twitter wants to use Facebook's match keys, that means all the users they show adverts to also need to be logged in to Facebook. This means Twitter needs to incentivise its users to log in to Facebook which directly benefits Facebook.

That's the entire reason Facebook has come up with this proposal - in order for it to work the vast majority of the internet needs to have been identified in some fashion by Facebook, so everyone that uses it will pass their users via Facebook and let them slurp up their data.

You can say "well it works with any provider" but to work effectively it needs a major provider and Google are off doing their own thing so ...

@bmilekic

If Twitter wants to use Facebook's match keys, that means all the users they show adverts to also need to be logged in to Facebook. This means Twitter needs to incentivise its users to log in to Facebook which directly benefits Facebook.

I don't see how that's true. If Twitter is the publisher, then it can ask its advertisers to register trigger events referencing only Twitter's match keys. There is no need for the Facebook match keys in that scenario, especially since the ads are running on Twitter and so the lack of a Twitter user ID implies no possible match with advertiser target events.

For non-FB smaller publishers, the proposal provides an important theoretical benefit, in that a publisher can choose to register source events leveraging facebook, twitter, and other third-party keys. In an IPA implementation supporting multiple match keys, this ultimately benefits the publisher as it increases potential match rates with advertiser data.

It remains to be seen what the motivation could be for a match key provider to act as such, but presumably "having an advertising business" would be one motivating reason. I believe that making the match keys usable by other parties in that context is more fair than doing the opposite.

@Lexicality

I don't see how that's true. If Twitter is the publisher, then it can ask its advertisers to register trigger events referencing only Twitter's match keys. There is no need for the Facebook match keys in that scenario, especially since the ads are running on Twitter and so the lack of a Twitter user ID implies no possible match with advertiser target events.

I agree it doesn't make much sense, but in the hypothetical that they did want to only use someone else's match key (for whatever reason) then I think my point still stands

It remains to be seen what the motivation could be for a match key provider to act as such, but presumably "having an advertising business" would be one motivating reason. I believe that making the match keys usable by other parties in that context is more fair than doing the opposite.

I don't think fairness comes into many business decisions. It costs Facebook nothing to allow its competitors to use its match keys, and if everyone relies on them they gain a position of power over the discourse, even if it's just an implicit one.

To be clear, just because I feel like this proposal further entrenches the big players in the ad business by relying on centralised identity services doesn't mean I think it's a bad proposal. As long as the cryptographic stuff works and the ad networks are somehow coerced into dropping their other tracking methods, this is a big step up. But on the other hand if the crypto stuff has a hidden weakness in it and Facebook run one of the "trusted" servers, this is a terrible idea.

@benjaminsavage

A few comments in response to the thread so far:

  1. On the topic of "what if no match keys are set"

As @eriktaubeneck said - I like the idea of the device just generating a random matchkey. That way the API seamlessly defaults to "same-device-only" attribution, which is at least on par with other proposals.

  2. On the topic of "Who can use a match key once it is set?"

The reason we proposed allowing any company to benefit from match keys set by any other participant, was specifically to try to avoid any kind of system which could be abused by large established players. As @santirely mentions, this would give them a lot of leverage to, as he says, choose to only share access with businesses who "play well with them". We opted for an "open reference" proposal specifically to avoid this type of risk.

  3. On the topic of: "What incentive would a company with a large footprint have for setting an open-reference match key?"

As @eriktaubeneck points out, browsers and mobile operating systems are rapidly clamping down on "tracking". Various regulations are doing the same. This means that all businesses (even those with a large footprint) are steadily losing the ability to accurately count the number of conversions attributable to advertising. In a theoretical future world where cookies and device identifiers are all gone, and fingerprinting is impossible, having a "large footprint" will be useless from the perspective of counting conversions which occur off-network on other apps and websites. In such a world, if the only option available for counting conversions is a highly private one, like IPA, then I believe businesses who sell ads will use it (they won't have a choice). In that world, they'll have two options:
(i) Do not set a match key. Use a match key set by some other entity
(ii) Set a match key - accepting that anyone else who wants to can also use it.

Each entity will have to weigh these alternatives. For a business with a "large footprint" of users who sign in across multiple devices, here is how I think these choices will look:
(i) Do not set a match key: If other match-keys are from businesses with a smaller network of users logged-in across devices, taking this approach will have the undesirable side-effect of undercounting the true number of conversions their ads actually drive. In summary: Less accurate measurement.
(ii) Set a match key: This will result in more accurate ads measurement - with higher counts of attributed conversions, which more accurately measures the number of conversions their ads drive. As a side-effect however, all competitors will also benefit from more accurate measurement of their ads. In summary: More accurate, but more accurate for everyone.

I posit that there exist businesses for whom the calculus is in favor of option (ii), more accurate measurement being more beneficial than everyone having less accurate measurement.

  4. On the topic of "does this require users to be logged into Facebook?". In the proposal, we talk about the prospect of supporting multiple match keys. We think we can support this without needing to give up any privacy benefits. If that is true, then it would seem optimal for any consumer of this API to select a basket of match-keys which collectively provide good coverage. This has the additional benefit of minimizing the reliance on a single point of failure. I can envision a future where it is common to specify a handful of "large footprint" match key providers to get a good baseline, a few region specific ones to cover parts of the globe which would otherwise be poorly covered, potentially one's own match-key, and finally falling back on the random, per-device specified match key which essentially just provides "same-device only" attribution.
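
To make that "basket" idea concrete, here is a minimal Python sketch; the provider names and the ordering are purely illustrative.

preferred_providers = [
    "global-footprint.example",  # large-footprint baseline (hypothetical)
    "regional-id.example",       # regional coverage (hypothetical)
    "self.example",              # one's own match key
    "device-random",             # per-device random key: same-device only
]

def choose_match_key(available):
    # `available` maps provider -> match key; the per-device random key
    # should always be present, so the loop should always find a key.
    for provider in preferred_providers:
        if provider in available:
            return provider, available[provider]
    raise LookupError("expected the per-device key to always be available")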

I think all parties (including "large footprint" entities) would have similar incentives pushing them in this direction.

We've also put a lot of time and thought into trying to ensure there isn't coupling between entities. We think we can design the system in such a way that we do not require collaboration. That is, we want a system where any advertiser who runs ads across N platforms can independently specify which match-keys they want to use, without needing those platforms to all agree with them, or to all agree on something.

As long as the cryptographic stuff works and the ad networks are somehow coerced into dropping their other tracking methods, this is a big step up. But on the other hand if the crypto stuff has a hidden weakness in it and Facebook run one of the "trusted" servers, this is a terrible idea

First of all, I assume that Facebook / Google / any ad-tech company will never be trusted to operate a helper server =). This will be enforced by browsers. They'll have to decide which public keys they are willing to use to encrypt reports. I cannot imagine a world in which Firefox would trust Facebook enough to encrypt these events using Facebook's public key =). I'm assuming we will see non-profits with strong privacy reputations operating the servers, or possibly the types of organizations which operate Apple's "Private Relay" service.

Secondly: Yes, exactly. This proposed system would be a big step up for privacy compared to the status quo mechanisms used to count conversions. I have no expectation that browsers and mobile operating systems will stop trying to clamp down on fingerprinting. Actually, if anything I expect them to accelerate those efforts. I also expect to see more and more regulation along these lines.

That the math works out, and that we have a strong privacy guarantee, is the key. This is why we are trying to work out in the open - we think that's the best way to find all the problems / issues, and to get help finding solutions to them. We've already benefitted tremendously from outside input. @betuldurak found a really clever attack that a malicious helper node could perform. I'm really grateful to her for telling us about it! We're working on finding a solution as we speak.

I think the path towards standardization looks like a bunch of iterations out in the open, publishing papers, getting feedback, addressing problems, repeat. I hope that we can eventually converge on a design that is super solid. I wouldn't expect browser vendors to feel comfortable shipping an API like this unless a bunch of independent academics were all convinced that it met our design goals.

@chris-wood

I think the path towards standardization looks like a bunch of iterations out in the open, publishing papers, getting feedback, addressing problems, repeat. I hope that we can eventually converge on a design that is super solid. I wouldn't expect browser vendors to feel comfortable shipping an API like this unless a bunch of independent academics were all convinced that it met our design goals.

Agreed on the approach =) What's the best way to follow along with the proposed solution(s) that you're working on to address @betuldurak's attack? Is the attack documented anywhere?

@martinthomson
Collaborator

https://educatedguesswork.org/posts/ipa-overview/#appendix%3A-linear-relation-attacks perhaps.

We've initiated a few discussions with cryptographers; nothing public as yet.

@sthaase

sthaase commented Feb 16, 2022

What role do regulatory requirements such as GDPR / ePrivacy in Europe play in the solution discovery & design from your perspective? That is one aspect I rarely read about in these proposals, yet I believe that this should be an integral part of the problem definition and solution design.

Looking at IPA specifically, for example, I believe that data protection authorities might categorize the match key as personal data (https://gdpr.eu/eu-gdpr-personal-data/), and storing it on the user's device would therefore require user consent. Would you just accept that as a given, or could solutions be more tailored towards regulatory requirements (in a sense: try to discover solutions that do not require user consent, so as to not end up modeling 30% of conversions that are lost due to tracking opt-outs)?

@benjaminsavage

You're totally right @csharrison. My thinking is that given each site owner can independently choose which match-key providers they work with, any provider who does something nasty like choosing a uniform value of the match key, will instantly lose trust and develop a bad reputation - leading to nobody using them again.

@eriktaubeneck
Collaborator Author

@csharrison another approach would be to structure the inclusion of matchkeys in the report as a key:value of provider:matchkey, instead of just a set of matchkeys, i.e.

{
    "provider1.com": matchkey_1,
    "provider2.com": matchkey_2,
    ...
}

Then, at query time, you could tell the aggregators: "only join on provider1.com". The aggregators could respect that choice in the joining, but still account for budgeting against the full set of matchkeys.
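
As a sketch of that keyed structure and the query-time selection (Python, with illustrative values):

report = {
    "provider1.com": 1234,  # matchkey_1 (illustrative)
    "provider2.com": 5678,  # matchkey_2 (illustrative)
}

def join_key(report, join_provider):
    # Aggregators join only on the provider named at query time...
    return report[join_provider]

def budget_keys(report):
    # ...but privacy budget is charged against the full set of match keys.
    return list(report.values())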

This still wouldn't fully solve the abuse scenario, however, because if someone were to simply set a uniform matchkey, that would likely still disrupt the budget accounting and contribution capping (in which case you'd still need the reputational effects which @benjaminsavage proposes.)

@csharrison
Collaborator

Another question for the IPA proposal. The document mentions it should be possible for third parties to make requests on behalf of other 1Ps. I agree this is a good feature. One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P.

There are many mitigations for this, but it would be good to spell them out. The most obvious one is that if the match key space is high entropy enough, this is just straight up difficult. However, I don't know if we want to design something more robust such that e.g. 1Ps need to attest to working with certain 3Ps up front.

@eriktaubeneck
Collaborator Author

@csharrison agreed that this is underspecified, and this would be a great area to get more clarity on.

[Administrative side note: I opened a request to get a repo specifically for IPA, so we can have issues dedicated to specific topics, and even put together pull requests for docs outlining more details in these areas as they emerge.]

A few thoughts specific to this question:

One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P.

In the case where the 3P has actual source_events and trigger_events, this could be possible without even generating fake data. We allow for individual events to be used in multiple queries, within the privacy budget, so this could be used to exhaust it. In this case, I don't think that making the match key space high entropy would actually work.

In the case where the 3P doesn't have actual events, but is just trying to disrupt some 1P's budget, the high entropy match key space would work.

I don't know if we want to design something more robust such that e.g. 1Ps need to attest to working with certain 3Ps up front.

In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code.

In the second scenario, it seems like a high entropy match key is enough (say 64-bit) where it would be far too expensive to run a query that would actually have meaningful impact. Let's suppose (very conservatively) that it only takes 1ms to generate a fake event - to cover 0.4% (1/256) of the space it would take over 2M years of compute time to generate all those events. And that's not even starting to think about actually running that query...
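
For reference, the arithmetic behind that figure (Python):

keyspace = 2 ** 64
events = keyspace / 256             # 0.4% of the space, ~7.2e16 events
seconds = events * 1e-3             # 1 ms per fake event
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.2e}")               # ~2.28e+06, i.e. over 2M years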

That said, if a 1P wants to work with more than one 3P, then we do probably need a way for that 1P to assign specific portions of its budget across those different 3Ps, which may necessitate the attestation design you mention.

@csharrison
Collaborator

[Administrative side note: I opened a request (https://github.com/patcg-individual-drafts/admin/issues/1) to get a repo specifically for IPA, so we can have issues dedicated to specific topics, and even put together pull requests for docs outlining more details in these areas as they emerge.]

Thanks, yeah. This one issue is getting very cumbersome haha.

In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code.

It's worth thinking through this scenario to see if we could detect / tolerate this. As far as I understand things, it is notoriously difficult for 1Ps to make configuration changes on their sites, so if we are relying on that to deter cheaters it's not ideal.

In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code.

Hm, this made me look back at the IPA doc to see how privacy budgeting is done. For a given report in the MPC system, how do we know the site it is associated with for purposes of budget? The issue I am hoping we can avoid is something like a re-randomization attack, where a 3P gets real events for advertiser A but can somehow use the match key to steal budget from advertiser B while having the report look new. Basically, I think we need to make sure the budget keys are tamper-proof.

I think I agree with you about the high entropy protecting us a great deal from the "guessing" attack. If we can show that's the worst an adversary can do I might be comfortable with it.

@eriktaubeneck
Collaborator Author

It's worth thinking through this scenario to see if we could detect / tolerate this. As far as I understand things, it is notoriously difficult for 1Ps to make configuration changes on their sites, so if we are relying on that to deter cheaters it's not ideal.

I agree if we're talking about a cheater using information from site A to impact something about site B. However, if site A is willing to give a "cheater" the ability to execute JS on their site, I don't see how we can prevent anything beyond that.

For a given report in the MPC system, how do we know the site it is associated with for purposes of budget?

This is a good question. I have a few ideas here, but there are some tradeoffs. I'll open an issue for this specifically once the other repo is created.

@csharrison
Collaborator

I agree if we're talking about a cheater using information from site A to impact something about site B. However, if site A is willing to give a "cheater" the ability to execute JS on their site, I don't see how we can prevent anything beyond that.

Great point. There might be some nuance here with iframes, but even still there is a detection problem that we should think through. This goes back to the general problem of supporting multiple separate reporting origins though which we should flesh out. It ends up being a complicated coordination problem (and possible denial-of-service vector) if everyone has to share a single budget. Needing to involve the advertiser in it makes this even tougher.

One more question about IPA behavior. I want to confirm that attribution across multiple queries works correctly. Here's an example: imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns. This is my understanding of how this would work in IPA.

The advertiser will send 3 queries to the system:

  1. {Campaign1's source events, all trigger events}
  2. {Campaign2's source events, all trigger events}
  3. {Campaign3's source events, all trigger events}

If IPA treats these queries completely independently, then attribution does not take into account source events from separate queries. That is, a hypothetical user journey like {Campaign1 source event, Campaign2 source event, Campaign 3 source event, trigger} will end up contributing a count to each of the three queries above, causing double counting.

One way to make this work would be to first run "global attribution" with the union of the events in all the queries, and then evaluate each query separately from the pool of globally attributed sources/triggers. I couldn't tell if this was how the protocol was intended to work though.
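
To illustrate the "global attribution first" idea, here is a rough Python sketch of one possible design (not the specified protocol): last-touch attribution, written in plaintext for clarity even though IPA would run this under MPC.

def global_attribution(sources, triggers):
    # Attribute each trigger to the most recent prior source across ALL
    # campaigns, so a journey that touched several campaigns counts once.
    attributed = []
    for t in triggers:
        prior = [s for s in sources
                 if s["match_key"] == t["match_key"] and s["ts"] < t["ts"]]
        if prior:
            attributed.append((max(prior, key=lambda s: s["ts"]), t))
    return attributed

def per_campaign_counts(attributed):
    # Each query then draws from the globally attributed pool.
    counts = {}
    for source, _trigger in attributed:
        counts[source["campaign"]] = counts.get(source["campaign"], 0) + 1
    return counts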

@benjaminsavage

Another question for the IPA proposal. The document mentions it should be possible for third parties to make requests on behalf of other 1Ps. I agree this is a good feature. One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P.

Here's how I've been thinking about this:

When a report collector makes an IPA query, it will cost them some amount of money. You have to pay the MPC helper nodes for the compute you use. This implies the existence of some kind of registration process whereby a site / app signs up to run IPA queries, proves ownership of the app / site, and inputs an associated payment instrument.

So I am assuming all IPA queries will be authenticated server-to-server calls. Authentication parameters must be provided to run the query. As such, it should be impossible for anyone but the 1st party, or their legitimate delegate to run queries. If a delegate abuses their permissions, the 1st party should be able to revoke their permission to run IPA queries on their behalf.
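
As a sketch of what such an authenticated call might look like (Python; the header names and HMAC scheme are entirely hypothetical, not part of the proposal):

import hashlib
import hmac
import json

def signed_query(api_key_id: str, api_secret: bytes, query: dict):
    # The 1P, or a delegate using credentials the 1P can revoke, signs
    # the query body so helper nodes can authenticate and bill it.
    body = json.dumps(query).encode()
    sig = hmac.new(api_secret, body, hashlib.sha256).hexdigest()
    return {"x-ipa-key-id": api_key_id, "x-ipa-signature": sig}, body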

@benjaminsavage

One more question about IPA behavior. I want to confirm that attribution across multiple queries works correctly. Here's an example: imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns. This is my understanding of how this would work in IPA.

The advertiser will send 3 queries to the system:

  1. {Campaign1's source events, all trigger events}
  2. {Campaign2's source events, all trigger events}
  3. {Campaign3's source events, all trigger events}

If IPA treats these queries completely independently, then attribution does not take into account source events from separate queries. That is, a hypothetical user journey like {Campaign1 source event, Campaign2 source event, Campaign 3 source event, trigger} will end up contributing a count to each of the three queries above, causing double counting.

One way to make this work would be to first run "global attribution" with the union of the events in all the queries, and then evaluate each query separately from the pool of globally attributed sources/triggers. I couldn't tell if this was how the protocol was intended to work though.

In the event an advertiser wants to evaluate the relative performance of 3 campaigns (which they might have purchased from different ad-sellers) I assume that they would NOT issue three separate queries as you’ve shown. This would wind up hitting their differential privacy budget three times for the same set of trigger events. They’d be far better off running a single query with all of the source events from all three campaigns, and all of the trigger events. This would make much better use of their budget, as well as enable “global attribution”, where we can avoid double counting.

To be clear, I understand this is a significant departure from how things work today. Today Facebook ads manager shows just an FB view of things. In an IPA world, it would not be possible to show them this. It wouldn’t be an efficient use of their privacy budget. It would be much more similar to the mobile app ecosystem where advertisers utilize 3rd party “mobile measurement partners” that give them a unified view across all their ad buying channels, preferring to view reporting there and eschewing platform-specific reporting channels.

@csharrison
Collaborator

csharrison commented May 9, 2022

In the event an advertiser wants to evaluate the relative performance of 3 campaigns (which they might have purchased from different ad-sellers) I assume that they would NOT issue three separate queries as you’ve shown. This would wind up hitting their differential privacy budget three times for the same set of trigger events. They’d be far better off running a single query with all of the source events from all three campaigns, and all of the trigger events. This would make much better use of their budget, as well as enable “global attribution”, where we can avoid double counting.

I think I might be missing something. Is this use-case possible to achieve with IPA:

imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns.

i.e. I want a break-out that says:
Campaign1: led to 10 conversions
Campaign2: led to 15 conversions
Campaign3: led to 150 conversions

My thought from the doc was this is accomplished via carefully sending relevant source events, but it seems like there is some other way this should be done. Here is the relevant piece from the doc:

Note that source.example can use its own context and the context provided by trigger sites to group these queries into relevant sets. For example, if the source reports were a set of ad impressions, source.example could choose to run a query for a specific campaign, and only include trigger reports for items relevant to that campaign.

Now that is specific to a source query, but I assumed you'd do the same for trigger queries like the one I described.

@eriktaubeneck
Collaborator Author

I think I might be missing something. Is this use-case possible to achieve with IPA:

imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns.

i.e. I want a break-out that says:
Campaign1: led to 10 conversions
Campaign2: led to 15 conversions
Campaign3: led to 150 conversions

My thought from the doc was this is accomplished via carefully sending relevant source events, but it seems like there is some other way this should be done. Here is the relevant piece from the doc:

Note that source.example can use its own context and the context provided by trigger sites to group these queries into relevant sets. For example, if the source reports were a set of ad impressions, source.example could choose to run a query for a specific campaign, and only include trigger reports for items relevant to that campaign.

Now that is specific to a source query, but I assumed you'd do the same for trigger queries like the one I described.

Our wording in the doc may not have been super clear - there are two different cases to consider here.

The first case is the one you mention: you would want to issue a single query with all the events. It would be something like the following SQL query:

select
    source_event.campaign_id
  , count(trigger_event.event_id)
  , sum(trigger_event.value)
from
    source_events source_event
    join trigger_events trigger_event
    on <matchkeys and attribution logic>
group by
    source_event.campaign_id

The second case is where there are multiple distinct products involved, such as:

{
    (campaign_1a, campaign_1b, ...) : product_1,
    (campaign_2a, campaign_2b, ...): product_2,
    ...
}

In this case, since these queries can be constructed entirely independently, the advertiser running the query should be able to bifurcate them appropriately and run the same query as above, without affecting the results. Having less data should be more efficient, and also not exhaust unnecessary privacy budget. It would also avoid the need for more complicated attribution logic in the MPC (since you'd only want attribution within the appropriate mapping.)
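
A sketch of that bifurcation (Python; the field names are illustrative):

campaign_to_product = {
    "campaign_1a": "product_1", "campaign_1b": "product_1",
    "campaign_2a": "product_2", "campaign_2b": "product_2",
}

def bifurcate(source_events, trigger_events):
    # Split events into fully independent per-product queries; each
    # entry can then be run as its own query, as in the SQL above.
    queries = {}
    for s in source_events:
        product = campaign_to_product[s["campaign"]]
        q = queries.setdefault(product, {"sources": [], "triggers": []})
        q["sources"].append(s)
    for t in trigger_events:
        q = queries.setdefault(t["product"], {"sources": [], "triggers": []})
        q["triggers"].append(t)
    return queries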

@csharrison
Collaborator

Thanks @eriktaubeneck , I think I missed the piece where we can annotate source events by their relevant campaign ID. I wasn't sure if that was supported.

@csharrison
Collaborator

I guess I will follow-up: how much extra information can we pack into the events? One of the benefits of creating queries as a "bag of relevant events" is that we can use arbitrarily complex information to structure the queries. Once the splitting has to happen within the protocol though, it becomes harder, especially with MPC. Could you imagine us supporting many dimensions of features beyond campaign IDs in IPA?

@eriktaubeneck
Collaborator Author

campaign_id is probably a bad name, TBH. group_id is better, and because the source_event is sent back in the context in which the API was called, the group_id shouldn't need to be bound to the event at call time. I've been thinking that one could assign the group_id after the fact (and even change it for two different queries.) The only restriction would be reaching the set grouping threshold (+DP noise).
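
A sketch of what query-time grouping could look like (Python; annotate is a hypothetical helper, not part of the proposal):

events = [{"id": "ev1"}, {"id": "ev2"}]  # opaque source events, held server-side

def annotate(source_events, grouping):
    # `grouping` maps the caller's own event id to a group_id chosen at
    # query time; the same events can be regrouped for a later query.
    return [dict(e, group_id=grouping[e["id"]]) for e in source_events]

by_country = annotate(events, {"ev1": "US", "ev2": "DE"})
by_placement = annotate(events, {"ev1": "feed", "ev2": "stories"})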

I haven't thought about assigning multiple group_ids to a single event within a single query, and I have the same intuition about increasing complexity in the MPC. Are there scenarios where it would be important to have multiple group_ids in a single query (as opposed to sequential, independent queries)?

@csharrison
Collaborator

csharrison commented May 9, 2022

If we are allowed to set group_id after the fact (i.e. at query time), it resolves my primary concern, i.e. that it should be possible to make complicated queries from arbitrary data at query time. This is one of the downsides of the Attribution Reporting API: you need to decide at report-creation time what the query needs to be. The architecture of IPA seems well suited to solve that problem.

I will need to think a bit (and understand more about IPA) if there's any benefit to multiple group_ids per record in a single query. I do believe there will be a large benefit to batching queries, so if that can be supported in an efficient way it might not be necessary. For example, it would be great to have an answer for how we would support the Criteo experiment in IPA (even theoretically). In that setting, we have each attributed event contributing separately to 171 buckets.

@benjaminsavage

I agree with @eriktaubeneck. Let me just elaborate a bit:

Since the source_event is generated in-context, you'll know all the relevant queries in which you might like to use it. Perhaps a country breakdown, an age-and-gender breakdown, a placement breakdown, etc.

I'm assuming that the "group_id" can be added server-side at will, and the events can be utilized in multiple queries.

So an advertiser who wishes to get multiple breakdowns for their conversions would have to decide how much of their privacy budget to spend per-breakdown, then could issue multiple queries using the same events.

As for whether it could contribute to 171 buckets: I think the answer depends entirely on the privacy budget.

@eriktaubeneck
Collaborator Author

For example, it would be great to have an answer for how we would support the Criteo experiment in IPA (even theoretically). In that setting, we have each attributed event contributing separately to 171 buckets.

I agree that this seems like a good goal worth shooting for. At the moment, I'd be happy to start with something simple (like last touch or even credit over N touches), but with the flexibility to get more complicated.

@csharrison
Collaborator

csharrison commented May 9, 2022

I agree that this seems like a good goal worth shooting for. At the moment, I'd be happy to start with something simple (like last touch or even credit over N touches), but with the flexibility to get more complicated.

To be clear, this example is using last-touch attribution. It's just that we want to sum up not just campaign counts but also other features, so we can know things like "how many attributed events had feature X", "how many attributed events had feature Y", etc.

@benjaminsavage yes, privacy budget :) Actually this batching queries thing gives you a super power, budget wise, which is exactly what Criteo takes advantage of in their competition. See also this thread

@benjaminsavage

A quote from that thread:

We believe that there are great benefits in aggregating reports from multiple source_site (or attribution_destination, depending on the use case) in a single request, to lower the overall level of noise.

I agree. I would really like for IPA to support queries where the source_events span multiple source sites. I think this is a key use-case for ad-networks that show ads across the open web. We discuss this possible extension in our IPA proposal in the "business privacy grain" section. It's really hard though, and we haven't yet worked through all the issues with this. In particular, it requires careful design to ensure a malicious helper node cannot violate the "Vegas Rule".

Reading through that thread, the use-case is really about training ML, not reporting. Rather than trying to get hundreds of independent breakdowns out of the API, it would probably be more efficient (from a DP perspective) to just train an ML model in MPC, and emit a trained model (with DP noise added). We allude to this as a possible future extension: link. This would have the added benefit of being able to model the interaction effects between these features.

@csharrison
Collaborator

csharrison commented May 9, 2022

I agree. I would really like for IPA to support queries where the source_events span multiple source sites. I think this is a key use-case for ad-networks that show ads across the open web. We discuss this possible extension in our IPA proposal in the "business privacy grain" section. It's really hard though, and we haven't yet worked through all the issues with this. In particular, it requires careful design to ensure a malicious helper node cannot violate the "Vegas Rule".

I don't think business privacy grain is necessary here. The thread there is about combining reports across publishers e.g. for a given advertiser. My understanding is that IPA supports this by default (and uses, in that example, the advertiser privacy unit).

Reading through that thread, the use-case is really about training ML, not reporting. Rather than trying to get hundreds of independent breakdowns out of the API, it would probably be more efficient (from a DP perspective) to just train an ML model in MPC, and emit a trained model (with DP noise added). We allude to this as a possible future extension: link. This would have the added benefit of being able to model the interaction effects between these features.

Yes, I used this mostly as an example, to understand the limitations of IPA. Obviously if we can train models directly in IPA it will probably be more efficient, but supporting the Criteo competition setting is a decent litmus test on how powerful the reporting use-case is. As far as I understand, supporting a setting like this could allow us to do logistic regression in a pretty privacy-efficient way.

Oh let me cc @alois-bissuel since I am bringing up some Criteo stuff :)

@juanli16

Having read the IPA proposal, I have the following question concerning the addition of differentially private noise on aggregated trigger values. How does one compute (estimate) the sensitivity of the trigger value in a setting like IPA, where the range of the trigger values is never revealed to the aggregators?

Do you in this case apply local differential privacy, meaning the trigger sites that generate these trigger events add properly sampled local differentially private noise to each trigger value before encrypting it and submitting it to the MPC network?

@benjaminsavage

Hi @juanli16 - and thanks for the question!

Our current thinking is that the API caller would provide some kind of "zero knowledge proof" along with each trigger event, proving that the trigger value lies within a given range.

The actual range would also need to be an API param, as the MPC would need to add noise proportional to that value. This param value would need to align with the zero knowledge proofs provided.

I would not imagine adding any local differential privacy.

@eriktaubeneck
Collaborator Author

The solution @benjaminsavage references is the one presented in Prio. It's important to clarify, though, that this requires a global bound on trigger values (for example, we could pick the range [0, 2^16) and use 16 bits). Depending on the cryptographic details of the MPC, we may also be able to enforce that bound just by limiting the secret shared values (and avoid the extra work of a zero-knowledge proof.)

To address the main point of your question, @juanli16, the range is in fact revealed to the aggregators (it is a global constant), but individual values are not revealed to the aggregators. And because the range is a known quantity, the distribution from which to draw the DP noise is also known.
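
As a sketch of the noise calibration (Python, assuming a plain Laplace mechanism; the proposal does not fix the exact mechanism):

import random

TRIGGER_VALUE_BOUND = 2 ** 16  # global range [0, 2^16), known to everyone

def noisy_sum(true_sum, epsilon):
    # With a per-record bound B, one record changes the sum by at most B,
    # so the Laplace scale is B / epsilon. (A per-user contribution cap,
    # discussed below, would change the sensitivity.)
    scale = TRIGGER_VALUE_BOUND / epsilon
    # The difference of two unit exponentials is Laplace(0, 1).
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_sum + noise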

@juanli16

Thanks @benjaminsavage and @eriktaubeneck for the clarifications! I was starting to arrive at the same conclusion after inspecting the current state of this repo: raw-ipa. I do have 2 follow-up questions if you would indulge me.

  1. With the global limit on what a trigger value can be, how do we deal with trigger values that are out of bounds? Would it make sense for the browser that is generating the trigger event with an out-of-bounds trigger value to split the event into multiple individual trigger events, each with a trigger value that satisfies the global bound, but which still sum up to the original value (basically an additive secret share)?

  2. If the trigger values are very small, representing for example the number of clicks/app installs, etc., wouldn't it be possible for the differentially private noise computed based on the global bound (much larger than the trigger values) to skew the final aggregate value and lower the overall accuracy of the result?

@benjaminsavage

benjaminsavage commented Jul 12, 2022

A few thoughts:

  1. We might not have to select the "trigger value" in the browser. It might be possible to select it later, server-side. This would include some amount of work to split it up into secret shares and compute a ZKP server-side.
  2. I suspect the bound on trigger values will be scoped to the Query. So if you run 10 different queries, you could provide a different bound for each. This should help with your problem number 2. You could run one query which is just counting app-installs, and each trigger event would just have a trigger value of 1. You could run a separate query which is totalling sales, and use a much higher bound.
  3. I do not think we will have to resort to hacks like splitting an event up into multiple events. However, there will need to be a (per-query specified) limit on the total amount a single user can contribute to the value. This is a separate limit, in addition to the per-trigger-event limit on the trigger value. It is necessary to provide a differential privacy guarantee. So unfortunately, if one user makes a ton of huge purchases, their total contribution will be capped. Again - as this is a per-query param, you can find your own optimal tradeoff between a high value (less capping, but more DP noise added) and a low value (more capping, but less DP noise added).
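
A sketch of those two per-query limits working together (Python, in plaintext for clarity; IPA would enforce this inside the MPC):

from collections import defaultdict

def capped_total(trigger_events, value_bound, user_cap):
    contributed = defaultdict(int)  # running contribution per match key
    total = 0
    for ev in trigger_events:
        v = min(ev["value"], value_bound)  # per-trigger-event limit
        v = max(0, min(v, user_cap - contributed[ev["match_key"]]))  # per-user cap
        contributed[ev["match_key"]] += v
        total += v
    return total  # DP noise proportional to user_cap is added on top
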
@alextcone

Is this still where we should open issues on IPA?

@benjaminsavage

Hi @alextcone - IPA related issues can be filed here: https://github.com/patcg-individual-drafts/ipa/issues
