Privacy policy discovery. #39

Open
mikewest opened this issue Sep 4, 2023 · 26 comments

Comments

@mikewest

mikewest commented Sep 4, 2023

It would be ideal if sites' privacy policies were more discoverable to users, their agents, and to crawlers. To that end, I'd suggest that we:

  1. Pave the rel=privacy-policy cowpath (based on HTTP Archive data, this appears in at least 285,421 distinct documents) by defining a privacy-policy link type.

  2. Define a well-known URL that redirects to a host's privacy policy (e.g. /.well-known/privacy-policy).

There's quite a bit that could be done beyond discovery of course, but these two steps seem small, simple, and relatively easy to adopt.
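As a rough illustration of the first step, a client might extract annotated links along these lines. This is only a sketch using Python's standard library; the names `PolicyLinkParser` and `find_policy_links` and the sample markup are made up for illustration and are not part of the proposal.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PolicyLinkParser(HTMLParser):
    """Collects hrefs of <a>, <area>, and <link> elements annotated with rel=privacy-policy."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.policy_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area", "link"):
            return
        attr_map = dict(attrs)
        # rel holds a space-separated set of tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "privacy-policy" in rel_tokens and href:
            # Resolve relative hrefs against the document URL.
            self.policy_urls.append(urljoin(self.base_url, href))


def find_policy_links(html, base_url):
    """Hypothetical helper: returns every privacy-policy link found in the markup."""
    parser = PolicyLinkParser(base_url)
    parser.feed(html)
    return parser.policy_urls


sample = '<link rel="privacy-policy" href="/privacy"><a rel="nofollow" href="/x">x</a>'
print(find_policy_links(sample, "https://site.example/"))
```

Returning a list rather than a single URL is deliberate: nothing stops a page from carrying more than one annotated link.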

I've written this up in a little more detail at https://mikewest.github.io/privacy-policy-discovery/, but there's not much to that document beyond what's written here.

WDYT?

@annevk

annevk commented Sep 4, 2023

I think it would be useful to explain the relationship with P3P (probably none). Were there other prior initiatives like this to acknowledge?

https://microformats.org/wiki/existing-rel-values has rel=privacy, but without any reference. cc @tantek

Also, what's the intended purpose of this? On its own this seems relatively harmless, but if it's combined in some way whereby if you have this you get to share cookies across the site boundary, not so much.

@mikewest
Author

mikewest commented Sep 4, 2023

I think it would be useful to explain the relationship with P3P (probably none). Were there other prior initiatives like this to acknowledge?

P3P was an attempt to define a machine-readable representation of a privacy policy. This proposal only defines a link to a site's (hopefully) already-existing privacy policy prose. I think they're different in kind; I think we can learn a lot from P3P and there's good discussion to be had around it, but I don't think this proposal has much relationship to it at all.

https://microformats.org/wiki/existing-rel-values has rel=privacy, but without any reference. cc @tantek

I noted this in https://mikewest.github.io/privacy-policy-discovery/#link-type. HTTP Archive suggests that privacy-policy is substantially more popular on status quo websites, and I'd recommend we pave that path.

I also think the word "privacy" is a bit broader, and I could imagine someone wanting to define something more general that didn't link to the policy specifically but something else. 🤷

Also, what's the intended purpose of this? On its own this seems relatively harmless, but if it's combined in some way whereby if you have this you get to share cookies across the site boundary, not so much.

This is relatively harmless. :) I'm not suggesting any behavioral change, and certainly nothing with regard to storage or cookies could be justified by a link to a privacy policy. The immediate use case would be UX changes in clients (including user agents of course, but also crawlers of various sorts) that could help users discover privacy policies, not web-facing changes.

@annevk

annevk commented Sep 5, 2023

I think it's worth pointing out P3P and saying there's no relation. At least I think that will preempt a set of concerns.

A colleague pointed out that we might also want to consider other privacy policies an origin might be responsible for, such as an Android application. Presumably we'd want to clearly scope this to websites, but giving some kind of indication what other platforms could do would be good.

@mikewest
Author

mikewest commented Sep 5, 2023

I think it's worth pointing out P3P and saying there's no relation. At least I think that will preempt a set of concerns.

I'll find some way to make that disclaimer, thanks for the suggestion.

A colleague pointed out that we might also want to consider other privacy policies an origin might be responsible for, such as an Android application. Presumably we'd want to clearly scope this to websites, but giving some kind of indication what other platforms could do would be good.

The link type seems like it wouldn't be subject to this kind of confusion, but I agree that clarifying the purpose of the well-known redirect as being focused on the website that hosts it would be worthwhile.

Do y'all think it would be worth defining specific extensions to this (e.g. /.well-known/privacy-policy/web vs /.well-known/privacy-policy/ios)? I'm not sure it would be.

mikewest added a commit to mikewest/privacy-policy-discovery that referenced this issue Sep 5, 2023
In privacycg/proposals#39, annevk@ suggested clarifying this proposal's
relationship to P3P, and discussing the scoping of the well-known URL as
it regards non-web platforms. This patch attempts to do both.
@annevk

annevk commented Sep 5, 2023

Potentially, providing some kind of direction if platforms want to go that way seems worthwhile. Using a /platform suffix seems pretty good, but I'd keep web at /.well-known/privacy-policy. I think this could be a suggestion at most as platforms would have to perform their own registration.
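To make the suggested spelling concrete, here is a minimal sketch of how a client could build these URLs. The helper name `well_known_policy_url` is hypothetical, and the platform suffix is only the suggestion floated in this thread, not a registered convention.

```python
from urllib.parse import quote, urlsplit

WELL_KNOWN_PATH = "/.well-known/privacy-policy"


def well_known_policy_url(origin, platform=None):
    """Hypothetical helper: web stays at the bare path, other platforms append a suffix."""
    parts = urlsplit(origin)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an origin such as https://site.example")
    path = WELL_KNOWN_PATH
    if platform is not None:
        # Assumption: the platform name becomes a single, percent-encoded path segment.
        path += "/" + quote(platform, safe="")
    return f"{parts.scheme}://{parts.netloc}{path}"


print(well_known_policy_url("https://site.example"))
print(well_known_policy_url("https://site.example", "ios"))
```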

@mikewest
Author

mikewest commented Sep 5, 2023

I agree with all of that.

I added a small note to https://mikewest.github.io/privacy-policy-discovery/#scope suggesting the possibility of this kind of extension, but I agree that it's unnecessary (and unhelpful) to use that for the web.

@annevk

annevk commented Sep 5, 2023

Thanks! I'd expand "PWA" or simply say website.

@bvandersloot-mozilla

Paving the rel=privacy-policy cowpath seems like a good choice to me.

The utility of a well-known link is less obvious to me, but if the scope is set to the Origin then it seems like a good option to provide to web developers.

One question: In the case where both are present, both would apply?

@mikewest
Author

mikewest commented Sep 5, 2023

Thanks, @bvandersloot-mozilla!

The well-known link seems likely to me to be useful for non-browser clients (e.g. crawlers). I agree with you that the link type is more likely to be immediately useful for browsers.

Regarding scoping, I expect a mismatch between the well-known URL's redirect target and a link on a page to generally be a misconfiguration. I can imagine a circumstance in which the claims made about a specific page could be more strict than the claims made about an origin at large, but that seems like an edge case that I'm not sure maps to any practical use case I'm familiar with.

@bvandersloot-mozilla

The well-known link seems likely to me to be useful for non-browser clients (e.g. crawlers). I agree with you that the link type is more likely to be immediately useful for browsers.

Thanks for capturing my thoughts better than I could! This also crystalizes a bit why the platform option may be undesirable without a small, known set of platforms.

It also opens up use for even more non-web use-cases. E.g. smtp.example.com, dns.example.com, vpn.example.com. I can't say if that would be useful, but I can't say that it wouldn't.

@mikewest
Author

mikewest commented Sep 5, 2023

The well-known link seems likely to me to be useful for non-browser clients (e.g. crawlers). I agree with you that the link type is more likely to be immediately useful for browsers.

Thanks for capturing my thoughts better than I could! This also crystalizes a bit why the platform option may be undesirable without a small, known set of platforms.

I think it would potentially be desirable for us to define how we'd expect platforms to spell their URL (e.g. by adding their name to the path). I'm not sure it would be reasonable for us to define the meaning of any given platform name. In principle, that seems like it would require a registry (but in practice it would be a short list, so probably no harm done by codifying it).

It also opens up use for even more non-web use-cases. E.g. smtp.example.com, dns.example.com, vpn.example.com. I can't say if that would be useful, but I can't say that it wouldn't.

This is a good point that I'll add to the doc.

@bvandersloot-mozilla

One nerd-sniping on .well-known later... I think we should only have the rel= definition.

The reasons that convinced me:

  • having an origin-level resource be the privacy policy for a given page seems like a recipe for ownership conflicts. Controlling meta tags on a page seems like a more direct connection to who dictates the privacy policy than write access to a separate path on the same origin.
  • removing a chance for misconfiguration and configuration overhead is a win
  • we would be adding a new feature where the web already built a cow path
  • Any crawler should be able to pull from static meta tags on the root of the origin anyway
  • If we are thinking only about the web much of the upside to defining a new .well-known is out of scope

Are these fair points?

@mikewest
Author

mikewest commented Sep 5, 2023

I think your points are fair, but I disagree with your conclusion. :) Some thoughts inline:

having an origin-level resource be the privacy policy for a given page seems like a recipe for ownership conflicts. Controlling meta tags on a page seems like a more direct connection to who dictates the privacy policy than write access to a separate path on the same origin.

I agree that the distinction between page-level and origin-level declarations is meaningful, but I think it cuts in the other direction. Precisely because .well-known generally has origin-wide implications (digital asset links, app store plist files, FedCM configuration, etc), it's reasonable to assume that folks responsible for an entity's policies would be more likely to have insight into the claims made there than into the content of any arbitrary page the origin serves.

Additionally, I think that creating a well-understood mechanism for declaring a set of policy constraints on an entire origin's behavior is valuable. I don't think that can reasonably or semantically be done on a resource-by-resource basis.

removing a chance for misconfiguration and configuration overhead is a win

I think you're correct to say that there's a chance of origin-level and page-level declarations pointing to distinct documents. That said, you suggest above that there's sufficiently-direct attention paid to page-level links to a privacy policy. I suggest above that even more attention is likely paid to origin-wide declarations of the same. There will certainly be cases in which there's a conflicting declaration, but given the effort that well-meaning entities put into their policy declarations, the risk of user confusion seems low in the long run.

we would be adding a new feature where the web already built a cow path

  1. We should pave the cowpath; cows probably like asphalt. So I'm glad we agree that adding the link type is reasonable. Let's do that.
  2. Adding new features is often reasonable. :) Why is the .well-known a bad new feature to add?

Any crawler should be able to pull from static meta tags on the root of the origin anyway

Yup. I agree that a .well-known redirect is additive in many cases. That said, the next bullet is relevant.

If we are thinking only about the web much of the upside to defining a new .well-known is out of scope

I think you provided good counterexamples of domains that don't serve navigable HTML documents, but for which it would be nice to expect a declaration of policy constraints on data collection and usage.

Thanks again for the feedback, @bvandersloot-mozilla!

@kdenhartog

kdenhartog commented Sep 5, 2023

We're (Brave) very interested in this and would find it useful for making it easy to discover the privacy policies rather than needing to maintain lists of popular sites. Would there be any interest in also reusing this pattern for terms of service as well? That's another important link used during registration flows that would be useful to discover and surface within UI upon registration (the use case we're interested in this for).

The easy discovery of these two pages will be useful for UAs to be able to better assist users during registration flows and could lead to some useful additions in FedCM I'd think as well.

@mikewest
Author

mikewest commented Sep 6, 2023

Hey @kdenhartog, thanks for your thoughts.

We're (Brave) very interested in this and would find it useful for making it easy to discover the privacy policies rather than needing to maintain lists of popular sites.

👍

Would there be any interest in also reusing this pattern for terms of service as well?

I don't see the same level of alignment around a particular link type here. Very naively skimming HTTP Archive, I only see 665 pages that contain "terms" (many more contain "tos", but generally as part of another string, like photos). Also, of those pages, many are rel="terms of service", which isn't how link types work. :)
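To spell out why rel="terms of service" doesn't work: rel is a space-separated set of tokens, so that value declares three unrelated link types rather than one. A small sketch (the helper `rel_tokens` is made up for illustration, and Python's `str.split()` only approximates HTML's ASCII-whitespace splitting):

```python
def rel_tokens(rel_value):
    """Split a rel attribute value into its lowercased set of link-type tokens."""
    return {token.lower() for token in rel_value.split()}


# Three separate tokens, none of which is a terms-of-service link type.
print(sorted(rel_tokens("terms of service")))
print("terms-of-service" in rel_tokens("terms of service"))

# A hyphenated keyword survives tokenization and can sit alongside other types.
print("terms-of-service" in rel_tokens("nofollow Terms-Of-Service"))
```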

In the absence of a clear indication of preference among web developers, I don't have any objection to adding a terms-of-service definition as what we'd like people to flock towards. Is that what you have in mind?

@kdenhartog

In the absence of a clear indication of preference among web developers, I don't have any objection to adding a terms-of-service definition as what we'd like people to flock towards. Is that what you have in mind?

Yup, that's exactly what I had in mind.

@othermaciej

Why specifically is the .well-known version more useful to crawlers? Do crawlers not parse HTML as they go?

I’m asking because having two distinct ways of specifying the privacy policy creates the possibility that they may be in conflict. That means we have to specify which one takes precedence if they are different. If it’s the rel link on the page, then crawlers will have to read that anyway. If it’s the well-known URL that takes precedence, then browsers will have to read that anyway and not just trust the rel. If they are required to always be the same but with no enforcement or defined precedence, then that creates the potential for confusing or deceiving users, if for example privacy policy UI showed different things in browsers and in services that obtain their content by crawling. All these options are kind of bad, so it would be better if there is only one way to specify. But having a defined precedence and having the rel link take precedence is probably the least bad possibility (b/c it makes more sense for the specific to override the general, and better to have it defined that any client potentially has to check for both than to incorrectly imply that either will do and they are guaranteed to be the same).
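For concreteness, the "least bad" option described above (defined precedence, with the page-level rel link winning) could be sketched as follows. The names `PolicyDiscovery` and `resolve_policy` are hypothetical, and the thread has not settled on this behavior.

```python
from typing import NamedTuple, Optional


class PolicyDiscovery(NamedTuple):
    url: Optional[str]      # the policy URL the client should surface, if any
    source: Optional[str]   # "rel" or "well-known"
    conflict: bool          # True when both mechanisms exist but disagree


def resolve_policy(page_link, well_known_target):
    """Prefer the page-level link (specific) over the origin-level redirect (general)."""
    conflict = bool(page_link and well_known_target and page_link != well_known_target)
    if page_link:
        return PolicyDiscovery(page_link, "rel", conflict)
    if well_known_target:
        return PolicyDiscovery(well_known_target, "well-known", False)
    return PolicyDiscovery(None, None, False)


print(resolve_policy("https://site.example/app/privacy", "https://site.example/privacy"))
print(resolve_policy(None, "https://site.example/privacy"))
```

Note that a client still has to consult both sources to be able to report the conflict at all, which is the cost this comment is pointing at.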

@mikewest
Author

mikewest commented Sep 7, 2023

Hey @othermaciej, thanks for the feedback!

While I think that sites generally have a single policy document they point to for their behavior on a given platform, I agree that the concern you and @bvandersloot-mozilla raise is reasonable. If we end up deciding that the link type is the only thing we need, great. :)

Broadly, I have three kinds of answers for you:

  1. .well-known files can be discovered on hosts that don't otherwise serve HTML (analytics services, for example).

  2. The cases of conflict you're worried about will occur even if we only have link annotations. A page might have a <link> in its <head> that points to one policy while an <a> in the footer points to another, other pages might have multiple <a> elements, etc. .well-known has the advantage in this case of being singular by design, but even in that case servers could send one Location to crawlers and another to users.

  3. The specificity and scoping of a .well-known file and <a rel="..."> differ meaningfully in ways that I think are helpful. Because .well-known is scoped to an origin as opposed to any particular resource on it, it seems reasonable to me to establish the expectation that it can meaningfully be interpreted as demarcating an outer boundary of data collection and usage behavior for all the origin's resources. Individual pages might tighten that policy (e.g. "Data collected via this specific application on site.example will not be used by other applications on site.example."), but I think it would be strange to broaden policies on a resource-by-resource basis.

So, I agree that the conflicts you're pointing to can and will happen. This seems to me like a policy problem whose risk we can mitigate at that layer by defining expectations around this mechanism more clearly, and relying on non-technical actors in the ecosystem to help us create incentives for correct usage.

@samuelgoto

samuelgoto commented Sep 7, 2023

Paving the rel=privacy-policy cowpath seems like a good choice to me.

+1

I don't have any objection to adding a terms-of-service definition as what we'd like people to flock towards.

+1, I think a terms-of-service definition would be helpful too!

The easy discovery of these two pages will be useful for UAs to be able to better assist users during registration flows and could lead to some useful additions in FedCM I'd think as well.

Just to try to add some clarity here: FedCM has defined an uncredentialed client_metadata_endpoint, discoverable through the /.well-known/web-identity file, that IdPs expose. Given a clientId (an OIDC/OAuth term that refers to a pre-registered client, approximately the calling site), it returns a JSON payload containing:

{
  "privacy_policy_url": "https://rp.example/clientmetadata/privacy_policy.html",
  "terms_of_service_url": "https://rp.example/clientmetadata/terms_of_service.html"
}

This is currently used in the FedCM UI on sign-up (when the user is creating a new account on the website).
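As a small sketch of how a consumer might read that payload: the function name `policy_links_from_client_metadata` is hypothetical, and the two keys are simply the ones shown in the example above.

```python
import json

payload = """{
  "privacy_policy_url": "https://rp.example/clientmetadata/privacy_policy.html",
  "terms_of_service_url": "https://rp.example/clientmetadata/terms_of_service.html"
}"""


def policy_links_from_client_metadata(raw):
    """Pull the two policy URLs out of a client metadata response body."""
    data = json.loads(raw)
    # Either key may be absent; report None rather than failing.
    return {key: data.get(key) for key in ("privacy_policy_url", "terms_of_service_url")}


links = policy_links_from_client_metadata(payload)
print(links["privacy_policy_url"])
print(links["terms_of_service_url"])
```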

This mechanism is different from what's being proposed here in a few ways:

  • first, it is not self-discoverable by other Web Platform APIs: you need a clientId, which is unfortunate, I think
  • second, it is not self-declared by the origin, but rather asserted by the Identity Provider, which I think is also unfortunate

It is hard to tell with a lot of confidence right now, but I have an intuition that the mechanism proposed here could be made to augment the FedCM UX indeed, as @kdenhartog points out.

Why specifically is the .well-known version more useful to crawlers? Do crawlers not parse HTML as they go?

One consideration here is that, for browsers (as opposed to crawlers), loading and checking for the existence of a /.well-known/privacy-policy seems to be much cheaper than loading index.html and parsing it to find the <link rel=privacy-policy>, so it may allow us to discover that synchronously rather than asynchronously (e.g. at crawling / pre-processing).

mikewest added a commit to mikewest/privacy-policy-discovery that referenced this issue Sep 8, 2023
@kdenhartog

discover that synchronously rather than asynchronously (e.g. at crawling / pre-processing).

That's what we had in mind. One of the use cases here is that browser UI could call out the ToS/privacy policy links to make these easier to find. The other thing that we're interested in experimenting with here is using a local LLM to be able to parse the privacy policy and terms of service and flag any concerning issues to the user. Obviously LLMs are a bit finicky for something like this so it's more just an experiment at this point, but having to manually maintain a list for this seemed a bit more of a headache than something like this as an option.

mikewest added a commit to whatwg/html that referenced this issue Oct 24, 2023
This PR defines a `terms-of-service` link type that refers to a document
which contains information about the agreements between a document's provider
and users who wish to use the document provided.

This link type was initially discussed in privacycg/proposals#39
and initially sketched in https://mikewest.github.io/privacy-policy-discovery/.
@bvandersloot-mozilla

Why specifically is the .well-known version more useful to crawlers? Do crawlers not parse HTML as they go?

One consideration here is that, for browsers (as opposed to crawlers), loading and checking for the existence of a /.well-known/privacy-policy seems to be much cheaper than loading index.html and parsing it to find the <link rel=privacy-policy>, so it may allow us to discover that synchronously rather than asynchronously (e.g. at crawling / pre-processing).

A browser looking in the current page for the appropriate link type, rather than loading another resource, would be the most appropriate step to take.

One of the use cases here is that browser UI could call out the ToS/privacy policy links to make these easier to find.

This is exactly the use case for a rel-link. Forcing the UA to go out of band to double-check that a well-known resource doesn't exist before being certain that a privacy policy doesn't exist for a page is undesirable.

@mikewest
Author

I agree with @bvandersloot-mozilla that browsers currently rendering pages are quite likely to be able to extract metadata like this from the page they're currently rendering. I likewise agree that .well-known URLs wouldn't be necessary for browser UI associated with the currently rendered page.

That said, there are certainly UI use cases that benefit from out-of-band checks with scope broader than an arbitrary document on an origin (see e.g. https://www.w3.org/TR/change-password-url/), and I think the conversation above has outlined even some in-band cases in which a <link> or <a> is unlikely to be available (API servers, etc). Are my intuitions around those cases incorrect?

@mikewest
Author

(It might be worthwhile to split the .well-known discussion off into mikewest/privacy-policy-discovery#3 for folks who are interested.)

@npdoty

npdoty commented Oct 25, 2023

I don't see the same level of alignment around a particular link type here. Very naively skimming HTTP Archive, I only see 665 pages that contain "terms" (many more contain "tos", but generally as part of another string, like photos). Also, of those pages, many are rel="terms of service", which isn't how link types work. :)

In the absence of a clear indication of preference among web developers, I don't have any objection to adding a terms-of-service definition as what we'd like people to flock towards. Is that what you have in mind?

RFC 6903 defined terms-of-service as well as privacy-policy. I don't know that this caused any particular adoption, but it's one additional data point of prior art, and it would be less confusing to developers if we encoded the same values in subsequent documents.

@npdoty

npdoty commented Oct 25, 2023

I identified four existing ways of doing this when we last discussed it in ... 2013, apparently.

For the use case of researchers or civil society who might want to discover and compare privacy policies en masse, there might be some advantage to .well-known and equivalents, but it's also not hard to discover those from link relation elements that we would often expect to find on the home page of a large number of domains.

@mikewest
Author

RFC 6903 defined terms-of-service as well as privacy-policy.

Thanks for the link, Nick! I didn't realize this document existed (and I'm embarrassed that I didn't think to look at the IETF for link type definitions...). Thanks also for the pointer to earlier discussion. I'm glad the proposals here landed on the same names, and I'll update my doc and PRs to point to that document instead.

For the use case of researchers or civil society who might want to discover and compare privacy policies en masse, there might be some advantage to .well-known and equivalents, but it's also not hard to discover those from link relation elements that we would often expect to find on the home page of a large number of domains.

From talking with folks like https://checks.google.com/, establishing and encouraging a pattern through which a HEAD request was enough for discovery could be a small-but-noticeable reduction in cost in comparison to rendering homepages, JavaScript and all.
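A rough sketch of what HEAD-based discovery could look like, assuming the well-known URL answers with a redirect whose Location header names the policy. The function names are made up for illustration, and `discover_policy` performs a real network request, so it is defined here but not called.

```python
import http.client
from urllib.parse import urljoin, urlsplit

WELL_KNOWN_PATH = "/.well-known/privacy-policy"
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def policy_url_from_response(origin, status, location):
    """Interpret a response to the well-known request; None means nothing was discovered."""
    if status in REDIRECT_STATUSES and location:
        # Location may be relative; resolve it against the well-known URL.
        return urljoin(origin.rstrip("/") + WELL_KNOWN_PATH, location)
    return None


def discover_policy(origin, timeout=5.0):
    """Send one HEAD request; http.client does not follow redirects, so Location is visible."""
    parts = urlsplit(origin)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", WELL_KNOWN_PATH)
        response = conn.getresponse()
        return policy_url_from_response(origin, response.status, response.getheader("Location"))
    finally:
        conn.close()


print(policy_url_from_response("https://site.example", 302, "/privacy"))
print(policy_url_from_response("https://site.example", 404, None))
```

No response body is fetched and no HTML or JavaScript is processed, which is where the cost reduction described above would come from.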

domenic pushed a commit to whatwg/html that referenced this issue Nov 6, 2023
The privacy-policy link type refers to a document which contains information about the data collection and usage practices that apply to the current context.

This link type was defined in section 4 of RFC 6903 (https://datatracker.ietf.org/doc/html/rfc6903#section-4), and rediscovered in a discussion at privacycg/proposals#39.
domenic pushed a commit to whatwg/html that referenced this issue Nov 6, 2023
The terms-of-service link type refers to a document which contains information about the agreements between a document's provider and users who wish to use the document provided.

This link type was initially defined in RFC 6903 section 5 (https://datatracker.ietf.org/doc/html/rfc6903#section-5), then rediscovered in a discussion in privacycg/proposals#39. See also https://mikewest.github.io/privacy-policy-discovery/.
7 participants