transparency to users about cohort interpreted meanings #101
Yes, the UX question of how to offer good transparency into the meaning of cohort IDs is extremely interesting. In part this is bound up in the question of how a cohort is calculated. The clustering algorithm in the current Origin Trial (…). But that all completely changes if cohorts are, for example, clusters based on topics you're interested in (whether entered by the user or inferred by the browser).
The "some confidence" part here is the tricky bit, of course. We could let consumers make claims about what they infer, but if we're worried about malicious use of the signal then I don't think this helps.
Very interesting. Is there any web platform precedent? Obviously we don't want to risk going the way of P3P, where everyone makes a required disclosure that says "we do not make a required disclosure".
While some threats from cohort identifiers come from use by malicious actors, there are also non-malicious consumers for whom transparency could be meaningful to a user.
I tend to think that inferences from the history of P3P compact policies have been overstated. However, it does seem likely that if disclosures were required in an automated way but never checked or used in any other way, there would be an incentive towards false or empty disclosures. Here's one relevant proposal I read recently that connects a machine-readable policy assertion to further access to fingerprintable APIs:
Happy to connect you and @bslassey if that would be helpful ;)
Heh, yes, I've met that @bslassey guy. And certainly I'm open to policy assertions having a role to play, especially past the limits of where we can rely purely on technical protections.

The assertions relevant to Willful IP Blindness are of the form "We don't do X", and they can be backed up by an audit that is plausibly able to tell whether the server actually does X or not. So what is the analogy here? It seems to me that your "parties who disclose their inferences from it" would have to mean something like: any party that uses FLoC must host an endpoint that responds to a request of the form `https://adtech.example/wtflock_is/12345.chrome.2.1` with a best-effort human-readable explanation of what they think that cohort means.

That sounds like a fascinating effort in ML Explainability. But the parallel to P3P does seem warranted here: the browser (or a human auditor) could check that these pages exist, but I don't know how to tell that the information on those pages really is responsive to the question being asked or truly embodies what the party believes. Is this along the lines of what you're imagining, or am I completely off base here?

[edit: corrected to @bslassey instead of blassey, who is probably confused about why I claimed to know her 😳]
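As a purely illustrative sketch of the disclosure-endpoint idea above: assuming the `/wtflock_is/<cohort>.<browser>.<major>.<minor>` path shape from the example URL, a party's server-side lookup might reduce to something like the following (the `EXPLANATIONS` table and `explain_cohort` helper are invented names, not anything from the proposal):

```python
import re
from typing import Optional

# Hypothetical table of best-effort, human-readable cohort interpretations.
# A real ad-tech party would presumably generate these from its own models.
EXPLANATIONS = {
    (12345, "chrome"): "We associate this cohort with above-average interest "
                       "in motorcycles and outdoor sports.",
}

# Matches paths like /wtflock_is/12345.chrome.2.1
PATH_RE = re.compile(r"^/wtflock_is/(\d+)\.([a-z]+)\.(\d+)\.(\d+)$")

def explain_cohort(path: str) -> Optional[str]:
    """Return a best-effort explanation for a cohort-disclosure request
    path, or None if the path has the wrong shape or the cohort is unknown."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    cohort_id, browser = int(m.group(1)), m.group(2)
    return EXPLANATIONS.get((cohort_id, browser))
```

An auditor could confirm that such an endpoint exists and returns text for known cohorts; as the comment above notes, nothing in this mechanism shows whether the returned text reflects what the party actually believes or does.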
Revealing cohort inferences might also reveal sensitive cohorts that are not pre-screened by the browser, including cohorts that are not sensitive in the context they're used but may be sensitive in other parts of the world. (For example, a California supermarket might not flag beef or pork shoppers as sensitive cohorts.) Related: #71
If cohorts are based on browsing history, isn't the meaning of a cohort just the set of sites that contributed to it? Any meaning like "shoe lover" is not revealed by the cohort itself, but learned in an experiment, where an experiment is "try to show that cohort some ads and see what happens", and so meanings will be DSP-specific (or "party-that-uses-FLoC"-specific in general).
Something along those lines, yes. We could speculate on other designs (e.g. publishing a full list rather than querying a single value, or returning a set of information about how an ad was targeted that included but wasn't limited to the interest cohort interpretation).
I don't know that any groundbreaking research work is needed here. For one example, Google provides an Ad personalization interface (maybe previously known as the "ad preferences manager") with at least some of the inferences that they've drawn about a logged-in user from their browsing activity (and other data sources), presented in human-readable text. If the only use-case for interest cohorts is targeting advertising to groups based on inferences about those groups, then it seems that consumers should, definitionally, be able to provide the inferences they are acting on.
Auditing would be required, and auditors would need some access to internal systems to have confidence about how the data is being used, but that seems very similar to the access needed to confirm IP blindness. While completely external auditing (just confirming that pages exist and that they seem to return plausible or consistent results) may not provide the same level of confirmation, having that information documented and actually exposed to end users (as opposed to the brief P3P CP historical example) could provide a hook for regulatory intervention if that information were falsified.
I think that's too limiting a notion of "meaning". Consider a somewhat extreme version of cohort assignment, in which all FLoC does is give each person a random cohort ID, shared with 2000 other people. Now there is no information about you that contributed to the cohort assignment, so you might think it automatically has zero meaning.

But now suppose 5% of people are interested in motorcycles. Then in each cohort, you'd expect 100 motorcycle enthusiasts on average, but it might be higher or lower just due to random chance. The probability of a given cohort having >133 is about 0.05%; so if there were 64K cohorts total, then around 30 of them would "mean" that a person in that cohort is 1/3 more likely to be into motorcycles than people in general.

Of course we don't intend to assign cohorts randomly, we intend them to be influenced by browsing behavior. So maybe that leads to a situation where some cohorts contain 10% or 15% motorcycle enthusiasts, 2x or 3x the base rate. From an advertising utility point of view, that's probably useful: I'd expect motorcycle companies would like their ad dollars to be twice or three times as effective. But even in this case, does being in such a cohort "mean" that you are into motorcycles? Of course not; indeed 85% or 90% of the people in the cohort are not.

@npdoty How does this example map onto your disclosure ideas?
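The tail probability in the random-cohort example above can be checked directly. A minimal sketch, using the figures from the comment (cohorts of 2000 people, a 5% base rate, a threshold of more than 133 enthusiasts, 64K cohorts); the `binom_tail` helper is my own name for the calculation:

```python
import math

def binom_tail(n: int, p: float, k: int) -> float:
    """P(X > k) for X ~ Binomial(n, p), summed in log space so the
    huge binomial coefficients never overflow a float."""
    total = 0.0
    for i in range(k + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return total

# A cohort of 2000 people with a 5% base rate of motorcycle enthusiasts:
# how likely is a given cohort to contain more than 133 of them?
tail = binom_tail(2000, 0.05, 133)   # roughly 0.05%
flagged = 65536 * tail               # out of 64K cohorts, around 30
```

The exact tail comes out a little under 0.05%, and multiplying by 64K cohorts gives roughly 30 cohorts with that elevated rate, matching the figures in the comment.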
I think at this point we don't know what the probabilities are likely to be for various categories, under the domain-hashing approach or under alternative algorithms. I don't think it eliminates the privacy concern to say "we don't know with a high probability that you are into motorcycles, but we do think you are X% more likely than the general population to be into motorcycles". People have privacy concerns about incorrect or uncertain conclusions made about them. But this is a useful explanation of why the inferences to marketing categories are not identical to the list of relevant domains -- in some cases users might easily understand (ilikemotorcycles.com) but in other cases they may very much not.

Going further, if the ad preferences categories were displayed in the browser UI (or even explicitly selected by the user as intentional interests), that might narrow that gap significantly. If a user selects {"interested in motorcycles", "shopping for men's apparel", "home improvement"}, the user may realize that some consumers of that information will draw additional inferences, some correct and some incorrect, some the user can anticipate and occasionally some the user didn't anticipate at all. But those meanings are more transparent and useful than domain lists or numeric codes.
Yes, if the browser had an intrinsic way of associating cohorts with interests (or if it observed interests and derived cohorts from those), this would be a very powerful approach. My example with motorcycles vs. random cohort assignment was intended to show why this kind of explanation might be wholly impossible using on-device info. But of course the truth is somewhere in between.
One way to support user privacy is to provide transparency to a user about what information they are disclosing using a particular technology.
On-device, in-browser learning could allow for an improvement in that kind of transparency -- a user can see exactly what identifier is being calculated and transmitted. However, opaque identifiers calculated from browsing history don't make that kind of inspection easy. (The explainer does argue that short names will show the user that "they cannot carry detailed information" -- a bold but dubious claim!)
Furthermore, if the design intends for separate parties to collect cohort IDs and browsing histories (perhaps combined with other data) and sell mappings to marketing sectors, the meanings could differ based on the recipient. This is true of many status quo systems, of course, but there is some existing work in ad management interfaces to disclose (with varying levels of detail) to the end user the inferences about them.
For example, testing with Chrome just now I can see that my cohort ID is 4724. What have I just revealed to you all? Could a browser (or some other party) give me some confidence about what is likely interpreted about my history from my id? Could access to my browser-generated cohort ID be limited to parties who disclose their inferences from it?