This repository has been archived by the owner on Mar 16, 2023. It is now read-only.

Systems of equations attack to recover visited websites? #112

Open
lknik opened this issue May 7, 2021 · 2 comments

Comments

@lknik

lknik commented May 7, 2021

I'm building on this issue by @johnwilander.

In https://github.com/WICG/floc/issues/99, it is stated that "FLoC is not useful for tracking." I don't think that's accurate.

As far as I know, the user's cohort will not be partitioned per first party site so multiple sites can observe the cohort ID in sync as it changes week after week. A hash of the cohorts seen so far will likely get more and more unique as the weeks go by.

Websites or tracker scripts on websites can expose arrays of the cohorts they've seen to help all trackers identify the user, like this:

// Week label -> cohort ID observed on website A that week.
let cohortCollectionForWebsiteA = {
  "week01_2022" : "0666",
  "week03_2022" : "A566",
  "week04_2022" : "2111",
  "week05_2022" : "1171",
  "week07_2022" : "749B",
};

// Week label -> cohort ID observed on website B that week.
let cohortCollectionForWebsiteB = {
  "week01_2022" : "0666",
  "week02_2022" : "0030",
  "week05_2022" : "1171",
  "week06_2022" : "7311",
  "week07_2022" : "749B",
};

Trackers send these to a server for matching across websites; in the example above, the result is the intersection [ "week01_2022", "week05_2022", "week07_2022" ].
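The server-side matching step described in the quote could look roughly like the sketch below. It is only an illustration: the function name `matchingWeeks` and the two sample objects are hypothetical, and a real tracker would compare many more observations before treating a match as meaningful.

```javascript
// Hypothetical per-site observations: week label -> cohort ID seen that week.
const seenOnSiteA = {
  week01_2022: "0666",
  week03_2022: "A566",
  week04_2022: "2111",
  week05_2022: "1171",
  week07_2022: "749B",
};
const seenOnSiteB = {
  week01_2022: "0666",
  week02_2022: "0030",
  week05_2022: "1171",
  week06_2022: "7311",
  week07_2022: "749B",
};

// Return the weeks in which both sites observed the same cohort ID.
// A week missing from `b` yields undefined, which never equals a cohort string.
function matchingWeeks(a, b) {
  return Object.keys(a).filter((week) => a[week] === b[week]);
}

const sharedWeeks = matchingWeeks(seenOnSiteA, seenOnSiteB);
console.log(sharedWeeks); // [ 'week01_2022', 'week05_2022', 'week07_2022' ]
```

The longer the run of matching weeks, the less likely it is that two different visitors produced both sequences, which is the uniqueness concern raised in the quote.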

Just a slight thought.

What if we consider a slightly different threat/exploitation scenario (unless it's simply a flavour of the remarks quoted above, which is why I retain them here), linked to the risks I already pointed to? Specifically: reversing the cohort ID to obtain the websites that were actually visited.

The idea for improving such a reversal would follow this reasoning: we know that the set of sites visited in week i (i = a given week of the year) corresponds to a specific ID_i. The FLoC computation takes website addresses as its input. So I wonder whether it would be possible to construct mathematically a system of equations of the form:

FLoC(sites_1) = ID_1,
...,
FLoC(sites_i) = ID_i

And then use the properties of SimHash to obtain the visited websites, or at least to improve the inference of the visited websites. Note that I did not work out an analytical solution, so I do not know under what circumstances such a system of equations would be solvable. It would be interesting to consider it in the threat model, though, so I leave the proof as an exercise for the proponents. Thank you.
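To make the system-of-equations idea concrete, here is a toy brute-force sketch. Everything in it is an assumption for illustration: `toySimHash` is a simplified 8-bit SimHash over FNV-1a domain hashes, not Chrome's actual cohort algorithm; the eight-site universe and the domain names are hypothetical; and the second equation assumes the attacker knows one extra site (its own) that was added to an otherwise unchanged history.

```javascript
// 32-bit FNV-1a hash of an ASCII string.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Toy SimHash: each site votes +1/-1 on every output bit; the sign wins.
function toySimHash(sites, bits = 8) {
  const votes = new Array(bits).fill(0);
  for (const site of sites) {
    const h = fnv1a(site);
    for (let i = 0; i < bits; i++) votes[i] += ((h >>> i) & 1) ? 1 : -1;
  }
  let id = 0;
  for (let i = 0; i < bits; i++) if (votes[i] > 0) id |= 1 << i;
  return id;
}

// Enumerate every subset of a (small) universe of candidate sites.
function* subsets(universe) {
  for (let mask = 0; mask < (1 << universe.length); mask++) {
    yield universe.filter((_, i) => (mask >> i) & 1);
  }
}

// Keep the histories S for which toySimHash(S + extra) = id holds in every equation.
function candidates(universe, equations) {
  return [...subsets(universe)].filter((s) =>
    equations.every(({ extra, id }) => toySimHash([...s, ...extra]) === id));
}

const universe = ["news.example", "shop.example", "mail.example", "bank.example",
                  "video.example", "forum.example", "maps.example", "games.example"];
const trueHistory = ["news.example", "shop.example", "mail.example"];
const knownSite = "tracker.example"; // visited in week 2, known to the attacker

const week1 = { extra: [], id: toySimHash(trueHistory) };
const week2 = { extra: [knownSite], id: toySimHash([...trueHistory, knownSite]) };

const afterOneWeek = candidates(universe, [week1]);
const afterTwoWeeks = candidates(universe, [week1, week2]);
console.log(afterOneWeek.length, afterTwoWeeks.length);
```

Each additional equation can only shrink the candidate set, never grow it. Note that in this toy model an identical history in two weeks yields an identical ID and therefore no new constraint; any extra information comes from known changes between weeks. The search is also exponential in the size of the site universe, so it says nothing about feasibility at real scale.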

@michaelkleber
Collaborator

Hi Lukasz, I don't quite understand what you're proposing here.

For a single FLoC cohort, per the data here, there are hundreds of different browsing histories that lead to being in the same cohort (at least 735, for the "Chrome.2.1" clustering algorithm). It seems like you're trying to make a stronger inference by looking at one person's sequence of cohorts over time. Do you think the addition here is some mathematical model of the likelihood of browsing history week-to-week consistency vs variability?

From a Bayesian point of view, each time you observe someone's cohort, you can update your belief that the person has visited some particular website. So before I've seen you at all, my guess at P(you visit facebook every week) is some baseline probability p0, say 44% if you're in London. Maybe after seeing someone's FLoC once, that would change to 51%, and after seeing their FLoC every week for a year it would be 58%. That could be influenced by some sort of algebraic system-of-equations approach, or just by observing the behavior of the people in a FLoC. I would expect the observational method to be better, because the algebraic one would run into a skyrocketing number of unknowns about all the other sites someone might have visited.

Or am I completely missing your point here? Obviously all these numbers are totally made up, but they represent the kind of information extraction that I can imagine.
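The Bayesian update described above can be sketched in odds form. The likelihood ratio used here (1.3) is a made-up value chosen only so that the 44% prior from the comment moves to roughly the 51% mentioned there; `updateBelief` and `updateOverWeeks` are hypothetical names, not part of any FLoC API.

```javascript
// Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio,
// where likelihoodRatio = P(cohort | visits site) / P(cohort | does not visit).
function updateBelief(prior, likelihoodRatio) {
  const priorOdds = prior / (1 - prior);
  const posteriorOdds = priorOdds * likelihoodRatio;
  return posteriorOdds / (1 + posteriorOdds);
}

// Apply one update per observed week. This treats the weekly observations as
// independent, which is an assumption: real weekly cohorts are correlated.
function updateOverWeeks(prior, weeklyLikelihoodRatios) {
  return weeklyLikelihoodRatios.reduce(updateBelief, prior);
}

const baseline = 0.44;                              // p0 from the comment above
const afterOneCohort = updateBelief(baseline, 1.3); // assumed likelihood ratio
console.log(afterOneCohort.toFixed(2));
```

Because consecutive cohorts largely repeat the same evidence, later weeks would carry likelihood ratios much closer to 1 than the first one, which is consistent with the modest growth from 51% to 58% suggested in the comment.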

@lknik
Author

lknik commented May 7, 2021

For a single FLoC cohort, per the data here, there are hundreds of different browsing histories that lead to being in the same cohort (at least 735, for the "Chrome.2.1" clustering algorithm). It seems like you're trying to make a stronger inference by looking at one person's sequence of cohorts over time. Do you think the addition here is some mathematical model of the likelihood of browsing history week-to-week consistency vs variability?

Yes, I wonder what impact observing FLoC IDs over several weeks would have on the possible discovery of the browsed sites. Specifically: does it decrease the likelihood of certain collisions for specific users?
Also, what if we assume a user who visits only a fairly limited number of websites per week (knowing that histories are fairly stable)?

From a Bayesian point of view, each time you observe someone's cohort, you can update your belief that the person has visited some particular website. So before I've seen you at all, my guess at P(you visit facebook every week) is some baseline probability p0, say 44% if you're in London. Maybe after seeing someone's FLoC once, that would change to 51%, and after seeing their FLoC every week for a year it would be 58%. That could be influenced by some sort of algebraic system-of-equations approach, or just by observing the behavior of the people in a FLoC. I would expect the observational method to be better, because the algebraic one would run into a skyrocketing number of unknowns about all the other sites someone might have visited.

It's a fair way of seeing things, though not the one I had in mind. But these ideas could perhaps help with the hypothesis above.
