background - RFP (resistFingerprinting) is Tor Browser's upstream fingerprint protection, built into Firefox
Before we talk about fingerprinting, we need to know its purpose - how it constitutes a threat. Purposes include preventing fraud (device authentication) and detecting bots. These are important to keep in mind as we add resistance, because we don't want to cause adversarial impacts - i.e. we don't want users getting repeatedly flagged as bots and denied services or entry to websites.
The bad news: fingerprinting is also used to facilitate web tracking - and tracking is what we're focused on here.
This presentation is in two approximately 40-minute halves, with a 15-minute break in the middle and time at the end for any questions. We'll start with understanding what tracking and fingerprinting are, then after the break move on to how resistance and testing work.
From Wikipedia, web tracking is, quote, "the practice by which operators of websites and third parties collect, store and share information about visitors' activities on the World Wide Web", end quote.
This information can be collated to build profiles. A profile could be based on a device, a browser, an IP address, to name a few - or any combination of them. Profile data points or metadata can de-anonymize a person
- e.g. a 2019 paper on Americans found 99.98% would be correctly re-identified in any dataset using 15 demographic attributes
Tracking is not just creepy and none of their business - the threat is that it can easily de-anonymize users.
There are a few different tracking vectors, for example
- navigational tracking
- correlation/timing attacks
- IP address
- logging in, including with SSO (Single Sign-On) - i.e. giving away an ID
- state tracking
- ... and more
We will focus on state tracking to illustrate how tracking works.
STATE is anything stored client-side (in your browser) that is created and persisted by a website - it can be written to disk or stored in session memory. Each website or tracker can ask for and receive its own state for as long as that state exists. A typical example is a "cookie".
1st party vs 3rd party
- 1st party = the website you visit
- 3rd party = all other connections providing or facilitating content, such as scripts, images, media, discussion boards, avatars, and adverts
First-party tracking would identify you as a repeat visitor, but third-party or cross-site tracking can link your traffic across all the sites that tracker is on.
Here's how cross-site tracking works. All the "cookies" are in a single "cookie jar".
Take sites A, B and C. On all three sites the same 3rd party tracker exists. The first time you meet the tracker, it will assign you a cookie and an ID.
Each time you visit A, B or C (first party), the tracker (3rd party) can look up its cookie, get your ID, and then use your ID to link your traffic, the date and time of the visit, your IP address: i.e. profile building (sketched below).
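To make this concrete, here's a minimal sketch of such a tracker script. Imagine it running inside the tracker's iframe on sites A, B and C, so `document.cookie` is the tracker's own jar, shared across all embedding sites when nothing is partitioned. The names (`trackerId`, `tracker.example/collect`) are hypothetical:

```ts
// Minimal sketch of how a third-party tracker links visits (hypothetical names).
function getOrAssignTrackerId(): string {
  // Look for our cookie in the (shared, unpartitioned) cookie jar.
  const match = document.cookie.match(/(?:^|;\s*)trackerId=([^;]+)/);
  if (match) return match[1]; // repeat visitor: same ID on site A, B and C

  // First encounter: mint a new ID and persist it.
  const id = crypto.randomUUID();
  // SameSite=None; Secure is what lets this cookie travel in a 3rd-party context.
  document.cookie = `trackerId=${id}; max-age=31536000; SameSite=None; Secure`;
  return id;
}

function reportVisit(): void {
  const id = getOrAssignTrackerId();
  // Link this first-party visit (hostname, time; IP is seen server-side)
  // to the same profile: this is the cross-site linkage.
  navigator.sendBeacon("https://tracker.example/collect", JSON.stringify({
    id,
    site: document.referrer || location.hostname,
    when: Date.now(),
  }));
}

reportVisit();
```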
On the right we have a list of items that are STATEFUL. We can break these down into two sets: site data and the rest.
- site data, or storage APIs, are cookies, localStorage, indexedDB, etc
  - e.g. in Firefox, if you "clear cookies and site data" or choose "forget about this site", these are the items it would clear. We call this sanitizing
  - and sanitizing - clearing the tracker's "site data" - is the way to "reset" the tracker (lose your tracker ID)
- the rest - e.g. various caches, etc
  - all of these (i.e. non-site-data items) have either been used, or been shown they can be used, to re-identify users even after sanitizing - because they are not site data, they are not sanitized
In other words, a tracker can use both site data _AND_ something else that is stateful to hold your ID, and should you sanitize, it can recreate it - this is known as a zombie cookie or supercookie.
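A hedged sketch of the zombie-cookie pattern. Note: localStorage is itself site data and stands in here purely to keep the example self-contained and runnable - real-world supercookies abuse channels that sanitizing misses (HTTP cache/ETags, HSTS flags, etc). All names are hypothetical:

```ts
// Illustrative zombie/supercookie pattern: keep the ID in TWO places and
// resurrect whichever copy the user cleared.
const KEY = "zombieId"; // hypothetical name

function readCookie(): string | null {
  const m = document.cookie.match(new RegExp(`(?:^|;\\s*)${KEY}=([^;]+)`));
  return m ? m[1] : null;
}

function resurrect(): string {
  // Prefer the cookie; fall back to the secondary stateful copy.
  let id = readCookie() ?? localStorage.getItem(KEY);
  if (!id) id = crypto.randomUUID(); // genuinely new visitor

  // (Re)write BOTH copies - if only one was sanitized, the surviving
  // copy just recreated the other: a "zombie cookie".
  document.cookie = `${KEY}=${id}; max-age=31536000`;
  localStorage.setItem(KEY, id);
  return id;
}

console.log(resurrect());
```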
Over the years a number of tracking mitigations have been developed. These include
- masking your IP address (VPN / Tor)
- sanitizing (clearing of "cookies and site data")
  - e.g. Incognito and Private Browsing windows sanitize on close
- other
  - blocking known trackers
  - changing or new web standards
  - deprecating (or blocking) 3rd party "cookies"
And then we have state partitioning
- Tor Project and Mozilla developed what is known as FPI (First Party Isolation) specifically for Tor Browser - a version of this is now used in most browsers - see previous pic
State partitioning means every website (first party) has everything required for that site isolated in its own container.
Let's look at that cross-site STATE tracking now everything is partitioned - that big list, all the cookies and site data AND all the other state used for zombie cookies - every first party or website has its own cookie jar.
Take sites A, B and C. On all three sites the same 3rd party tracker exists. The first time you meet that tracker ON EACH SITE, it will assign you a cookie and a DIFFERENT ID.
The tracker cookie is no longer shared across websites. The 3rd party tracker is now effectively a first party tracker (see the sketch below).
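A conceptual sketch of what partitioning changes. This is not a real browser API - just the double-keyed data structure idea, with hypothetical site names:

```ts
// State is keyed by (first party, embedded origin), so the same tracker
// embedded on two sites gets two unrelated jars.
type CookieJar = Map<string, string>; // cookieName -> value

const partitions = new Map<string, CookieJar>();

function jarFor(firstParty: string, embeddedOrigin: string): CookieJar {
  const key = `${firstParty}^${embeddedOrigin}`; // the "double key"
  let jar = partitions.get(key);
  if (!jar) {
    jar = new Map();
    partitions.set(key, jar);
  }
  return jar;
}

// tracker.example embedded on two different first parties:
jarFor("siteA.example", "tracker.example").set("id", "id-123");
jarFor("siteB.example", "tracker.example").set("id", "id-456");
// Same tracker, two partitions, two different IDs - cross-site linkage broken.
```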
- IP address is protected and rotated
- ALL state, including zombie cookies, is partitioned
- we have sanitizing (for sessions)
Unfortunately, we have not won. In the previous section we talked about mitigating state tracking - say hello to stateless tracking.
Your browser provides information, consistently and on demand. Because it is stateless, sanitizing has no effect, partitioning has no effect, and to make matters worse, you have little to no control over it as a user.
In this part we'll explore what fingerprinting is - we need to understand it in order to resist it - think evil to defeat evil.
There are two general types
- passive or server-side - traffic originates from the browser and is sent automatically
  - e.g. HTTP headers, SSL/TLS (ciphers), TCP/IP (stack), IP address, requests for assets
- active or client-side - the traffic originates from the fingerprinter
  - e.g. JavaScript or CSS
Have you heard of "Where's Wally?" Here he is. In the game, Wally is unique. What makes him unique is a set of characteristics
- red and white bobble hat
- brown hair
- glasses
- red and white striped top
- blue pants
- walking cane
- brown boots
Think about how you search for Wally. You scan the image for a distinctive red and white bobble hat. When you find one, you check the hair, then the glasses, and so on. If you find a non-match, you stop checking and start over with the next red and white bobble hat. Eventually you find Wally.
But fingerprinters are not trying to find Wally, they're trying to find everyone. So unlike "Where's Wally", where you stop checking after something doesn't match, fingerprinters treat all users the same, check everything, and record everything. A fingerprinting script will typically get 50 to 100 metrics (depending on how you count them) in 100ms or less.
The combined metrics (information) create the fingerprint
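To make "check everything, record everything" concrete, here's a minimal sketch of a naive active fingerprinting script. It is hypothetical and heavily trimmed - real scripts collect far more metrics and handle missing APIs, workers, iframes, etc:

```ts
// Gather a handful of metrics and hash them into one compact identifier.
async function fingerprint(): Promise<string> {
  const metrics = [
    navigator.userAgent,
    navigator.language,
    navigator.hardwareConcurrency,
    screen.width, screen.height, screen.colorDepth,
    window.devicePixelRatio,
    Intl.DateTimeFormat().resolvedOptions().timeZone,
    new Date().getTimezoneOffset(),
    matchMedia("(prefers-color-scheme: dark)").matches, // a binary metric
  ].join("|");

  // Hash the combined metrics so the server stores a single value.
  const digest = await crypto.subtle.digest(
    "SHA-256", new TextEncoder().encode(metrics));
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0")).join("");
}

fingerprint().then(console.log);
```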
We like to use the term "bucket" - it's grouping by metric or fingerprint values.
Let's take Tor Browser desktop as a simple example
- 3 userAgents (one each for Windows, Mac, Linux)
- 8 likely screen dimensions with letterboxing
- 41 languages
That's 984 (3 x 8 x 41) potential buckets _so far_ - and we've only checked three metrics, and those are metrics we've added fingerprint resistance to! Every additional metric that differentiates any browser users multiplies the potential buckets. Even a binary choice such as prefers-color-scheme (light or dark) doubles it.
Keep in mind, with many more metrics added, not all potential buckets can exist (some can be paradoxical), and not all the remaining potential buckets will have users. The point is the number of combinations grows very rapidly.
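The multiplication as a trivial snippet, just to underline how fast it grows:

```ts
// Each independent metric multiplies the number of potential buckets.
const metrics = { userAgents: 3, screenSizes: 8, languages: 41 };
let buckets = Object.values(metrics).reduce((a, b) => a * b, 1);
console.log(buckets); // 984
buckets *= 2;         // add one binary metric, e.g. prefers-color-scheme
console.log(buckets); // 1968
```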
A 2016 study showed that 99.24% of users could be uniquely identified, and that was without the IP address. The metrics used were limited and the script was not sophisticated. It's now 9 years later, and fingerprinters haven't been sleeping - scripts have only gotten faster, more common, more sophisticated, more accurate, and more effective (and users still have little to no control). If you do nothing, you're going to be uniquely identifiable given a good enough fingerprinting script.
Fingerprinting is also a growing threat
- top of my head, don't quote me: in the space of about 7 years, it's estimated that the percentage of the top 10k sites utilizing fingerprinting has gone from about 1% to 25%
- and a recent study has shown the methods to scrape and calculate this miss up to half of them
- there is also huge money being thrown at it due to browser mitigations and the rise of bots etc (e.g. AI scrapers)
  - e.g. fingerprint.com's revenue grew over 3.5 thousand percent last year (2024)
When people or articles talk about fingerprinting, they talk about entropy. So what is it? Ask 10 different people (mathematician, physicist, statistician, etc) what entropy is and you'll get 10 different answers. Basically, it is information theory.
We don't really like to think of or use entropy as such - we prefer to think in buckets and uniformity. If a metric splits any users into more buckets, then it adds entropy. A fingerprinting script will get as much information as it thinks it needs, by testing metrics that add entropy/buckets. Here's the formula used to calculate the entropy (see below).
And here's my definition of entropy: entropy is the average uncertainty (using probabilities), represented in the least amount of bits. The higher the probability, the lower the entropy
- e.g. if it's 100% certain (i.e. the probability is 1), then it's not adding any new information, so the entropy is zero
- e.g. if the probability is less than 1, i.e. there is more than one choice, then there is uncertainty, and the lower the probability, the higher the entropy
2 to the power of 33 is 8.59 billion (which is more than the number of humans on the planet), so it is said that if you have 33 bits of entropy then theoretically you are unique. This is not a good analogy: not everyone is online or using browsers, and we're not identifying people, we're identifying browser profiles (multiple devices, multiple browsers and profiles). But you get the idea... high entropy bad, low entropy good... high bucket numbers bad, low bucket numbers good.
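For reference, the formula referred to above is the standard Shannon entropy, where p(x_i) is the probability of bucket/value x_i; the "bits of information" for a single value is its surprisal:

```latex
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
\qquad\text{and}\qquad
I(x_i) = -\log_2 p(x_i)
```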
Let's look at an example. And being the super nice guy I am, I've done all the math for you (a sketch to reproduce it follows below). Imagine a metric that can only be true or false
- 100% true: 0 bits of entropy
- 50% true and 50% false: **1** bit of entropy (**maximum**)
  - bits of information: both **1**
- 25% true and 75% false: **0.811** bits of entropy (**average**) - notice it is **lower** than the uniform case
  - bits of information: true **2** | false **0.42**
Ideally we want uniform or maximum entropy. But wait, the uniform case has a higher average entropy! This is true, but the lower the probability, the higher the bits of information, so as buckets get smaller and smaller, they rapidly gain higher bits of information. In other words, **average** entropy isn't all that helpful when it comes to what I like to call the long thin tail: lots of users in big buckets, great... down to smaller and smaller buckets of users, who very quickly become disproportionately disadvantaged.
Unfortunately, we can't always control uniformity, but it is something we should strive for.
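A quick sketch to reproduce those numbers (any language would do; TypeScript here to match the other examples):

```ts
// Shannon entropy: average surprisal over the distribution.
const entropy = (probs: number[]) =>
  probs.reduce((h, p) => (p > 0 ? h - p * Math.log2(p) : h), 0);

// Surprisal ("bits of information") of a single outcome.
const surprisal = (p: number) => -Math.log2(p);

console.log(entropy([1]));           // 0 bits - 100% certain
console.log(entropy([0.5, 0.5]));    // 1 bit  - maximum for a binary metric
console.log(entropy([0.25, 0.75]));  // ~0.811 bits
console.log(surprisal(0.25));        // 2 bits     - the small bucket
console.log(surprisal(0.75));        // ~0.42 bits - the big bucket
```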
If a metric doesn't add any **new** information, it's equivalency. e.g. Mac | _-apple-system_ font
In other words, it is 100% certain (100% probability) that if you are on a Mac you have the -apple-system font, and if you are not on a Mac you won't - therefore the entropy of having this font, or not having it, is zero.
There are metrics that cannot be hidden: e.g.
- the browser engine (Gecko, Blink, WebKit, EdgeHTML, Trident)
- browser version (feature detection and changes that cannot be user controlled)
- platform (Windows, Mac, Linux or Android)
- font enumeration (a font either exists or is aliased to one that exists, or it doesn't exist or is blocked)
There are metrics you cannot "lie" about to websites, i.e. the web content uses those values and this affects the user experience: e.g.
- web-content languages: no point requesting English when you only speak Arabic
There are metrics that are impractical or currently not feasible to do anything with.
These "base" metrics are a known limitation. When a metric only reinforces them, i.e. it doesn't add any new information, it's equivalency.
Websites want fingerprinting scripts to be
- Seamless
  - no unexpected events like prompts alarming users
  - no breakage, unhandled errors, etc
- Performant
  - fast, and not affecting the website's performance
  - e.g. the script can be fetched/run after other events
- Unblockable: they want the advertising revenue
Fingerprinters want their scripts to be universal - the more widely spread a script is, the more traffic it can easily link. Fingerprinters also want their metrics to be
- Robust
  - handle any browser, any browser configuration, and any extension
- Correct
  - produce a correct result
- Reproducible
  - produce the same result when run again or in a new session (without any browser/OS changes)
- Stable
  - the more stable a metric, the _easier_ it is to weight, or to link over time
  - a quick note about stability: a metric or fingerprint is a snapshot in time and doesn't HAVE to be stable
    - metrics can change for a number of reasons
      - e.g. in a session: such as zoom, resizing windows
      - e.g. over time: such as updating the browser (which changes the userAgent), installing new system fonts, updated graphics drivers, etc
    - metrics can be weighted and changed server-side
      - e.g. an IP address can be recorded in full and an additional property added server-side to reflect "tor node", "MullvadVPN", "proxy", etc
    - there is a whole science dedicated to linking changing fingerprints over time
- Add Entropy
  - a metric that adds no entropy is redundant
- Universal
  - the more universal a metric (and its name and format), the easier it can be used by data brokers to link fingerprints and profiles
Fingerprinting resistance relies on a "crowd" (so users can have shared fingerprints). A crowd is a set of users being protected by an overall fingerprint resistance strategy: such as Tor Browser users with RFP and additional patches, or Brave users using Shields' fingerprinting protections.
There is no such thing as "no fingerprint". There is always fingerprint data, even without JavaScript. And the fingerprint protections _and techniques per metric_ in each "crowd" are themselves fingerprintable.
There is also no such thing as "defeating fingerprinting"
- this is why RFP is called _resist_ fingerprinting and not _defeat_ fingerprinting. It is an ongoing process, not a zero-sum game
And there is no such thing as "a single fingerprint" for all users in a crowd
- there are things we cannot hide or lie about, so there will always be differences
There are five steps fingerprint resistance can take
1. block known scripts - the best fingerprinting code is no code, but this is easily bypassed, is a form of enumerating badness, is inherently reactive, and can undermine legitimate uses (anti-fraud and bot-detection), causing breakage or rejection
2. engineer solutions to remove the fingerprint problem
   - e.g. Firefox ships, for all users, the same math library for audio across all platforms. This removes the audio entropy caused by floating points, and the only differences left are equivalency of platform architecture (which we can't hide)
3. help change existing web standards and shape/reject new ones
4. grow the crowd: the bigger the crowd, the more users you hopefully share a fingerprint with, to help hide your traffic
5. reduce buckets
To reduce buckets, we reduce/limit the values returned (e.g. timezoneName) or the resources used to determine values (e.g. fonts). When we do this, we need to keep in mind the types of values we return
- an adversarial fingerprint or value is one that differs from benign (i.e. known real) ones
  - e.g. there are a finite number of benign audio results per browser engine and platform, which fingerprinters know - so changing those, e.g. by randomizing that metric, stands out
  - e.g. returning touch capabilities on a Mac - there is no such thing (yet)
- adversarial results also include paradoxes, where metrics conflict
  - e.g. returning "Windows" in the userAgent, but having an "-apple-system" font
Adversarial fingerprints can trigger anti-fraud and bot-detection scripts, so we should always use benign **values**
- note: this does not guarantee a non-adversarial outcome - e.g. some metrics we cannot fully protect, so if a script also detects the real value, the two cause a paradox
Besides blocking known scripts, engineering solutions, shaping web standards, and growing the crowd, we can reduce buckets. Fingerprint resistance reduces buckets per metric by
- protecting the **real value**
- limiting the values returned (e.g. userAgent) or the resources (e.g. fonts)
As more metrics become resistant, it gets harder and more costly for fingerprinters to gain enough buckets
- they will need more methods, more tests, more metrics
- and they will take a performance hit
Resistance must come from built-in browser solutions. Extensions are not suitable, as they
- often lack the APIs to properly protect the real values
- have no default crowd
- can provide additional bits of information (entropy) via prototype, proxy and other detectable tampering, can create performance issues and cause website breakage, and are likely to trigger anti-fraud and bot-detection scripts due to adversarial fingerprints
The function of fingerprint resistance is ultimately to lower buckets for its crowd. Raising entropy by randomizing results can always be detected in a crowd (I'll cover this later), and is treated as reducing buckets, given any script could detect it.
When limiting values there are two outcomes: they are either truthful or spoofed (lies). From a user perspective, both technically "break" web standards by not returning the original or expected (**real**) values. But that is not always the case in the context of what websites and servers can see.
- limitations are **robust** - fully implemented to cover **all methods and sources**
  - methods: all ways to determine or infer the value
  - sources: all origins: document, iframes, workers, service workers, etc
  - therefore it is always "truthful"
  - disabling an API is also considered a limitation, and truthful
There should be no difference in any fingerprinting between using a **real** value vs the same **limited** value
- e.g. changing your system's timezone to Atlantic/Reykjavik vs enabling RFP on any other system timezone
- any timezone-related metrics should be identical: such as timezoneName, timezone offsets, the timezone name component in formatted dates, accessible timestamps (see the sketch below)
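A quick illustration of what "all methods" means for the timezone example - a hedged sketch of the kind of cross-checks a script could run. With a robust limitation, every one of these tells the same story (with RFP, consistently Atlantic/Reykjavik, UTC+0):

```ts
// Several independent ways a script can probe the timezone.
const tz = Intl.DateTimeFormat().resolvedOptions().timeZone; // e.g. "Atlantic/Reykjavik"
const offset = new Date().getTimezoneOffset();               // minutes from UTC, e.g. 0
const stamp = new Date().toString();                         // e.g. "... GMT+0000"
const longName = new Intl.DateTimeFormat("en", { timeZoneName: "long" })
  .format(new Date());                                       // formatted-date tz component

console.log({ tz, offset, stamp, longName });
// A mismatch between any of these - or the same checks run inside a
// worker or iframe - would expose a non-robust limitation.
```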
Disabling an API should be a last resort, as it can lead to website breakage when scripts expect a web standard to exist.
Examples
- geolocation is behind a prompt, so it is a user choice. However, allowing location data is considered by Tor Browser to be too risky, so the API is disabled. All users are the same on this metric
- font enumeration (a measurement-based sketch follows below)
  - limit available fonts per platform (i.e. Windows, Linux, Mac, Android) to system fonts (provided by the platform)
  - limit the available system fonts as much as possible to those expected across _each_ platform (e.g. for Windows, cover Windows 10 and 11), but try to cover modern writing systems
  - bundle fonts with the browser (to cover writing-system gaps)
    - e.g. in desktop Tor Browser most fonts are bundled to provide comprehensive coverage for writing systems. Some system fonts are also allowed on Windows and Mac, especially for CJK (Chinese, Japanese, Korean), to save on package sizes, and others for platform consistency ("look and feel", e.g. widgets)
  - all users are (hopefully) the same per desktop platform, which is equivalency
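As referenced above, here's a hedged sketch of how font enumeration typically works - width measurement against a fallback - which is why limiting the available fonts per platform reduces buckets. `hasFont` is a hypothetical helper; real scripts test hundreds of candidates and multiple fallbacks:

```ts
// Render a probe string in a candidate font with a generic fallback, and
// compare the width against the fallback alone. A width difference means
// the candidate font resolved, i.e. it exists (or is aliased to something
// with different metrics).
function hasFont(candidate: string): boolean {
  const ctx = document.createElement("canvas").getContext("2d")!;
  const probe = "mmmmmmmmmmlli"; // wide + narrow glyphs amplify differences

  ctx.font = "32px monospace";
  const fallbackWidth = ctx.measureText(probe).width;

  ctx.font = `32px "${candidate}", monospace`;
  // Caveat: a candidate with metrics identical to the fallback gives a
  // false negative - real scripts mitigate this with several fallbacks.
  return ctx.measureText(probe).width !== fallbackWidth;
}

console.log(hasFont("Arial"), hasFont("Noto Sans"));
```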
- limitations are **not robust** - or the web standard is broken (i.e. not truthful, e.g. canvas)
  - results are therefore always "lies"
_**Spoofed** values are adversarial if exposed._ Because these types of limitations can be adversarial, we need to weigh the pros and cons of implementing them.
These are usually physical device constraints (e.g. devicePixelRatio, screen resolution), or cases where our solution broke the web standard (i.e. not truthful, e.g. canvas, where we alter what is read back from a canvas, and this is detectable).
> Examples (spoofed)
>
> - hardwareConcurrency
>   - Tor Browser currently returns 2 for all users (edit: since changed)
>   - This does not prevent the browser from using all available cores, which can be estimated with workers and timing attacks
> - screen dimensions
>   - Screen spoofs are typically based on a combination of screen and window dimensions for plausibility
>   - But the real screen resolution can be exposed (or inferred) if a user goes fullscreen (F11) or uses a fullscreenElement (e.g. clicking to view a video in fullscreen) - see the sketch below
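A hedged sketch of the fullscreen leak mentioned above - nothing here is Tor Browser code, just the kind of plausibility check a script could run:

```ts
// In fullscreen, the viewport (roughly) fills the real screen, so spoofed
// screen values can be cross-checked against it.
document.addEventListener("fullscreenchange", () => {
  if (!document.fullscreenElement) return;
  const claimed = { w: screen.width, h: screen.height };  // possibly spoofed
  const viewport = { w: window.innerWidth, h: window.innerHeight };
  // These should broadly agree in fullscreen; a large mismatch suggests
  // the screen values are spoofed (or letterboxed).
  console.log("claimed", claimed, "viewport", viewport);
});
```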
Fingerprinting resistance can lead to undesirable side-effects or outcomes. Solutions should aim to provide a good user experience as much as possible. In other words: try not to "break" anything
- always use **benign** values
- preferably be **truthful**, i.e. robust - covering all **methods** and **sources** (just like a **real** value)
- only **spoof** if required, knowing that it could be detected and be adversarial
- try not to give scripts any reason to break things with disabled APIs
- be mindful of usability and accessibility issues
The end result is for users to enjoy a seamless experience - and happy users maintain and grow the crowd.
In resistance terms, it does not matter if a metric's fingerprint resistance is random or static, as long as the real value is protected. But there are always pros and cons.
Randomizing is said to "raise entropy" - being different each time makes your fingerprint unstable, and fingerprinting scripts want to be stable. A script that doesn't detect randomizing (called a naive script) collects an unstable fingerprint. The more metrics randomized, the greater the chance that a script is naive.
Randomized results can always be detected in a crowd, and a script that does that is called an advanced script. It can do this by first-party mathematical proofs (e.g. known-pixel tests in canvas), by checking with third parties (see EFF's Cover Your Tracks), by inference (e.g. knowing you are Tor Browser and already knowing your exact protection techniques, because you are open source), or simply from deviations from expected or collected results (i.e. adversarial). Any script or backend tooling that can detect this renders the randomization moot. Given _any_ script _could_ be advanced and detect all randomizing, for all intents and purposes raising entropy can be treated the same as lowering entropy/buckets. (A sketch of one detection technique follows below.)
Besides fooling naive scripts, randomizing can make sense depending on the metric and usability. For example, in canvas rendering, subtle randomizing of some pixels can render a usable result for the user, versus rendering an unusable result to prevent averaging in Tor Browser (this is due to the threat model). Whereas randomizing the userAgent string wouldn't provide any extra benefit for the user (and may cause compatibility issues) versus restricting it to equivalency of platform, engine and version (all of which can't be protected).
Randomizing also
- adds code complexity (protecting the seed)
- is usually hard to keep benign
- incurs high maintenance costs
- exacts performance burdens
- carries risk, where poor implementation can (and numerous times in the past has) led to bypasses (not robust), averaging, or reversal of the protection (too subtle)
Unless there is a net benefit, such as usability, randomizing is better left unused.
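As mentioned, randomizing can be detected. A minimal sketch of one advanced-script technique: render the identical scene twice and compare the readbacks. (Per-session-stable noise defeats this exact test, but known-pixel checks or cross-party comparison still catch the deviation.)

```ts
// A deterministic (real or static) canvas hashes the same both times;
// a per-readback randomizer usually won't.
function draw(): string {
  const c = document.createElement("canvas");
  c.width = 200; c.height = 50;
  const ctx = c.getContext("2d")!;
  ctx.fillStyle = "#f60";
  ctx.fillRect(10, 10, 100, 30);
  ctx.font = "16px sans-serif";
  ctx.fillStyle = "#069";
  ctx.fillText("fingerprint, 😃", 12, 30); // emoji rendering also varies by platform
  return c.toDataURL(); // the readback that randomizers perturb
}

console.log(draw() !== draw() ? "randomizing detected" : "stable readback");
```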
This is a crowd. We are ignoring equivalency, which can't be hidden - that is, we know there are buckets caused by differences such as platform (Windows and Mac etc), language, screen sizes, etc. So ignore that. Now picture each Wally as a bucket of non-equivalency or non-base metrics. Scripts have to work harder to detect meaningful differences, and each Wally/bucket is hopefully shared by many users.
Fingerprint resistance (against advanced scripts) can only be achieved in a crowd, e.g. Tor Browser. Therefore, to test values, it only makes sense to test within the crowd - e.g. Tor Browser users. Similarly, analysis of collected data should be restricted to crowd users. Any other browser data is irrelevant. Tests should be robust, i.e. cover all methods and sources for the crowd. Tor Project assumes advanced scripts and worst outcomes.
In testing there are two dataset outcomes or objectives that are useful: the **buckets** (values), and the **entropy**. A dataset is not guaranteed to capture all possible buckets, as it is a random sample. The larger the dataset, the better the chance that more values, or the maximum buckets, are detected.
- each value only needs to be recorded once
- for protected metrics, any value outside expected results indicates protection is not working as expected
- different values can indicate where protection may be lacking
Examples: Tor Browser (buckets)
- navigator.language (singular) determines the locale, so a combined value that shows a mismatch, or any navigator.languages (plural) value not supported, would indicate a protection failure (see the sketch below)
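A hedged sketch of the kind of consistency test described above - illustrative only, not Tor Project's actual test code:

```ts
// With RFP-style protection, navigator.language, navigator.languages and
// the default Intl locale should all agree. A recorded value breaking
// that pattern points at a protection failure.
const single = navigator.language;          // e.g. "en-US"
const plural = navigator.languages;         // e.g. ["en-US", "en"]
const intlLocale = Intl.DateTimeFormat().resolvedOptions().locale;

const consistent =
  plural[0] === single &&
  intlLocale.startsWith(single.split("-")[0]);

console.log({ single, plural, intlLocale, consistent });
```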
- the number of occurrences of each value matters, as this determines the probabilities used to calculate entropy
- it is therefore important not to taint the dataset with repeat-user data
  - in studies and surveys, state tracking (e.g. an IP address and/or a cookie) is commonly utilized to help mitigate this, but becomes problematic due to sanitizing and IP address protections
- datasets need to be real-world, i.e. representative of reality and not unduly influenced by specific demographics or groups, such as privacy and fingerprinting enthusiasts
- datasets need to be large enough to make inferences about the crowd with confidence
- see "Test Sites > Entropy" below for examples
In other words, **_what is tested, how it's tested, how robustly it's tested_**, **_how that data is collected_**, and **_how much data is collected_** all matter.
So how can a resistance strategy, either overall or for an individual metric, know if its resistance is working, without large-scale datasets of its users and without any telemetry? It depends on the metric and resistance method, but for some, the effectiveness can be known or estimated with **some certainty**.
If it is "known" there is only one bucket, then the entropy in the crowd "is" zero (assuming no leaks or bugs, and that the tests are robust), and the protection cannot resist more.
Example (zero entropy): timezoneName
- everyone is timezone Atlantic/Reykjavik
- tested for, hardcoded, can't be overridden by prefs or the system, is **restricted** - meaning it's always "truthful" everywhere, including deterministic results in APIs such as Date and Intl
If it is "known" that the buckets (plural) are at the minimum possible (equivalency), then the protection cannot resist more, and entropy can only sometimes be estimated with any degree of certainty.
Example (equivalency): userAgent
- always one of four possible results, based on equivalency of platform: i.e. Windows, Mac, Linux or Android
- entropy can only be _estimated_, e.g. based on downloads per platform
If it is "hopeful" that the buckets are at the minimum possible...
Example (hopeful): font enumeration
- all users _should_ have the same fonts per platform
- but there may be protection gaps, as the browser cannot control the system's fonts or users removing fonts
- tests are not robust: therefore they do not show the full potential
- tests are not comprehensive, i.e. not measuring enough metrics
- tests may (and do) have bugs: e.g. not stable, not reproducible, not accurate
- tests are not universal between sites (so you can't compare)
For example, one site may report canvas as randomized, while another site reports a hash (and claims it's unique). Who do you believe? What did they test (offscreenCanvas, toDataURL, and a hundred other parameters)?
**IF YOU DON'T KNOW WHAT IS TESTED, HOW IT'S TESTED, OR HOW ROBUSTLY IT'S TESTED, THEN YOU CAN'T TRUST ANY VALUES**.
- if tests have bugs, are not comprehensive, or are not robust, then the entropy is meaningless
- the entropy is meaningless anyway, as the data is tainted
  - it is not real-world
  - it is not one result per user
  - the datasets are too small
Whilst population and/or demographics and/or other methods for estimation are not perfect comparisons to internet traffic, in these examples they are sufficient to prove the point.
Consider just how much each metric's probability (and entropy) is skewed, and be aware that each metric can affect many other metrics.
Language
- the language metric demonstrates how demographics are important, and how something so simple can _easily and inadvertently_ start (**call it 3x**) to taint datasets away from reality - something surveys and studies need (and try) to address and/or acknowledge
- test sites may be language-centric: e.g. EFF's Cover Your Tracks is English only (AFAICT), which will skew results
Firefox
- FF is marketed and known as a privacy-focused browser (and has ties with Tor Browser and RFP) - so it makes sense that this demographic is _a bit more interested_ (**call it 10x**) in testing than the average
timezoneName
- outside of real-world use, only Tor Browser, Mullvad Browser and Firefox's RFP use Atlantic/Reykjavik (a recent change in 2024) in any meaningful numbers
- this set of users is _highly interested_ (**call it 160x**) in testing (and re-testing and tweaking and re-testing)
- and keep in mind that this group (RFP users) will taint hundreds of other data points or metrics, due to the wide resistance built into RFP