Table of Contents
- Authors
- Introduction
- Goals
- API changes
- Data processing through a Secure Aggregation Service
- Privacy considerations
- Ideas for future iteration
- Considered alternatives
This document proposes extensions to our existing Attribution Reporting API that reports event-level data. The intention is to provide a mechanism for rich metadata to be reported in aggregate, to better support use-cases such as campaign-level performance reporting or conversion values.
Aggregatable attribution reports should support legitimate measurement use cases not already supported by the event-level reports. These include:
- Higher fidelity measurement of attribution-trigger (conversion-side) data, which is very limited in event-level attribution reports, including the ability to sum values rather than just counts
- A system which enables the most robust privacy protections
- Ability to receive aggregatable reports alongside event-level reports
- Ability to receive data at a faster rate than with event-level reports
- Greater flexibility to trade off attribution-trigger (conversion-side) data, reporting rate and accuracy.
Note: fraud detection (enabling the filtration of reports you are not expecting) is a goal but it is left out of scope for this document for now.
The client API for creating contributions to an aggregate report uses the same API base as for event-level reports, with a few extensions. In the following example, an ad-tech will use the API to collect:
- Aggregate conversion counts at a per-campaign level
- Aggregate purchase values at a per geo level
Registering sources eligible for aggregate reporting entails adding new aggregation_keys and aggregatable_expiry dictionary fields to the JSON dictionary of the Attribution-Reporting-Register-Source header:
{
... // existing fields, such as `source_event_id` and `destination`
"aggregation_keys": {
// Generates a "0x159" key piece (low order bits of the key) for the key named
// "campaignCounts".
"campaignCounts": "0x159", // User saw ad from campaign 345 (out of 511)
// Generates a "0x5" key piece (low order bits of the key) for the key named "geoValue".
"geoValue": "0x5" // Source-side geo region = 5 (US), out of a possible ~100 regions
},
"aggregatable_expiry": "[64-bit signed integer]"
}
This defines a dictionary of named aggregation keys, each with a piece of the aggregation key defined as a hex-string. The final histogram bucket key will be fully defined at trigger time using a combination (binary OR) of this piece and trigger-side pieces.
Final keys will be restricted to a maximum of 128 bits. This means that hex strings in the JSON must be limited to at most 32 digits.
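As a minimal sketch of the key construction described above, the following parses hex key pieces, enforces the 32-digit (128-bit) limit, and combines them with a bitwise OR. The helper names `parseKeyPiece` and `combineKeyPieces` are hypothetical and not part of the API; the browser's actual validation may differ.

```javascript
// Sketch only: parseKeyPiece and combineKeyPieces are hypothetical helper names.
const MAX_HEX_DIGITS = 32; // 128 bits / 4 bits per hex digit

// Parse a "0x..." hex string into a BigInt, rejecting anything over 128 bits.
function parseKeyPiece(hex) {
  const match = /^0x([0-9a-fA-F]+)$/.exec(hex);
  if (!match || match[1].length > MAX_HEX_DIGITS) {
    throw new RangeError(`invalid key piece: ${hex}`);
  }
  return BigInt(hex);
}

// Combine a source-side piece with any number of trigger-side pieces via bitwise OR.
function combineKeyPieces(sourcePiece, triggerPieces) {
  return triggerPieces.reduce(
    (key, piece) => key | parseKeyPiece(piece),
    parseKeyPiece(sourcePiece)
  );
}

// Campaign 345 (0x159) combined with conversion type 2 at a 9-bit offset (0x400).
const campaignKey = combineKeyPieces("0x159", ["0x400"]);
console.log("0x" + campaignKey.toString(16)); // 0x559
```

BigInt is used because 128-bit keys do not fit in a JavaScript Number without losing precision.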
aggregatable_expiry is an optional string field containing a signed 64-bit integer representing the number of seconds for which the source is eligible for aggregatable attribution. If omitted, the field defaults to the value specified in the expiry field.
Trigger registration will also add two new fields to the JSON dictionary of the Attribution-Reporting-Register-Trigger header:
{
... // existing fields, such as `event_trigger_data`
"aggregatable_trigger_data": [
// Each dict independently adds pieces to multiple source keys.
{
// Conversion type purchase = 2 at a 9 bit offset, i.e. 2 << 9.
// A 9 bit offset is needed because there are 511 possible campaigns, which
// will take up 9 bits in the resulting key.
"key_piece": "0x400",
// Apply this key piece to:
"source_keys": ["campaignCounts"]
},
{
// Purchase category shirts = 21 at a 7 bit offset, i.e. 21 << 7.
// A 7 bit offset is needed because there are ~100 regions for the geo key,
// which will take up 7 bits of space in the resulting key.
"key_piece": "0xA80",
// Apply this key piece to:
"source_keys": ["geoValue", "nonMatchingKeyIdsAreIgnored"]
}
],
"aggregatable_values": {
// Each source event can contribute a maximum of L1 = 2^16 to the aggregate
// histogram. In this example, use this whole budget on a single trigger,
// evenly allocating this "budget" across two measurements. Note that this
// will require rescaling when post-processing aggregates!
// 1 count = L1 / 2 = 2^15
"campaignCounts": 32768,
// Purchase was for $52. The site's max value is $1024.
// $1 = (L1 / 2) / 1024.
// $52 = 52 * (L1 / 2) / 1024 = 1664
"geoValue": 1664
}
}
The aggregatable_trigger_data field is a list of dictionaries, each of which generates aggregation keys.
The aggregatable_values field lists an amount of an abstract "value" to contribute to each key, which can be an integer in [1, 2^16]. These are attached to aggregation keys in the order they are generated. See the contribution budgeting section for more details on how to allocate these contribution values.
The scheme above will generate the following abstract histogram contributions:
[
// campaignCounts
{
key: 0x559, // = 0x159 | 0x400
value: 32768
},
// geoValue:
{
key: 0xA85, // = 0x5 | 0xA80
value: 1664
}]
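The derivation of these contributions can be sketched as follows. This is an illustration under stated assumptions, not the browser's implementation: `buildContributions` is a hypothetical name, filters are ignored, and inputs are assumed to be already validated.

```javascript
// Sketch only: buildContributions is a hypothetical name, and filters are ignored.
function buildContributions(sourceKeys, triggerData, values) {
  const contributions = [];
  for (const [name, sourcePiece] of Object.entries(sourceKeys)) {
    // Keys with no value at trigger time produce no contribution.
    if (!(name in values)) continue;
    let key = BigInt(sourcePiece);
    for (const { key_piece, source_keys } of triggerData) {
      // Names in source_keys that match no source key are never looked up.
      if (source_keys.includes(name)) key |= BigInt(key_piece);
    }
    contributions.push({ key, value: values[name] });
  }
  return contributions;
}

const contributions = buildContributions(
  { campaignCounts: "0x159", geoValue: "0x5" },
  [
    { key_piece: "0x400", source_keys: ["campaignCounts"] },
    { key_piece: "0xA80", source_keys: ["geoValue", "nonMatchingKeyIdsAreIgnored"] },
  ],
  { campaignCounts: 32768, geoValue: 1664 }
);
for (const { key, value } of contributions) {
  console.log(`0x${key.toString(16)} -> ${value}`);
}
// 0x559 -> 32768
// 0xa85 -> 1664
```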
Note: The filters field will still apply to aggregatable reports, and each dict in aggregatable_trigger_data can still optionally have filters applied to it just like for event-level reports.
Note: the above scheme was used to maximize the contribution budget and optimize utility in the face of constant noise. To rescale, simply invert the scaling factors used above:
L1 = 1 << 16
true_agg_campaign_counts = raw_agg_campaign_counts / (L1 / 2)
true_agg_geo_value = 1024 * raw_agg_geo_value / (L1 / 2)
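For illustration, the forward scaling applied at trigger time and its inverse can be written as a round trip. The function names are hypothetical, and the even two-way budget split and $1024 maximum are taken from the example above rather than being fixed by the API.

```javascript
const L1 = 1 << 16;          // total contribution budget per source event
const SHARE = L1 / 2;        // budget given to each of the two measurements
const MAX_PURCHASE = 1024;   // the site's maximum purchase value in dollars

// Scale a dollar amount into budget units before registering the trigger.
const scalePurchase = (dollars) => Math.round((dollars * SHARE) / MAX_PURCHASE);

// Undo the scaling on aggregates returned by the aggregation service.
const unscalePurchase = (raw) => (MAX_PURCHASE * raw) / SHARE;
const unscaleCount = (raw) => raw / SHARE;

console.log(scalePurchase(52));       // 1664
console.log(unscalePurchase(1664));   // 52
console.log(unscaleCount(3 * SHARE)); // 3
```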
Trigger registration will accept an optional field aggregatable_deduplication_key, which will be used to deduplicate multiple triggers containing the same aggregatable_deduplication_key for a single source.
{
...
"aggregatable_deduplication_key": "[unsigned 64-bit integer]"
}
The browser will create aggregatable reports for a source only if the trigger's aggregatable_deduplication_key has not already been associated with an aggregatable report for that source.
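A minimal sketch of this check, assuming the browser keeps a per-source set of deduplication keys it has already reported. The class and method names are hypothetical, and treating a missing key as "never deduplicated" is an assumption based on the field being optional.

```javascript
// Sketch only: the class and method names are hypothetical, not part of the API.
class AggregatableDedupStore {
  constructor() {
    this.seen = new Map(); // source ID -> Set of deduplication keys already reported
  }

  // Returns true if a report should be created, recording the key as a side effect.
  shouldCreateReport(sourceId, dedupKey) {
    // ASSUMPTION: triggers without a deduplication key are never deduplicated.
    if (dedupKey === undefined) return true;
    let keys = this.seen.get(sourceId);
    if (!keys) {
      keys = new Set();
      this.seen.set(sourceId, keys);
    }
    if (keys.has(dedupKey)) return false;
    keys.add(dedupKey);
    return true;
  }
}

const store = new AggregatableDedupStore();
console.log(store.shouldCreateReport("source-1", "42")); // true
console.log(store.shouldCreateReport("source-1", "42")); // false
```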
Note that aggregatable trigger registration is independent of event-level trigger registration.
Aggregatable reports will look very similar to event-level reports. They will be reported to the reporting origin at the path .well-known/attribution-reporting/report-aggregate-attribution.
The report itself does not contain histogram contributions in the clear. Rather, the report embeds them in an encrypted payload that can only be read by a trusted aggregation service known by the browser.
The report will be JSON encoded with the following scheme:
{
// Info that the aggregation services also need encoded in JSON
// for use with AEAD.
"shared_info": "{\"api\":\"attribution-reporting\",\"attribution_destination\":\"https://advertiser.example\",\"report_id\":\"[UUID]\",\"reporting_origin\":\"https://reporter.example\",\"scheduled_report_time\":\"[timestamp in seconds]\",\"source_registration_time\":\"[timestamp in seconds]\",\"version\":\"[api version]\"}",
// Support a list of payloads for future extensibility if multiple helpers
// are necessary. Currently only supports a single helper configured
// by the browser.
"aggregation_service_payloads": [
{
"payload": "[base64-encoded HPKE encrypted data readable only by the aggregation service]",
"key_id": "[string identifying public key used to encrypt payload]",
// Optional debugging information, if the cookie `ar_debug` is present.
"debug_cleartext_payload": "[base64-encoded unencrypted payload]",
},
],
// Optional debugging information (also present in event-level reports),
// if the cookie `ar_debug` is present.
"source_debug_key": "[64 bit unsigned integer]",
"trigger_debug_key": "[64 bit unsigned integer]"
}
Reports will not be delayed to the same extent as they are for event-level reports. The browser will delay them with a random delay of between 10 minutes and 1 hour, or with a small delay after the browser next starts up. The minimum 10-minute delay gives regretful users a chance to delete the reports. The browser is free to utilize techniques like retries to minimize data loss.
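As a rough sketch of the delay window, the following picks a delay between 10 minutes and 1 hour. The function name is hypothetical, and the uniform distribution is an assumption, since the text above only says the delay is random.

```javascript
const MIN_DELAY_MS = 10 * 60 * 1000; // 10 minutes
const MAX_DELAY_MS = 60 * 60 * 1000; // 1 hour

// ASSUMPTION: uniform over [10 min, 1 h). `random` is injectable to make testing easy.
function pickReportDelayMs(random = Math.random) {
  return MIN_DELAY_MS + Math.floor(random() * (MAX_DELAY_MS - MIN_DELAY_MS));
}

const delayMs = pickReportDelayMs();
console.log(`report scheduled in ${Math.round(delayMs / 60000)} minutes`);
```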
- The shared_info will be a serialized JSON object. This exact string is used as authenticated data for decryption, see below. The string therefore must be forwarded to the aggregation service unmodified. The reporting origin can parse the string to access the encoded fields.
- The api field is a string enum identifying the API that triggered the report. This allows the aggregation service to extend support to other APIs in the future.
- The scheduled_report_time will be the time at which the browser initially scheduled the report to be sent, expressed as the number of seconds since the Unix Epoch (1970-01-01T00:00:00Z, ignoring leap seconds, to align with DOMTimestamp). Using the scheduled time rather than the actual send time avoids noise from offline devices reporting late.
- The source_registration_time will represent (in seconds since the Unix Epoch) the time the source event was registered, rounded down to a whole day.
- The payload will contain the actual histogram contributions. It should be encrypted and then base64 encoded, see below.
Optional debugging fields are discussed below.
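A reporting origin handling these fields might look roughly like the sketch below. The helper names are hypothetical; the point is that the original shared_info string is kept untouched for forwarding, and parsed only for local inspection. The day-rounding helper mirrors how source_registration_time is described above.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Round a Unix timestamp in seconds down to a whole day.
function roundDownToDay(seconds) {
  return Math.floor(seconds / SECONDS_PER_DAY) * SECONDS_PER_DAY;
}

// Read fields from a report while keeping the exact shared_info string for forwarding.
function readSharedInfo(report) {
  const raw = report.shared_info; // forward this string unmodified
  const fields = JSON.parse(raw); // parse only for local inspection
  return { raw, fields };
}

// Hypothetical report containing only a few of the shared_info fields.
const report = {
  shared_info: JSON.stringify({
    api: "attribution-reporting",
    reporting_origin: "https://reporter.example",
    source_registration_time: String(roundDownToDay(1650000000)),
  }),
};
const { raw, fields } = readSharedInfo(report);
console.log(fields.api);                      // attribution-reporting
console.log(fields.source_registration_time); // 1649980800
```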
The payload should be a CBOR map encrypted via HPKE and then base64 encoded. The map will have the following structure:
// CBOR
{
"operation": "histogram", // Allows for the service to support other operations in the future
"data": [{
"bucket": <bucket, encoded as a 16-byte (i.e. 128-bit) big-endian bytestring>,
"value": <value, encoded as a 4-byte (i.e. 32-bit) big-endian bytestring>
}, ...]
}
Optionally, the browser may encode multiple contributions in the same payload; this is only possible if all other fields in the report/payload are identical for the contributions.
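The fixed-width big-endian encoding of bucket and value can be sketched as below. The helper names are hypothetical, and the CBOR serialization and HPKE encryption steps are deliberately left out.

```javascript
// Encode a non-negative BigInt as a fixed-width big-endian byte array.
function toBigEndianBytes(value, width) {
  if (value < 0n || value >= 1n << BigInt(8 * width)) {
    throw new RangeError(`value does not fit in ${width} bytes`);
  }
  const bytes = new Uint8Array(width);
  for (let i = width - 1; i >= 0; i--) {
    bytes[i] = Number(value & 0xffn); // lowest byte goes last
    value >>= 8n;
  }
  return bytes;
}

// Build the plaintext fields for one contribution (CBOR and HPKE steps not shown).
function toPayloadEntry(key, value) {
  return {
    bucket: toBigEndianBytes(key, 16),         // 128-bit bucket
    value: toBigEndianBytes(BigInt(value), 4), // 32-bit value
  };
}

const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");

const entry = toPayloadEntry(0x559n, 32768);
console.log(entry.bucket.length); // 16
console.log(toHex(entry.value));  // 00008000
```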
This encryption should use AEAD to ensure that the information in shared_info is not tampered with, since the aggregation service will need that information to do proper replay protection. The authenticated data will consist of the shared_info string (encoded as UTF-8) with a constant prefix added for domain separation, i.e. to avoid ciphertexts being reused for different protocols, even if public keys are shared.
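Constructing the authenticated data could look like the sketch below. The prefix value here is a placeholder chosen for illustration, since this document does not state the actual constant; the function name is likewise hypothetical.

```javascript
// ASSUMPTION: placeholder prefix; this document does not state the actual constant.
const DOMAIN_SEPARATION_PREFIX = "example-prefix";

// Prefix the exact shared_info string and encode the result as UTF-8 bytes.
function buildAuthenticatedData(sharedInfo) {
  return new TextEncoder().encode(DOMAIN_SEPARATION_PREFIX + sharedInfo);
}

const aad = buildAuthenticatedData('{"api":"attribution-reporting"}');
console.log(aad.length); // prefix length plus shared_info length, in bytes
```

Because the bytes are derived from the exact string, any re-serialization of shared_info (for example reordering fields) would produce different authenticated data and decryption would fail.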
The encryption will use public keys specified by the aggregation service. The browser will encrypt payloads just before the report is sent by fetching the public key endpoint with an un-credentialed request. The processing origin will respond with a set of keys which will be stored according to standard HTTP caching rules, i.e. using Cache-Control headers to dictate how long to store the keys for (e.g. following the freshness lifetime). The browser could enforce maximum/minimum lifetimes of stored keys to encourage faster key rotation and/or mitigate bandwidth usage. The scheme of the JSON encoded public keys is as follows:
{
"keys": [
{
"id": "[arbitrary string identifying the key (up to 128 characters)]",
"key": "[base64 encoded public key]"
},
// Optionally, more keys.
]
}
To limit the impact of a single compromised key, multiple keys (up to a small limit) can be provided. The browser should independently pick a key uniformly at random for each payload it encrypts to avoid associating different reports. Additionally, a public key endpoint should not reuse an ID string for a different key. In particular, IDs must be unique within a single response to be valid. In the case of backwards incompatible changes to this scheme (e.g. in future versions of the API), the endpoint URL should also change.
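A minimal sketch of the client-side key handling described above: reject responses with duplicate or over-long IDs, then pick a key uniformly at random per payload. The function names are hypothetical, and Math.random is used only to keep the sketch short; a real implementation would likely use a cryptographically secure source.

```javascript
// Validate a parsed public key response; IDs must be unique strings of at most 128 characters.
function validateKeys(response) {
  const ids = new Set();
  for (const { id } of response.keys) {
    if (typeof id !== "string" || id.length > 128) {
      throw new Error("invalid key id");
    }
    if (ids.has(id)) throw new Error(`duplicate key id: ${id}`);
    ids.add(id);
  }
  return response.keys;
}

// Pick one key uniformly at random, independently for each payload.
function pickKey(keys, random = Math.random) {
  return keys[Math.floor(random() * keys.length)];
}

const keys = validateKeys({
  keys: [
    { id: "key-1", key: "[base64 encoded public key]" },
    { id: "key-2", key: "[base64 encoded public key]" },
  ],
});
console.log(pickKey(keys).id); // "key-1" or "key-2"
```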
Note: The browser may need some mechanism to ensure that the same set of keys are delivered to different users.
If debugging is enabled, additional debug fields will be present in aggregatable reports. The source_debug_key and trigger_debug_key fields match those in the event-level reports. If both the source and trigger debug keys are set, there will be a debug_cleartext_payload field included in the report. It will contain the base64-encoded cleartext of the encrypted payload to allow downstream systems to verify that reports are constructed correctly. If both debug keys are set, the shared_info will also include the flag "debug_mode": "enabled" to allow the aggregation service to support debugging functionality on these reports.
Additionally, a duplicate debug report will be sent immediately (i.e. without the random delay) to a .well-known/attribution-reporting/debug/report-aggregate-attribution endpoint. The debug reports should be almost identical to the normal reports, including the additional debug fields. However, the payload ciphertext will differ due to repeating the encryption operation and the key_id may differ if the previous key had since expired or the browser randomly chose a different valid public key.
Each attribution can make multiple contributions to an underlying aggregate histogram, and a given user can trigger multiple attributions for a particular source / trigger site pair. Our goal in this section is to bound the contributions any source event can make to a histogram.
This bound is characterized by a single parameter: L1, the maximum sum of the contributions (values) across all buckets for a given source event. L1 refers to the L1 sensitivity / norm of the histogram contributions per source event.
Exceeding this limit will cause future contributions to be silently dropped. While exposing failure in any kind of error interface can be used to leak sensitive information, we might be able to reveal aggregate failure results via some other monitoring side channel in the future.
For the initial proposal, set L1 = 65536. Note that for privacy, this parameter can be arbitrary, as noise in the aggregation service will be scaled in proportion to this parameter. In the example above, the budget is split equally between two keys, one for the number of conversions per campaign and the other representing the conversion dollar value per geography. This budgeting mechanism is highly flexible and can support many different aggregation strategies as long as the appropriate scaling is performed on the outputs.
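The per-source bound can be sketched as a running total checked against L1. The class and method names are hypothetical, and dropping a trigger's contributions as a whole when they do not fit (rather than partially) is an assumption made for this illustration.

```javascript
const L1_BUDGET = 65536;

// Sketch only: tracks the sum of contribution values already used per source event.
class ContributionBudget {
  constructor() {
    this.used = new Map(); // source ID -> total value contributed so far
  }

  // Returns true and records usage if the values fit; otherwise drops them silently.
  // ASSUMPTION: a trigger's contributions are accepted or dropped as a whole.
  tryContribute(sourceId, values) {
    const requested = values.reduce((sum, v) => sum + v, 0);
    const used = this.used.get(sourceId) ?? 0;
    if (used + requested > L1_BUDGET) return false;
    this.used.set(sourceId, used + requested);
    return true;
  }
}

const budget = new ContributionBudget();
console.log(budget.tryContribute("source-1", [32768, 1664])); // true
console.log(budget.tryContribute("source-1", [32768]));       // false, would exceed L1
```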
The browser may apply storage limits in order to prevent excessive resource usage.
Strawman: There should be a limit of 1024 pending aggregatable reports per destination site.
Note: The storage limits for event-level and aggregatable reports are enforced independently of each other.
The exact design of the service is not specified here. We expect to have more information on the data flow from reporter → processing origins shortly, but what follows is a high-level summary.
As the browser sends individual aggregatable reports to the reporting origin, the reporting origin organizes them into batches. They can send these batches to the aggregation service origin specified in the report.
The aggregation service will aggregate reports within a certain batch, and respond back with an aggregate histogram, i.e. a list of keys with associated aggregate values. It is expected that as a privacy protection mechanism, a certain amount of noise will be added to each output key's aggregate value.
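Since the service design is not specified here, the following is only a conceptual sketch of summing decrypted contributions per bucket and adding noise to each output. The function name and the pluggable `noise` parameter are hypothetical; decryption and replay protection are not shown.

```javascript
// Sum contribution values per bucket, then add noise to each output bucket.
// `noise` defaults to zero so the summation can be inspected on its own.
function aggregateBatch(contributions, noise = () => 0) {
  const sums = new Map();
  for (const { bucket, value } of contributions) {
    sums.set(bucket, (sums.get(bucket) ?? 0) + value);
  }
  const histogram = new Map();
  for (const [bucket, sum] of sums) {
    histogram.set(bucket, sum + noise());
  }
  return histogram;
}

const histogram = aggregateBatch([
  { bucket: 0x559n, value: 32768 },
  { bucket: 0x559n, value: 32768 },
  { bucket: 0xa85n, value: 1664 },
]);
console.log(histogram.get(0x559n)); // 65536
console.log(histogram.get(0xa85n)); // 1664
```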
This proposal introduces a new set of reports to the API. Alone they do not add much meaningful cross-site information, so they are fairly benign. However, they contain encrypted payloads which allow aggregate histograms to be computed.
These histograms should be protected with various techniques with a trusted server system. For example, it is expected that the histograms will be subject to noise proportional to the L1 budget. Additionally, most rate-limits (except for the maximum number of reports per source) used for event-level reports will also be enforced for aggregatable reports, which limit the total amount of information that can be sent out for any one user.
Servers will need to be implemented such that browsers can trust them with sensitive cross-site data. There are various technologies and techniques that could be employed (e.g. Trusted Execution Environments, Multi-party-computation, audits, etc) that could satisfy browsers that data is safely aggregated and the output maintains proper privacy.
A goal of this work is to have a framework which can support differentially private aggregate measurement. In principle this can be achieved if the aggregation service adds noise proportional to the L1 budget, e.g. noise drawn from a Laplace distribution with scale L1 / epsilon should achieve epsilon differential privacy. With small enough values of epsilon, reports for a given source will be well-protected in an aggregate release.
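For intuition, the standard Laplace mechanism with sensitivity L1 draws noise with scale L1 / epsilon. The sketch below samples it by inverting the CDF. The function names are hypothetical, and naive floating-point sampling like this is known to be unsuitable for a production differential privacy implementation; it is shown only to make the scaling concrete.

```javascript
// Sample from Laplace(0, scale) by inverse CDF, using u uniform in (-0.5, 0.5).
function sampleLaplace(scale, random = Math.random) {
  let r = random();
  while (r === 0) r = random(); // avoid u = -0.5, which would give -Infinity
  const u = r - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Noise for a histogram whose per-source L1 sensitivity is `l1`.
function noiseForBudget(l1, epsilon, random = Math.random) {
  return sampleLaplace(l1 / epsilon, random);
}

// Doubling epsilon halves the noise scale; doubling L1 doubles it.
console.log(noiseForBudget(65536, 1));
```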
Note: there are a few caveats about a formal differential privacy claim:
- In the current design, the number of encrypted reports is revealed to the reporting origin in the clear without any noise. See [Hide the true number of attribution reports](#hide-the-true-number-of-attribution-reports).
- The scope of privacy in the current design is not user-level, but per-source. See More advanced contribution bounding for follow-up work exploring tightening this.
- Our plan is to adjust the level of noise added based on feedback during the origin trial period, and our goal with this initial version is to create a foundation for further exploration into formally private methods for aggregation.
Various rate limits outlined in the event-level explainer should also apply to aggregatable reports. The limits should be shared across all types of reports.
At trigger time, we could have a worklet-style API that allows passing in an arbitrary string of “trigger context” that specifies information about the trigger event (e.g. high fidelity data about a conversion). From within the worklet, code can access both the source and trigger context in the same function to generate an aggregate report. This allows for more dynamic keys than a declarative API (like the existing HTTP-based triggering), but disallows exfiltrating sensitive cross-site data out of the worklet.
The worklet is used to generate histogram contributions, which are key-value pairs of integers. Note that there will be some maximum number of keys (e.g. 2^128 keys).
The following code triggers attribution by invoking a worklet.
await window.attributionReporting.worklet.addModule(
"https://reporter.example/convert.js");
// The first argument should match the origin of the module we are invoking, and
// determines the scope of attribution similar to the existing HTTP-based API,
// i.e. it should match the "attributionreportto" attribute.
// The last argument needs to match what AggregateAttributionReporter uses upon
// calling registerAggregateReporter
window.attributionReporting.triggerAttribution("https://reporter.example",
<triggerContextStr>, "my-aggregate-reporter");
Internally, the browser will look up to see which source should be attributed, similar to how attribution works in the HTTP-based API. Note here that only a single source will be matched.
Here is convert.js, which crafts an aggregate report.
class AggregateAttributionReporter {
  // attributionSourceContext set as "<campaignid>,<geoid>"
  processAggregate(triggerContext, attributionSourceContext, sourceType) {
    const [campaign, geo] = attributionSourceContext.split(',').map(
        x => parseInt(x, 10));
    const purchaseValue = parseInt(triggerContext, 10);
    // One contribution per key, each carrying the purchase value.
    const histogramContributions = [
      {key: campaign, value: purchaseValue},
      {key: geo, value: purchaseValue},
    ];
    return {
      histogramContributions: histogramContributions,
    };
  }
}
// Bound classes will be invoked when an attribution triggered on this document
// is successfully attributed to a source whose reporting origin matches the
// worklet origin.
registerAggregateReporter("my-aggregate-reporter", AggregateAttributionReporter);
This worklet approach provides greatly enhanced flexibility at the cost of complexity. It introduces a new security / privacy boundary, and there are several edge cases that must be handled carefully to avoid data loss (e.g. the document being destroyed while the worklet is processing, unless a keepalive-style mode for worklets is introduced). These issues must be solved before this design could be considered.
The worklet based scheme possibly allows for more flexible attribution options, including specifying partial “credit” for multiple previous attribution sources that would provide value to advertisers that are interested in attribution models other than last-touch.
We should be careful in allowing reports to include cross site information from multiple sites, as it could increase the risk of cross site tracking.
The presence or absence of an attribution report leaks some potentially sensitive cross-site data in the current design. Therefore, revealing the total count of reports to the reporting origin could leak something sensitive as well (imagine if the reporting origin only ever registered a conversion or impression for a single user).
To hide the true number of reports, we could:
- Unconditionally send a null report for every registered attribution trigger (thus making the count a function of only destination-side information)
- Add noise to the number of reports by having some clients randomly add noisy null reports. This technique would have to assume some threshold number of unattributed triggers to maintain privacy.
We might want to consider bounding contribution at levels other than the source event level in the future for more robust privacy protection. For instance, we could bound user contributions per trigger event, or even by {source site, destination site, day} for a more user-level bound.
This would likely come at the cost of some utility and complexity, as budgeting across multiple source events may not be straightforward.
Additionally, there are more sophisticated techniques that can optimize utility and privacy if we bound more than just the L1 norm of the aggregate histogram. For instance, we could impose a stricter Linf bound (i.e. bounding the contribution to any one bucket). Care should be taken to ensure that either:
- A proper compromise is met across various use-cases
- We can support multiple types of contribution bounding for different reporting origins without introducing privacy leaks
See issue 249 for more details.
The server can add an optional alternative_aggregation_mode string field:
Attribution-Reporting-Register-Source: {..., "aggregation_keys": ..., "alternative_aggregation_mode": "experimental-poplar"}
The optional field will allow developers to choose among different options for aggregation infrastructure supported by the user agent. This value will allow experimentation with new technologies, and allows us to try out new approaches without interfering with core functionality provided by the default option. The "experimental-poplar" option will implement a protocol similar to poplar VDAF in the PPM Framework.
There are some use-cases which require something close to binary input (i.e. counting conversions), and others which require summing in some discretized domain (e.g. summing conversion value).
For simplicity in this API we are treating these exactly the same. Count-based approaches could do something like submitting two possible values, 0 for 0 and MAX_VALUE for 1, and consider the large space to be just a discretized domain of fractions between 0 and 1.
This has the benefit of keeping the aggregation infrastructure generic and avoids the need to “tag” different reports with whether they represent a coarse-grained or fine-grained value.
In the end, we will use this MAX_VALUE to scale noise via computing the sensitivity of the computation, so submitting “1” for counts will yield more noise than otherwise expected.
A binary report format like CBOR could streamline AEAD authentication by passing raw bytes directly to the reporting origin, which could be passed through directly to an aggregation service. This design would avoid parsing / serialization errors in constructing authenticated data necessary to decrypt the payload.
However, binary formats are not as familiar to developers, so there is an ergonomics tradeoff to be made here.