privacy-engineering-glossary

Defining and exploring privacy computation concepts and technologies.

A

`Aggregation`

A statistical method to combine a collection of raw data and output it as a total or summary of that data. Useful for analytics and research. Offers a modicum of privacy protection, but still susceptible to re-identification attacks.

Read more: What Is Data Aggregation?
Related: Re-identification
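
As a minimal sketch of the idea, the snippet below groups some made-up records and reports only a count and a mean per group. The `records` data and the `aggregate_by_group` helper are hypothetical names introduced for illustration.

```python
from statistics import mean

# Hypothetical raw records: (department, salary)
records = [("eng", 100), ("eng", 120), ("ops", 90)]

def aggregate_by_group(rows):
    """Collapse raw rows into a per-group count and mean."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return {k: {"count": len(v), "mean": mean(v)} for k, v in groups.items()}

summary = aggregate_by_group(records)
```

Note that the "ops" group contains a single record, so its "mean" is that person's exact salary. Small groups like this are one reason aggregation alone remains open to re-identification.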

`Algorithmic fairness`

The notion that algorithms should make decisions without "bias", ensuring equitable treatment across different demographic groups in a dataset.

Read more: What is Algorithm Fairness?

`Anonymization`

The "process by which personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party" (ISO 25237:2017).

Data is erased or overwritten absolutely in an attempt to delink it from an individual.
Related: De-identification, Pseudonymization

`Authorization (AuthZ)`

A set of permissions for an authenticated user.

Mechanisms include: Role-based access control (RBAC), Policy-based access control (PBAC)
Read more: Authn vs. authz: How are they different?
Related: Authentication (AuthN)
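
A rough sketch of role-based access control, assuming the user has already been authenticated. The role names, permission strings, and the `is_authorized` helper are hypothetical and only illustrate the lookup.

```python
# Hypothetical role and user tables for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"read:reports"},
    "admin": {"read:reports", "delete:records"},
}
USER_ROLES = {"alice": {"admin"}, "bob": {"analyst"}}

def is_authorized(user, permission):
    """Return True if any of the user's roles grants the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

Unknown users and unknown roles fall through to an empty set, so the check denies by default.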

`Authentication (AuthN)`

Ensuring a person (or system) is who they claim they are by verifying their identity.

Mechanisms include: Asymmetric/public-key cryptography (certificate exchange), username/password login, biometrics, etc.
Related: Authorization (AuthZ)
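
One possible sketch of the username/password mechanism, using only the Python standard library. The iteration count and helper names are assumptions for this example, not a recommendation for production settings.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch

def hash_password(password, salt=None):
    """Derive a salted digest so the plaintext password is never stored."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password("correct horse")
```

A login attempt would then call `verify_password(attempt, salt, stored)` and proceed to authorization checks only if it returns `True`.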

`Availability`

Data is readily served upon request without delay or downtime.
Read more: What is the CIA Triad and Why is it important?
Related: Confidentiality, Integrity, CIA triad

B

`Background knowledge attack`

A type of attack against k-anonymity where an adversary uses external information to infer sensitive data about individuals from anonymized datasets.

Read more:
- Background knowledge attacks in privacy-preserving data publishing models
- ℓ-Diversity: Privacy Beyond k-Anonymity
Related: K-anonymity, Re-identification

C

`CIA triad`

Confidentiality, Integrity, Availability. An archetype of cybersecurity.

Read more: What is the CIA Triad and Why is it important?
Related: Confidentiality, Integrity, Availability

`Confidentiality`

Data is protected so as to prevent any unauthorized access, "whether intentional or accidental".

Read more: What is the CIA Triad and Why is it important?
Related: CIA triad, Integrity, Availability

`Consent management platform (CMP)`

A data governance tool to obtain, record, map, and manage user consent for data collection and processing in compliance with privacy regulations.

Read more: What is Consent Management Platform (CMP) & Why Do You Need It?

D

`Data classification`

A hierarchy or class system used to assign risk or sensitivity levels to data, where higher levels indicate higher risk or sensitivity (and potentially prescribe stricter controls). The most popular example is the US government's information classification system, whose levels are "Confidential", "Secret", and "Top Secret".

Read more:
- How Are US Government Documents Classified?
- Data Classification

`Data lineage`

A complete history of the lifecycle of a data point, from beginning to end. This includes all transformations, movement, and other operations performed on the data.

Lineage (n): Descent in a line from a common progenitor

Read more:
- What is Data Lineage and Data Provenance? Quick Overview
- Data Lineage tools and its Best Practice | Complete Guide
Related: Data provenance

`Data minimization`

The principle of limiting data collection to only what is necessary for a specific purpose, reducing privacy "attack surface".

Read more: Data minimization: An increasingly global concept

`Data provenance`

A high-level overview of "where the data comes from."

Provenance (n): Place of origin; derivation.

Read more:
- What is Data Provenance?
- What is Data Lineage and Data Provenance? Quick Overview
Related: Data lineage

`Data quality`

The relative accuracy, completeness, reliability, and relevance of data. Data quality matters for both compliance and analytics, and it is important to avoid storing "stale" data.

Read more: Understanding Data Quality

`Data tagging`

The process of labeling data with metadata to enhance its organization, retrieval, and compliance with privacy regulations.

Tags are distinct, machine-readable inputs used to filter data across systems. They are usually represented as, or similar to, a regular expression.
Read more: Data Tagging: What You Need to Know
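
To illustrate the regular-expression style of tagging, here is a small sketch. The tag names and patterns are hypothetical and deliberately simple; real detection rules would need to be broader and tested against real data.

```python
import re

# Hypothetical tag rules: tag name -> pattern that marks a value as that type.
TAG_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def tag_value(value):
    """Return the sorted list of tags whose pattern matches the value."""
    return sorted(name for name, pattern in TAG_PATTERNS.items() if pattern.search(value))
```

For example, `tag_value("contact: jane@example.com")` would be labeled with the `email` tag, which downstream systems could use to filter or restrict the field.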

`De-identification`

The general process of removing or altering personal information from a dataset to prevent the identification of individuals, while still allowing for useful data analysis.

Read more: De-identification: A Primer
Related: Anonymization

`Delta (δ) presence`

A metric that quantifies the probability that a specific individual is present in a dataset, typically by comparing the dataset against a larger population it was drawn from.

"This is slightly different than re-identification risk in that the goal is not to find which exact record corresponds to which individual, only to know whether an individual is part of the dataset."
Read more: About δ-presence

`Deterministic algorithm`

An algorithm that "always returns the same result given the same input parameters in the same state" of a dataset.

Read more:
- Deterministic function in MySQL
- Difference between Deterministic and Non-deterministic Algorithms
Related: Non-deterministic algorithm
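
A small sketch contrasting the two behaviors. Both function names are hypothetical; the first always maps the same input to the same output, while the second draws fresh randomness on every call.

```python
import hashlib
import secrets

def deterministic_token(value):
    """Same input always yields the same output."""
    return hashlib.sha256(value.encode()).hexdigest()

def random_token(value):
    """Ignores prior calls: each invocation yields a fresh random output."""
    return secrets.token_hex(16)
```

The distinction matters for privacy: deterministic outputs can be joined across datasets, which is useful for analytics but also makes linkage easier.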

`Differential privacy`

A privacy-preserving mathematical framework to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying individual entries in the dataset by introducing noise.

Read: A Survey of Differential Privacy Frameworks
Watch: What is Differential Privacy?
Related: Privacy-enhancing technology (PET)
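
One common way to introduce the noise is the Laplace mechanism, sketched below for a counting query. This is an illustrative example under stated assumptions (a count has sensitivity 1, so the noise scale is `1 / epsilon`), not a vetted implementation. The `ages` data and function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(rows, predicate, epsilon, rng=None):
    """Noisy count: adding or removing one row changes the true count by at most 1."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1 / epsilon, rng)

ages = [23, 35, 41, 52, 67, 70]
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5)
```

A smaller `epsilon` means more noise and stronger privacy, while a larger `epsilon` means more accurate answers. Production systems should use an audited library rather than floating-point sampling like this.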

`Downcoding attack`

An attack against "quasi-identifier-based deidentification techniques (QI-deidentification)[,] including k-anonymity, l-diversity, and t-closeness."

F

`Federated learning`

A decentralized machine learning approach that enables disparate data sources (nodes) to collaboratively train a central model, without having training data ever leave any data source or be sent to the central (federated) server.
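
A toy sketch of the idea, assuming a one-parameter model `y = w * x` and simple weighted averaging of locally updated weights. The node data, learning rate, and function names are hypothetical. Only the weights travel to the server, never the `(x, y)` pairs.

```python
def local_update(weight, data, learning_rate=0.05):
    """One gradient-descent step on a node's private data (mean squared error)."""
    grad = sum(2 * x * (weight * x - y) for x, y in data) / len(data)
    return weight - learning_rate * grad

def federated_round(global_weight, node_datasets):
    """Each node trains locally; the server averages weights by sample count."""
    total = sum(len(data) for data in node_datasets)
    updates = [(local_update(global_weight, data), len(data)) for data in node_datasets]
    return sum(w * n for w, n in updates) / total

# Hypothetical private datasets held by two nodes, both following y = 2x.
nodes = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
weight = 0.0
for _ in range(50):
    weight = federated_round(weight, nodes)
```

After the rounds complete, `weight` approaches 2, the slope shared by both nodes' data. Real deployments often add protections such as secure aggregation, since model updates themselves can leak information.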

H

`Homogeneity attack`

An attack on a k-anonymous dataset where the attacker exploits "groups that leak information due to lack of diversity in the sensitive attribute." To address this, a sanitized table should be “diverse”, where "all tuples that share the same values of their quasi-identifiers should have diverse values for their sensitive attributes."

Read more: ℓ-Diversity: Privacy Beyond k-Anonymity
Related: K-anonymity, Background knowledge attack

`Homomorphic encryption`

A "method to encrypt data and perform operations on it" without having to decrypt the data. Particular use cases include financial, health, and other environments with highly sensitive data.

There are several types of homomorphic encryption (partially, somewhat, and fully homomorphic).
A popular example is the asymmetric algorithm RSA, which is partially homomorphic.
Read more:
- What is Homomorphic Encryption?
- Homomorphic Encryption: How It Works
Related: Privacy-enhancing technology (PET)
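
The partially homomorphic property of RSA can be sketched with unpadded "textbook" RSA and tiny primes. The parameters below are assumptions chosen for readability and are completely insecure. Multiplying two ciphertexts yields a ciphertext of the product of the plaintexts.

```python
# Toy parameters for illustration only: never use primes this small.
P_PRIME, Q_PRIME = 61, 53
N = P_PRIME * Q_PRIME
PHI = (P_PRIME - 1) * (Q_PRIME - 1)
E = 17
D = pow(E, -1, PHI)  # modular inverse of E (Python 3.8+)

def encrypt(message):
    return pow(message, E, N)

def decrypt(ciphertext):
    return pow(ciphertext, D, N)

# Multiply ciphertexts without ever decrypting the inputs.
product_ciphertext = (encrypt(7) * encrypt(6)) % N
```

Decrypting `product_ciphertext` gives 42, the product of the two plaintexts. RSA as deployed in practice uses padding, which intentionally breaks this property.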

I

`Insecure direct object reference (IDOR)`

A "vulnerability that arises when attackers can access or modify objects by manipulating identifiers used in a web application's URLs or parameters." Poor access controls fail to verify if a user is authorized to access data (such as that of another user).

Read more: Insecure Direct Object Reference Prevention Cheat Sheet
Related: CIA triad, Integrity, Availability
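
A minimal sketch of the flaw and one possible fix, using a hypothetical in-memory `INVOICES` store. The vulnerable handler trusts the identifier from the request, while the checked handler also verifies ownership.

```python
# Hypothetical data store keyed by a guessable identifier.
INVOICES = {
    101: {"owner": "alice", "total": 40},
    102: {"owner": "bob", "total": 75},
}

def get_invoice_vulnerable(invoice_id):
    """IDOR: any caller can read any invoice by changing the ID."""
    return INVOICES[invoice_id]

def get_invoice_checked(current_user, invoice_id):
    """Verify that the authenticated user owns the object before returning it."""
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["owner"] != current_user:
        raise PermissionError("not found or not authorized")
    return invoice
```

Returning the same error for "missing" and "not yours" avoids confirming which identifiers exist.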

`Integrity`

Data is free from corruption, manipulation, or other unknown modifications. Data is "authentic, accurate, and reliable".

Read more: What is the CIA Triad and Why is it important?
Related: Availability, CIA triad, Confidentiality

K

`K-anonymity`

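"A property of anonymized data" with "scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful." A dataset is k-anonymous when every record is indistinguishable from at least k-1 other records with respect to its quasi-identifiers.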

Read more:
Related: Privacy-enhancing technology (PET)
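
As a sketch, the k of a table can be measured as the size of its smallest group of records sharing the same quasi-identifier values. The `table` contents and the `k_anonymity` helper are hypothetical.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class (0 for an empty table)."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values(), default=0)

# Hypothetical generalized table: ZIP codes truncated, ages bucketed.
table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]
k = k_anonymity(table, ["zip", "age"])
```

Here `k` is 2: each combination of `zip` and `age` appears in two rows, so each record blends in with one other.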

`K-map`

A similar approach to k-anonymity, "except that it assumes that the attacker most likely doesn't know who is in the dataset."

Read more:
Related: Privacy-enhancing technology (PET)

L

`ℓ-diversity`

An attempt to address k-anonymity attacks (such as homogeneity and background knowledge attacks), which "attempts to measure how much an attacker can learn about people in terms of k-anonymity and equivalence classes".

Read more:
- About l-diversity
- ℓ-Diversity: Privacy Beyond k-Anonymity
Related: Privacy-enhancing technology (PET)
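
A sketch of the simplest variant ("distinct" ℓ-diversity), which counts the distinct sensitive values in each equivalence class and reports the minimum. The `table` data and the `l_diversity` helper are hypothetical.

```python
from collections import defaultdict

def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the fewest distinct sensitive values found in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].add(row[sensitive])
    return min((len(values) for values in classes.values()), default=0)

# Hypothetical 2-anonymous table.
table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]
l_value = l_diversity(table, ["zip", "age"], "diagnosis")
```

Here `l_value` is 1: everyone in the "40-49" class has the same diagnosis, so knowing that a person falls in that class reveals their diagnosis even though the table is 2-anonymous.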

N

`Non-deterministic algorithm`

An algorithm that "does not necessarily always return the same result given the same input parameters in the same state" of a dataset. (Stack Overflow)

Read more: Difference between Deterministic and Non-deterministic Algorithms
Related: Deterministic algorithm

P

`Privacy by design`

A collection of data privacy principles proposed by Dr. Ann Cavoukian to take a "proactive approach to privacy that emphasises the need to incorporate data protection practices into projects and decisions at the outset, rather than as an afterthought."

Read more: Privacy by Design The 7 Foundational Principles
Related: Software development lifecycle

`Privacy-enhancing technology (PET)`

An assortment of tools that enable businesses to comply with privacy regulations while preserving the individual privacy and utility of their data sets, for purposes such as analytics, development, sharing, etc.

Popular PETs: Homomorphic encryption, de-identification (k-anonymity), differential privacy, federated learning, secure multiparty computation, private set intersection, synthetic data, zero knowledge proofs, trusted execution environments
Read more:
- What are Privacy-Enhancing Technologies (PETs)? Types & Selection Guide
- What Are Privacy-Enhancing Technologies (PETs) And How You Can Choose the Right One(s)

`Private set intersection (PSI)`

Also called "double encryption". A type of secure multiparty computation "in which each party has a set of items and the goal is to learn the intersection of those sets while revealing nothing else about those sets."

Read more: A Brief Overview of Private Set Intersection
Related: Privacy-enhancing technology (PET), Secure multi-party computation
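
The "double encryption" idea can be sketched with commutative exponentiation: each party blinds hashed items with its own secret exponent, the other party blinds them again, and only doubly blinded values are compared. This is a toy simulation that runs both parties in one process, and the modulus and helper names are assumptions for illustration. It is not a secure or complete protocol.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy group, not production parameters

def hash_to_group(item):
    """Map an item to an integer in the range 1..P-1."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_secret():
    """Pick an exponent coprime to P-1 so that blinding is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def blind(values, secret):
    return [pow(v, secret, P) for v in values]

def intersect(items_a, items_b):
    a, b = new_secret(), new_secret()                        # each party's private key
    a_once = blind([hash_to_group(x) for x in items_a], a)   # sent from A to B
    b_once = blind([hash_to_group(x) for x in items_b], b)   # sent from B to A
    a_twice = blind(a_once, b)       # B re-blinds A's values, preserving order
    b_twice = set(blind(b_once, a))  # A re-blinds B's values
    return {item for item, v in zip(items_a, a_twice) if v in b_twice}
```

Because `(h^a)^b` equals `(h^b)^a` modulo `P`, matching items end up with identical doubly blinded values, while non-matching items stay hidden behind the other party's exponent.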

`Pseudonymization (tokenization)`

A form of de-identification where personal identifiers are replaced with placeholder values (or "tokens"). Unlike anonymization, pseudonymization is reversible: the original data is not destroyed and can still be linked back to an individual.

Read more:
- What is pseudonymization?
- Pseudonymization
Related: Privacy-enhancing technology (PET)
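
A minimal sketch of reversible tokenization, assuming an in-memory mapping. The `TokenVault` class and the `tok_` prefix are hypothetical; a real vault would be a separately secured, access-controlled store.

```python
import secrets

class TokenVault:
    """Keeps the token <-> value mapping that makes pseudonymization reversible."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return a stable random token for the value, creating one if needed."""
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        """Reverse the mapping; only holders of the vault can do this."""
        return self._value_by_token[token]
```

Anyone with access to the vault can call `detokenize`, which is why pseudonymized data is generally still treated as personal data.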

R

`Re-identification`

An attack to identify individuals in an "anonymized" dataset using external information and computing techniques to link individuals to their "de-identified" personal information.
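
A sketch of a simple linkage attack, joining a "de-identified" release with a public dataset on shared quasi-identifiers. All records and names below are fabricated for illustration.

```python
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

# Hypothetical "de-identified" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "02139", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
]
# Hypothetical public dataset (for example, a voter roll).
public = [
    {"name": "Jane Roe", "zip": "02139", "birth_year": 1985, "sex": "F"},
    {"name": "John Doe", "zip": "02139", "birth_year": 1990, "sex": "M"},
    {"name": "Jim Poe", "zip": "02139", "birth_year": 1990, "sex": "M"},
]

def key(row):
    return tuple(row[q] for q in QUASI_IDENTIFIERS)

def link(released_rows, public_rows):
    """Re-identify released rows whose quasi-identifiers match exactly one person."""
    names_by_key = {}
    for person in public_rows:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = {}
    for row in released_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:
            matches[names[0]] = row["diagnosis"]
    return matches

reidentified = link(released, public)
```

Only "Jane Roe" is uniquely matched here. The two men share the same quasi-identifiers, so the second released record cannot be pinned to one of them by this join alone.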

S

`Secure multi-party computation (S/MPC)`

Also known as "privacy-preserving computation". [Any] "cryptographic protocol that distributes a computation across multiple parties where no individual party can see the other parties’ data."

Read more: What is Secure Multiparty Computation
Related: Privacy-enhancing technology (PET), Private set intersection (PSI)
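
One building block often used in such protocols is additive secret sharing, sketched below for a joint sum. The modulus `Q` and the function names are assumptions for illustration, and all parties are simulated in one process.

```python
import secrets

Q = 2**61 - 1  # assumed modulus; all arithmetic is done mod Q

def share(value, n_parties):
    """Split a value into n random-looking shares that sum to it mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def secure_sum(private_values):
    """Each party shares its input; party j only ever sees the j-th shares."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % Q for j in range(n)]
    return sum(partial_sums) % Q
```

For example, `secure_sum([50, 70, 30])` recovers the total of 150, while any single share or partial sum on its own looks uniformly random.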

`Scream test`

A test to determine whether access to a resource (such as personal information or a server) is still necessary by shutting off access to it entirely. If nobody screams bloody murder, the resource was not really needed. The test can be applied to enforce data minimization.
Read more: Microsoft uses a scream test to silence its unused servers

`Software development lifecycle (SDLC)`

The process that an organization follows to design, develop, test, deploy, and maintain software. "Shifting privacy left" is the idea of ingraining privacy earlier in the lifecycle, starting at product ideation and requirements drafting, in order to achieve privacy by design.

Read more: Integrating Privacy Practices in Software Development Lifecycle
Related: Privacy by design

`Structured data`

Organized and machine-readable data that can be queried programmatically, such as a relational (SQL) database.
Read more: Structured vs. Unstructured Data: What’s the Difference?
Related: Unstructured data

`Static code analysis`

A process to scan the source code of a product to identify personal data to understand how and where it is collected and processed throughout systems.

Read more:
- The case for static code analysis for privacy
- Privacy code scanning: How to sync privacy compliance with software development

`Synthetic data`

Data, generated by artificial intelligence, that mathematically imitates real-world personal data, which enables organizations to use "privacy-compliant, production-like, and long-retention" data for analytics.
Read more: What is Synthetic Data?
Watch: PEPR '24 - Compute Engine Testing with Synthetic Data Generation

T

`Threat modeling`

A process (originally from cybersecurity) to identify and understand threats to a system and their mitigations. In the context of privacy, this relates to threats to the personal information in a system and to the data subjects it describes.

`Trusted execution environment (TEE)`

A segregated safe zone within a CPU where only signed code within the environment can be loaded, and all code is "processed in the clear but is only visible in encrypted form when anything outside tries to access it." This ensures that "even if a system is compromised, the data within the TEE remains secure."

U

`Unstructured data`

Unorganized data with no pre-defined structure or pattern, which makes it difficult (but not impossible) to reliably process and analyze programmatically. "More than 80%" of data on the internet is unstructured.
Read more: Structured vs. Unstructured Data: What’s the Difference?
Related: Structured data

Z

`Zero knowledge proof`

A "cryptographic mechanism that allows anyone to prove the truth of a statement without having to share the information in a statement."

Read more:
- Hacker Lexicon: What Are Zero-Knowledge Proofs?
- 4 Ways to Compare Trusted Execution Environments and Zero-Knowledge Proofs

`zk-SNARK`

Short for "Zero-Knowledge Succinct Non-Interactive Argument of Knowledge".

Allows a Prover "[to create] a unique fingerprint for each proof [...] making it impossible to reverse-engineer the original statement. Essentially, these polynomial equations are solvable only by the Prover but verifiable by anyone. They're a puzzle to which only the Prover knows the answer, yet anyone can confirm the answer is correct without knowing what it is."

Read more:
- What is a zk-SNARK?
- What are zk-SNARKs?
