p5y - Privacy Masking Language

A

Accuracy

Accuracy, in machine learning, measures the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances. This metric is the most commonly reported by companies to reassure customers and users of the safety and efficacy of using their technology for privacy redaction, with reported metrics often in high 90% range.
While accuracy is a useful metric, it has limitations when evaluating privacy redaction technology. This is because privacy redaction often involves rare, sensitive information, and high accuracy may obscure the model's failure to properly redact crucial private details (false negatives). Additionally, a model that performs well on non-sensitive data but poorly on sensitive data can still achieve high accuracy, masking its real effectiveness in privacy protection.
Thus, other metrics like precision, recall, and F1-score are more appropriate for evaluating privacy redaction systems.

Anonymization

The process of altering or removing identifiable details from unstructured text to prevent the identification of the data subjects related to the text.

Examples: Replacing names with generic placeholders or blurring out addresses in a scanned letter.

Aliases: De-identification, Privacy redaction

Awareness (Privacy)

In the context of data privacy, Awareness refers to the process of identifying, analyzing, and marking private and sensitive information within unstructured data. It involves scanning the data to gain structured insights, such as understanding the types of personal information present, its distribution, and potential risks. This step helps organizations assess their privacy exposure, evaluate compliance with regulations, and make informed decisions about how to manage and protect personal data.

Awareness (Privacy)

D

De-anonymization

De-anonymization is the process of identifying individuals or re-establishing the identity of anonymized data by leveraging additional information or data sources. It involves techniques that can uncover the original identities or sensitive information that were supposed to be protected during the anonymization process.

Density (of personal data)

The density of personal data in a dataset of unstructured text refers to the proportion of content within the text that contains identifiable personal information compared to the total volume of text data. This metric helps assess how concentrated sensitive information is within a given body of unstructured text, such as emails, social media posts, articles, or transcripts.

Calculation

To calculate the density of personal data in unstructured text, you can follow these steps:

Identify Personal Data
Count Relevant Instances
Calculate Density: Divide the number of personal data instances by the total number of words or sentences in the unstructured text, then multiply by 100 to express it as a percentage.

Density=(Number of Personal Data Instances/Total Words)×100

The density of personal data in unstructured text is crucial for assessing privacy risks, ensuring regulatory compliance, developing targeted data management strategies, and enhancing NLP techniques for effectively handling sensitive information.

E

Entity

In the context of privacy redaction, an Entity refers to a specific piece of information that can contribute to identifying an individual or reveal sensitive personal details. Entities are categorized based on the type of personal data they represent and are typically targeted for anonymization, masking, or removal to ensure privacy compliance.

An Entity has a Type, e.g. person name, and associated Entity Values, e.g. John Doe, Mary.

Examples: person name, address, telephone number, gender, occupation.

F

F1-score

The F1-score is the harmonic mean of precision and recall, providing a balanced metric that considers both false positives and false negatives. It ranges from 0 to 1, with 1 being the best possible score.

In privacy redaction, the F1-score balances the trade-off between precision and recall, offering a more comprehensive evaluation of the redaction system. A high F1-score indicates that the system not only captures a large portion of sensitive data (high recall) but also minimizes the over-redaction of non-sensitive information (high precision). This makes it an effective metric for assessing the overall performance of privacy redaction technologies, where both under- and over-redaction can have significant consequences.

G

Globalization (g11n)

Globalization (often abbreviated as g11n, where "11" represents the number of letters between the first and last letters "g" and "n") refers to the comprehensive process of designing and marketing a product or service for a global audience. It encompasses both internationalization (i18n) and localization (l10n), aiming to make the product adaptable to various regions, languages, and cultures while maintaining a unified brand or experience.

I

Internationalization (i18n)

Internationalization (often abbreviated as i18n, where "18" represents the number of letters between the first and last letters "i" and "n") refers to the process of designing software, applications, or systems in a way that they can be easily adapted to different languages, regions, and cultures without requiring significant engineering changes. It involves preparing codebases to support various formats, such as date, time, currency, and text direction, as well as allowing for the translation of user interfaces and content.

Internationalization is typically the first step in making products globally accessible, enabling the simpler process of localization (l10n), which focuses on adapting content to specific regions or languages.

K

K-anonymization

K-anonymization is a data privacy technique used to protect individual identities in a dataset by ensuring that each person's data cannot be distinguished from at least k-1 other individuals. In other words, the data for each individual is made indistinguishable from at least k other people in the dataset, thereby preventing re-identification of any single individual.

L

Label (Privacy)

A Privacy Label is a category or tag assigned to a specific piece of personal data within a text. Labels help identify and classify entities such as names, locations, or other significant information. They should be generic and not contain personal or sensitive information themselves.

Examples:

Good Examples: GIVENNAME, AGE, STREET
Bad Examples: NAME-F, PASSPORT-US (These imply gender and nationality.)

Bad Examples: NAME-F, PASSPORT-US (These imply gender and nationality.)

Schema:

Label Object:
{
"LabelPlaceholder": "string",
"Description": "string"
}

Example:

{
"LabelPlaceholder": "GIVENNAME",
"Description": "Represents a given name or first name of an individual."
}

Label Set (Data Privacy)

A Label Set in the context of data privacy refers to a collection of tags or classifications assigned to specific types of personal or sensitive information within a dataset. These labels are used to identify and categorize data based on privacy concerns, such as personally identifiable information (PII), health data, financial data, or sensitive data. A label set helps organizations manage privacy by enabling the systematic handling of data types that require different levels of protection, redaction, or anonymization.

Components:

Title: Name of the label set.
Description: Explanation of what the label set contains or is used for.
Source: Origin or dataset associated with the label set.
Labels: A list of Label objects included in the set.

Examples of PII Label Sets:

"pii-masking-200k": A label set used for the dataset "pii-masking-200k."
"pii-masking-300k": A label set used for the dataset "pii-masking-300k."
"pii-masking-400k": A label set for "pii-masking-400k."
Label Set Subset: Specific labels within a label set, such as "financial labels in pii-masking-300k."

Schema:

LabelSet Object:
{
"Title": "string",
"Description": "string",
"Source": "string",
"Labels": ["Label Object"]
}

Example:

{
"Title": "pii-masking-300k",
"Description": "Contains labels for the pii-masking-300k dataset.",
"Source": "pii-masking-300k dataset",
"Labels": [
{
  "LabelPlaceholder": "GIVENNAME",
  "Description": "First name of an individual."
},
{
  "LabelPlaceholder": "AGE",
  "Description": "Age of an individual."
},
{
  "LabelPlaceholder": "STREET",
  "Description": "Name of a street."
}
// Additional labels...
]
}

Label Set Selection

Label Set Selection is a crucial step in the data anonymization process, where specific labels (types of information) are chosen based on their relevance and sensitivity to a particular use case. The goal is to determine which data can be safely shared and which must be excluded or anonymized to protect privacy. Not all labels in a dataset may be applicable for every scenario, so selecting the appropriate subset ensures compliance with privacy regulations while maintaining data utility.

Localization (l10n)

Localization (often abbreviated as l10n, where "10" represents the number of letters between the first and last letters "l" and "n") refers to the process of adapting software, applications, or content to meet the specific linguistic, cultural, and regulatory requirements of a particular region or language. This includes translating text, adjusting date and time formats, adapting currency, addressing cultural nuances, and ensuring compliance with local laws and standards.

It typically follows internationalization (i18n), which prepares the product for easier localization.

M

Machine Learning Tasks Relevant for Privacy Masking

Token Classification IOB:

Token classification is a fundamental task in Natural Language Processing (NLP) where each token (word) in a text sequence is assigned a label that categorizes it. The Inside-Outside-Beginning (IOB) format is a common tagging scheme used in this task, especially for Named Entity Recognition (NER).

Token classification using the IOB format helps in systematically identifying and categorizing entities like names, locations, organizations, and more. It is crucial for detecting PII within text data, allowing for targeted anonymization or redaction.

Example (Sourced from pii-masking-400k, entry id: 224845):

Source String: "Report: 531 Dundridge Lane, SO32 1GD."

The IOB tagging would be:

Report -> O
: -> O
531 -> B-BUILDINGNUM
Dundridge -> B-STREET
Lane -> I-STREET
, -> O
SO32 -> B-ZIPCODE
1GD -> I-ZIPCODE
. -> O

Span Extraction:

Span extraction involves identifying and extracting specific segments or "spans" of text that contain relevant information. The model predicts the start and end positions of a text span that corresponds to a particular entity or answer to a query.

Unlike token classification, which labels each token individually, span extraction focuses on capturing the exact substring.

Example sentence:

"Please send the report to jane.doe@example.com by Monday."

A span extraction model would identify:

Span: "jane.doe@example.com"
Start Index: Position of 'j' in "jane"
End Index: Position after 'm' in ".com"
Label: "Email"

Span extraction is used to precisely locate PII within unstructured text, enabling accurate masking or substitution. It is particularly useful for entities that may not be easily tokenized, such as email addresses, URLs, or code snippets and when the entity length in the text must be preserved.

Masked Text Placeholder Function

The Placeholder Function is a key component in the p5y framework that replaces sensitive data in unstructured text with standardized placeholders. This function ensures that personally identifiable information (PII) is anonymized while maintaining the structure of the source text.

Tokenization and Placeholder Variables:

Entities such as names, amounts, and currencies are identified and replaced with tokens or placeholders. For example:

Tokens for First Names: FIRSTNAME_1, FIRSTNAME_2
Consistent Entity Representation: The same entity string value is consistently replaced with the same placeholder, e.g., FIRSTNAME_1 for every occurrence of "Bob".

Bob has 100 USD and Jenny has 100 USD.

[GIVENNAME_1] has [AMOUNT_1] [CURRENCY_1] and [GIVENNAME_2] has [AMOUNT_1] [CURRENCY_1].

In this example:

"Bob" is replaced with [GIVENNAME_1].
"Jenny" is replaced with [GIVENNAME_2].
"100 USD" is consistently replaced with [AMOUNT_1] [CURRENCY_1].

Good vs. Bad Masking Practices

Effective masking should remove identifiable information without distorting the text's meaning. Here's a comparison:

Bad Masking Example:

This is Mike -> This is M***

Partial masking like "M***" can still allow identification of the individual, especially if the context is included.

Good Masking Example:

This is Mike -> This is [GIVENNAME_1]

Placeholder Algorithm Variables

Privacy Token Separator: Characters used to enclose placeholders, e.g., [].
Joiner: Symbol used to connect placeholder label to its contextual occurrence count, e.g., _.

[GIVENNAME_1], where `[` and `]` are start and end token separators and `_` joins "GIVENNAME" and "1".

Masking (Privacy)

In the context of data privacy, masking refers to the process of hiding or obscuring specific pieces of sensitive or personally identifiable information (PII) within a dataset to protect it from unauthorized access or exposure. Masking is often used to prevent the disclosure of sensitive data while still allowing certain aspects of the data to be used for analysis or processing.

P

Personal Data

"Any information relating to an identified or identifiable natural person; an identifiable person is one who can be identified, directly or indirectly — in particular by reference to an identification number or to one or more factors specific to their physical, physiological, mental, economic, cultural or social identity.’" (as defined by the European Union, in the General Data Protection Regulation GDPR)

Examples: person name, email address, telephone number, religious and political affiliation, home address, occupation, medical condition, family status, account balance, criminal history.

Aliases: Personal Information, US: Personally Identifiable Information (PII)

Personally Identifiable Information (PII)

US - synonym for Personal data

"Any information about an individual, including any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and any other information that is linkable to an individual, such as medical, educational, financial, and employment information.’" (as defined by IAPP)

Precision

Precision, in machine learning, measures the proportion of true positives out of all instances that were predicted as positive. In other words, it reflects how accurately the model identifies relevant instances without misclassifying non-relevant ones.

In the context of privacy redaction, precision indicates how well the system avoids over-redacting non-sensitive information. A high precision means that when the system flags something as sensitive, it is usually correct, thus preserving the integrity of non-sensitive data. While recall focuses on catching all sensitive details, precision ensures that unnecessary information is not redacted, balancing the need for privacy with data utility.

Privacy

Privacy refers to the individual's right to control access to their personal information and the ability to manage how that information is collected, used, shared, or disclosed. It involves the protection of sensitive data, ensuring that individuals' personal details are kept confidential and are not misused or exposed without their consent.

In the digital age, privacy encompasses not just physical privacy but also data privacy, which relates to how organizations, governments, and individuals handle personally identifiable information (PII), such as names, addresses, email addresses, financial details, health records, and other data that could be used to identify or track a person.

Privacy Baseline

In the context of privacy, Privacy Foundation refers to the foundational understanding of privacy principles and expectations as derived from multiple stakeholders: governments, companies, and individuals. Privacy is inherently complex and dynamic, making it challenging to standardize because regulations vary widely across regions and contexts. Unlike other fields, privacy laws often do not provide a clear, uniform set of labels or definitions, leaving room for interpretation and adaptation based on the needs and perspectives of different parties.

Privacy Language

A data format for text data, where private information has a special encoding to distinguish it from the rest of the text. This format translates regular natural languages into a language that facilitates meeting privacy requirements and regulations for any use-case.

Privacy Language is part of the p5y standardized data privacy framework for text data privacy.

Privacy Mask Data Structure

A privacy mask is a layer applied to text that provides information about the sensitive or private data contained within it. It identifies and highlights specific portions of the text that may require protection, without altering the original content. This layer helps to ensure privacy compliance by marking data for redaction, anonymization, or restricted access based on privacy regulations and policies.

Within the privacy mask, there are a set of entities which:

Define Start and End Positions: It accurately determines the positions where sensitive entity values begin and end within the text.
Contain the Predicted Label and Its Associated Confidence: Each entity includes the predicted label and an optional confidence score.

Schema:

Privacy Mask Entity:
- Value in text (string)
- Label name (string)
- Start position of entity value (int)
- End position of entity value (int)
- Confidence (optional, float)

Example (Sourced from pii-masking-400k, entry id: 224845):

Source String: Report: 531 Dundridge Lane, SO32 1GD.

Privacy Mask Entity 1:

Value in text: "531"
Label name: "BUILDINGNUM"
Start position of entity value: 8
End position of entity value: 11
Confidence: 1

Privacy Mask Entity 2:

Value in text: "Dundridge Lane"
Label name: "STREET"
Start position of entity value: 12
End position of entity value: 26
Confidence: 1

Privacy Mask Entity 3:

Value in text: "SO32 1GD"
Label name: "ZIPCODE"
Start position of entity value: 28
End position of entity value: 36
Confidence: 1

Privacy Masking

Masking in the context of privacy refers to the process of obscuring or altering specific pieces of sensitive or personally identifiable information (PII) within a dataset or document to protect it from unauthorized access. The goal of masking is to retain the utility of the data for legitimate purposes (like analysis or processing) while safeguarding privacy and preventing the exposure of sensitive information.

Privacy Translation

Refers to the process of transforming data containing personal or sensitive information into a format that complies with privacy regulations and policies across different regions, industries, or systems. The goal of privacy translation is to ensure that personal data remains protected while being shared or processed in various contexts, much like how language translation allows information to be communicated across different languages.
Within the p5y framework, this process translates unstructured text data into the privacy mask data-structure format (Privacy Language).

Protection (Privacy)

In the context of data privacy, Protection refers to the process of controlling and safeguarding personal data identified in unstructured texts. This step involves determining which personal information should be removed, masked, or altered, and selecting the appropriate anonymization techniques, such as masking, pseudonymization, or k-anonymization.

Pseudonymization

Pseudonymization is a data privacy technique where personally identifiable information (PII) is replaced with artificial identifiers, or "pseudonyms," in such a way that the original data cannot be easily linked back to an individual without additional information. Unlike anonymization, which permanently removes the link to the original identity, pseudonymization allows for the possibility of re-identifying the individual under controlled conditions by using a separate "key" or reference data.

Q

Quality Assurance (Privacy)

In the context of data privacy, Quality Assurance refers to the final step in the data anonymization process, where the effectiveness of the anonymization is evaluated. This step measures the remaining privacy risks, ensuring that the target entities have been properly anonymized and assessing the likelihood of de-anonymization.

The process involves both expert human annotation and automated models to identify any potential vulnerabilities or risks of re-identifying anonymized data. This step ensures that the anonymization meets the required standards and that the data is safe for use, reducing the risk of exposing sensitive information.

R

Recall

Recall, in the context of machine learning, measures the proportion of true positives correctly identified out of all actual positive instances. In simpler terms, it indicates how well a model detects all relevant instances of a particular class (e.g., personal information in privacy redaction).

Recall is especially important for privacy redaction because the primary goal is to ensure that all personal information is detected and removed if necessary. A high recall means fewer instances of private data are left unredacted, reducing the risk of exposure. In privacy-focused tasks, missing sensitive details (false negatives) can be far more damaging than accidentally over-redacting non-sensitive information, making recall a critical metric for evaluating the efficacy of redaction technologies.

Rest-risiko (privacy)

Rest-Risiko, or residual risk, refers to the remaining risk to privacy that persists after anonymization or redaction of personal data has been performed. Even after measures are taken to protect sensitive information, there may still be potential vulnerabilities or identifiable patterns that could lead to re-identification of individuals or exposure of sensitive data.

Rest-risiko for privacy involves potential residual threats from anonymization efforts, as techniques like re-identification and data linking can still expose personal information; the effectiveness of anonymization methods can vary, the context of data usage may affect identifiability, and organizations must regularly assess these risks to ensure compliance with data protection regulations and maintain effective governance frameworks.

In summary, rest-risiko highlights the importance of ongoing vigilance and assessment in data protection strategies, ensuring that residual risks are adequately managed even after anonymization and redaction processes are implemented.

Risk Tolerance

Risk tolerance regarding residual risk (Rest-Risiko) for privacy refers to the level of acceptable risk an organization is willing to accept after implementing measures to anonymize or protect personal data. It reflects the balance between the potential benefits of data usage and the risks associated with the possibility of re-identifying individuals or exposing sensitive information.

Key aspects of risk tolerance to Rest-Risiko for privacy include defining assessment criteria for evaluating the impact of privacy breaches, involving stakeholder input for a comprehensive understanding of risks, conducting cost-benefit analyses to justify acceptable residual risks, recognizing that risk tolerance can evolve with changing regulations and technology, and establishing mitigation strategies and contingency plans to manage residual risks effectively.

S

Sensitive Personal Information (SPI)

Any personal data that could potentially cause harm, damage, embarrassment, or discrimination to an individual or a group, if it is disclosed, accessed, or used. This includes data that could result in the development of unfair models or biased human decisions. Note: it is not necessary for the individual to be identifiable.

An alternative interpretation, considers SPI as personal data that is generally considered as more private, such as medical and financial information (see IAPP)

Examples: gender, economic status, nationality, medical condition

Aliases: Sensitive Data, Sensitive Information

Source Texts

The Source Text, often referred to as the "unmasked text," is the original unstructured text data that has not undergone any anonymization or masking. These texts are raw data entries that may contain sensitive or personally identifiable information (PII).

Importance of Source Texts in Privacy Masking:

Identification of PII: Source texts are analyzed to detect PII that needs to be masked or anonymized to protect individual privacy.
Annotations for Machine Learning: The associated columns like token labels and span labels are crucial for training machine learning models for tasks such as NER and span extraction.
Facilitating Data Processing Pipelines: Including metadata like locale, language, and split aids in organizing and processing data efficiently.

Context:

Entries and Datasets: Source texts are individual data entries that are part of larger datasets. Each entry typically includes the source text along with additional metadata and annotations.
Associated Columns: In datasets, source texts are often accompanied by various columns that provide additional information or processing results.

Common Columns Associated with Source Texts:

Source Text ID:

Description: A unique identifier for each source text entry.
Purpose: Facilitates tracking and referencing specific texts within the dataset.

Target Text:

Description: The processed or transformed version of the source text, such as a translated, paraphrased, or masked version.
Purpose: Used for tasks like machine translation, text summarization, or privacy masking.

Tokenized Source Text:

Description: The source text broken down into individual tokens (e.g., words, subwords, or characters).
Purpose: Useful for NLP tasks that require token-level analysis, such as token classification or language modeling.

Locale:

Description: The regional or cultural context of the text, often specified using locale codes (e.g., en-US for English - United States).
Purpose: Important for localization, regional analysis, and applying locale-specific processing rules.

Language:

Description: The language of the source text (e.g., English, Spanish).
Purpose: Enables language-specific processing and analysis.

Split:

Description: Indicates the dataset partition to which the source text belongs (e.g., train, validation, test).
Purpose: Used to organize data for machine learning workflows, ensuring proper training and evaluation.

Token Labels:

Description: Labels assigned to each token in the tokenized source text, often used for token classification tasks.
Purpose: Supports tasks like Named Entity Recognition (NER) by identifying the role or category of each token.

Span Labels:

Description: Annotations that identify specific spans (substrings) within the source text and assign labels to them.
Purpose: Used for span extraction tasks, where the goal is to locate and classify specific segments of text.

Privacy Mask:

Description: A data structure that contains information about sensitive entities within the source text that need to be masked or anonymized.
Purpose: Facilitates privacy compliance by identifying and processing PII within the text.

Structured Text

Structured text refers to data that is organized in a predefined format, making it easily searchable and analyzable. This type of text is typically arranged in tables, spreadsheets, or databases, where information is stored in rows and columns with defined fields or labels (e.g., names, dates, categories). Each piece of data follows a specific format, such as a phone number, email address, or product ID, and fits neatly into its corresponding place within a schema.

Because structured text is highly organized, it can be easily processed by algorithms or queried using tools like SQL. Unlike unstructured text, structured text requires little to no preprocessing to analyze or extract useful information.

Synthetic Identities

Synthetic identities are entirely fictional profiles created by combining made-up personal information. In data privacy, these identities use elements like invented names, addresses, and other details but are not linked to any real individual. A good synthetic identity is generated randomly, ensuring that the chance of the combination of its elements (name, address, telephone number, etc.) matching those of an actual person is practically 0%.

U

Unstructured Text

Unstructured text refers to data or information that does not follow a predefined format or organizational model. Unlike structured data, which is neatly organized in tables or databases (e.g., rows and columns), unstructured text can appear in free-form formats such as emails, social media posts, articles, books, transcripts, or web pages. It lacks the inherent structure that makes it easy to categorize or analyze without preprocessing.

Dealing with unstructured text presents challenges because it often contains irregularities, ambiguities, and variations in language, requiring natural language processing (NLP) techniques to extract meaningful insights, patterns, or structure, such as to identify email addresses, person names or bank account numbers for privacy redaction.

P

p5y

p5y is a standardized data privacy framework to safely handle any unstructured text that contains personally identifiable and sensitive information.
It does so by translating data into its privacy language version, that can then be easily customized for different use-cases.

The methodology and tools provided facilitate compliance with regulations like GDPR, HIPAA, and EU AI Act.