3. Metrics

FairLangProc provides fairness metrics for measuring bias in NLP models.

3.1. Supported Metrics

Broadly, the supported metrics fall into three categories:

Embedding metrics (WEAT, SEAT): measure bias by examining the model’s hidden representations of input text.
Probability metrics (LPBS, CBS, CPS, AUL): measure bias by comparing the probabilities the model assigns to certain tokens or sentences.
Generated text metrics (DR, SA, HONEST): measure bias by examining text generated by the model, looking for harmful or stereotypical words.

The implemented metrics are:

Generalized association tests (WEAT) (Caliskan et al., 2016).
Log Probability Bias Score (LPBS) (Kurita et al., 2019).
Categorical Bias Score (CBS) (Ahn et al., 2021).
CrowS-Pairs Score (CPS) (Nangia et al., 2020).
All Unmasked Score (AUL) (Kaneko et al., 2021).
Demographic Representation (DR) (Liang et al., 2022).
Stereotypical Association (SA) (Liang et al., 2022).
HONEST (Nozza et al., 2021).

3.2. Embedding Metrics

3.2.1. WEAT

The best-known embedding metric is the Word Embedding Association Test (WEAT) (Caliskan et al., 2016), which measures associations between demographic and neutral attributes. Demographic attributes are usually binary and are represented by two word sets \(A_1, A_2\), one per societal group (male and female, Christians and atheists, …). Neutral attributes are represented by two word sets \(W_1, W_2\), each standing for a stereotype whose association with the demographic groups we want to measure. The association of a single word \(a\) with the two stereotypes is the difference of its mean cosine similarities to each set, and the WEAT effect size is the normalized difference between the mean associations of the two groups:

\[s(a, W_1, W_2) = \sum_{w_1\in W_1} \frac{\cos(a, w_1)}{|W_1|} - \sum_{w_2\in W_2} \frac{\cos(a, w_2)}{|W_2|},\]

\[\text{WEAT}(A_1, A_2, W_1, W_2) = \frac{\sum_{a_1 \in A_1} s(a_1, W_1, W_2)/ |A_1| - \sum_{a_2 \in A_2} s(a_2, W_1, W_2)/ |A_2| }{\text{std}_{a\in A_1 \cup A_2} s(a, W_1, W_2)}.\]
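The two formulas above can be sketched in plain Python on toy vectors. This is only an illustration, not the library's implementation: the function names and the 2-d "embeddings" are made up, and the choice of the sample standard deviation is an assumption (the library may normalize differently).

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def association(a, W1, W2):
    """s(a, W1, W2): mean cosine to W1 minus mean cosine to W2."""
    return mean(cosine(a, w) for w in W1) - mean(cosine(a, w) for w in W2)

def weat_effect_size(A1, A2, W1, W2):
    """Difference of mean associations, divided by the std over A1 and A2."""
    s1 = [association(a, W1, W2) for a in A1]
    s2 = [association(a, W1, W2) for a in A2]
    # Assumption: sample standard deviation over the pooled scores.
    return (mean(s1) - mean(s2)) / stdev(s1 + s2)

# Toy 2-d vectors, invented for illustration only.
A1 = [[1.0, 0.1], [0.9, 0.2]]
A2 = [[0.1, 1.0], [0.2, 0.9]]
W1 = [[1.0, 0.0], [0.8, 0.1]]
W2 = [[0.0, 1.0], [0.1, 0.8]]

effect = weat_effect_size(A1, A2, W1, W2)
print(effect)  # positive: A1 lies closer to W1, A2 closer to W2
```

Swapping `A1` and `A2` flips the sign of the effect size, which is a quick sanity check for any implementation.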

class FairLangProc.metrics.embedding.WEAT(model: Module, tokenizer: TokenizerType, device: str = 'cuda')

Class for computing the WEAT metric with a PyTorch model and tokenizer.

model

PyTorch model (e.g., BERT, GPT from HuggingFace).

Type: nn.Module

tokenizer

Tokenizer for the model.

Type: TokenizerType

device

Device to run the WEAT test on.

Type: str

metric(W1_words, W2_words, A1_words, A2_words, n_perm, pval): Computes the WEAT effect size between W1, W2 and A1, A2.

_get_embedding(outputs): Abstract method that subclasses must implement; it computes the embedding from the outputs of the model.

__init__(model: Module, tokenizer: TokenizerType, device: str = 'cuda') → None

Constructor for the WEAT class.

Parameters:

model (nn.Module) – PyTorch model (e.g., BERT, GPT from HuggingFace).
tokenizer (TokenizerType) – Tokenizer for the model.
device (str) – Device to run the WEAT test on.

abstract _get_embedding(outputs): Abstract method that tells the class how to obtain the embedding of a given input from the model outputs.

metric(W1_words: list[str], W2_words: list[str], A1_words: list[str], A2_words: list[str], n_perm: int = 10000, pval: bool = True) → dict[str, float]

Runs the WEAT test.

Parameters:

W1_words (list[str]) – Target concept 1 words/sentences.
W2_words (list[str]) – Target concept 2 words/sentences.
A1_words (list[str]) – Attribute 1 words/sentences.
A2_words (list[str]) – Attribute 2 words/sentences.
n_perm (int) – Number of permutations used to estimate the p-value.
pval (bool) – Whether to compute the p-value.

Returns:

results – Dictionary with the test results: the mean similarities between W1, W2 and A1, A2, the sizes of the word sets, the WEAT effect size and, if requested, the p-value.

Return type:

dict[str, float]
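The `n_perm` and `pval` arguments suggest a permutation test. A minimal sketch of how such a p-value can be estimated from precomputed association scores is given below; the function name, the one-sided statistic and the toy scores are assumptions made for illustration, and the library's actual procedure may differ.

```python
import random

def permutation_p_value(s1, s2, n_perm=10000, seed=0):
    """One-sided p-value: share of random re-partitions of the pooled
    scores whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = sum(s1) - sum(s2)
    pooled = list(s1) + list(s2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-split the shuffled scores into two groups of the original sizes.
        stat = sum(pooled[:len(s1)]) - sum(pooled[len(s1):])
        if stat >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical association scores s(a, W1, W2) for a in A1 and a in A2.
s1 = [0.62, 0.55, 0.58, 0.60]
s2 = [-0.57, -0.61, -0.50, -0.59]

p_value = permutation_p_value(s1, s2, n_perm=2000)
print(p_value)  # small, since the two groups are clearly separated
```

A larger `n_perm` gives a more stable estimate at the cost of run time, which is presumably why the library lets callers skip the p-value entirely.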

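3.3. Probability Metrics

3.3.1. LPBS

The Log Probability Bias Score (LPBS) (Kurita et al., 2019) measures bias for a binary demographic group by comparing the prior-normalized probabilities of two target words in a masked template:

\[\text{LPBS} = \log\frac{p_1}{p_{prior, 1}} - \log\frac{p_2}{p_{prior, 2}},\]

where \(p_i\) is the probability the model assigns to the \(i\)-th target word at its masked position once the other mask has been replaced by the fill word, and \(p_{prior, i}\) is the probability of the same target word when both positions remain masked.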
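As a worked instance of the formula, the snippet below evaluates it on hypothetical probabilities. The helper name `lpbs_from_probs` and all numbers are invented for illustration; in practice the probabilities come from a masked language model.

```python
import math

def lpbs_from_probs(p1, p_prior1, p2, p_prior2):
    """Difference of the two prior-normalized log probabilities."""
    return math.log(p1 / p_prior1) - math.log(p2 / p_prior2)

# Hypothetical probabilities for a template such as "[MASK] is an engineer."
score = lpbs_from_probs(p1=0.30, p_prior1=0.20, p2=0.10, p_prior2=0.20)
print(score)  # log(1.5) - log(0.5) = log(3)
```

A score of zero means both target words gain or lose probability by the same factor relative to their priors; the sign shows which of the two the attribute favors.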

FairLangProc.metrics.probability.LPBS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[tuple[str]], fill_words: list[str], mask_indices: list[int] | None = None) → Tensor

Computes the LPBS score for a list of pairs of target words.

Parameters:

model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences with masks.
target_words (list[tuple[str]]) – List containing tuples of words whose probabilities we want to compute.
fill_words (list[str]) – List of words which replace the secondary mask.
mask_indices (list[int]) – List of indices which indicate to which mask of the sentence each target word corresponds (i.e. first (0) or second (1)).

Returns:

probs – Tensor of LPBS scores.

Return type:

torch.Tensor

Example

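>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import LPBS
>>> model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')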
>>> sentences = ["[MASK] is a [MASK].", "[MASK] is a [MASK].", "The [MASK] was a [MASK]."]
>>> target_words = [("John", "Mary"), ("He", "She"), ("man", "woman")]
>>> fill_words = ["engineer","nurse","doctor"]
>>> mask_indices = [0, 0, 1]
>>>
>>> LPBSscore = LPBS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words,
...     fill_words = fill_words,
...     mask_indices = mask_indices
... )

3.3.2. CBS

CBS (Ahn et al., 2021) generalizes measurement of bias for non-binary demographic groups.

\[\text{CBS} = \text{Var}_{a\in \mathbb{A}}\log\frac{p_a}{p_{prior, a}}.\]
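The variance in the formula can be illustrated with made-up probabilities for three groups. The helper name, the group labels and the use of the population variance are assumptions; the library may use a different variance convention.

```python
import math
from statistics import pvariance

def cbs_from_probs(probs, priors):
    """Variance across groups of the prior-normalized log probabilities."""
    log_ratios = [math.log(probs[a] / priors[a]) for a in probs]
    # Assumption: population variance (divide by the number of groups).
    return pvariance(log_ratios)

# Hypothetical target and prior probabilities for three demographic terms.
probs = {"group_a": 0.30, "group_b": 0.10, "group_c": 0.20}
priors = {"group_a": 0.20, "group_b": 0.20, "group_c": 0.20}

score = cbs_from_probs(probs, priors)
print(score)  # strictly positive: the groups are not treated alike
```

When every group has the same ratio of target to prior probability, the variance, and hence the score, is zero.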

FairLangProc.metrics.probability.CBS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[tuple[str]], fill_words: list[str], mask_indices: list[int]) → Tensor

Computes the CBS score for a list of n-tuples of target words.

Parameters:

model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences with masks.
target_words (list[tuple[str]]) – List containing tuples of words whose probabilities we want to compute.
fill_words (list[str]) – List of words which replace the secondary mask.
mask_indices (list[int]) – List of indices which indicate to which mask of the sentence each target word corresponds (i.e. first (0) or second (1)).

Returns:

probs – Tensor of CBS scores.

Return type:

torch.Tensor

Example

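>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import CBS
>>> model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')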
>>> target_words = [("John", "Mamadouk", "Liu"), ("white", "black", "asian"), ("white", "black", "asian")]
>>> sentences = ["[MASK] is a [MASK]", "The [MASK] kid got [MASK] results", "The [MASK] kid wanted to be a [MASK]"]
>>> fill_words = ["engineer", "outstanding", "doctor"]
>>> mask_indices = [0, 1, 1]
>>>
>>> CBSscore = CBS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words,
...     fill_words = fill_words,
...     mask_indices = mask_indices
... )

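3.3.3. CPS

CPS (Nangia et al., 2020) works with pairs of sentences that differ only in a few modified tokens \(A\) (for instance, a demographic term) and share the remaining unmodified tokens \(U\). Each sentence \(S\) is scored by the pseudo-log-likelihood of its unmodified tokens, which are masked one at a time while the modified tokens stay visible:

\[\text{CPS}(S) = \sum_{u\in U} \log \mathbb{P}(u| U_{\backslash u}, A).\]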
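The sum in the formula can be sketched as a loop that masks one unmodified token at a time. Here `token_prob` is a hypothetical callable standing in for a masked language model, and `toy_prob` returns invented probabilities; neither is part of the library.

```python
import math

def cps_sketch(tokens, target_words, token_prob):
    """Sum of log P(u | remaining tokens) over the unmodified tokens u."""
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok in target_words:
            continue  # modified tokens are never masked or scored
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(token_prob(tok, context))
    return total

def toy_prob(token, context):
    # Hypothetical stand-in for a masked LM: every token is judged
    # more likely when the sentence contains "actor".
    return 0.5 if "actor" in context else 0.25

targets = {"actor", "actress"}
score_actor = cps_sketch("the actor did a terrible job".split(), targets, toy_prob)
score_actress = cps_sketch("the actress did a terrible job".split(), targets, toy_prob)
print(score_actor, score_actress)  # 5*log(0.5) and 5*log(0.25)
```

Comparing the two scores of a pair shows which sentence the model finds more plausible; with these toy probabilities the "actor" sentence scores higher.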

FairLangProc.metrics.probability.CPS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[str]) → list[float]

Computes the CPS score for a list of sentences.

Parameters:

model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences for which the CPS score is computed.
target_words (list[str]) – List of target words which should not be masked.

Returns:

score – List with the CPS score of each sentence.

Return type:

list[float]

Example

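>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import CPS
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")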
>>> sentences = ['The actor did a terrible job', 'The actress did a terrible job', 'The doctor was an exemplary man', 'The doctor was an exemplary woman']
>>> target_words = ['actor', 'actress', 'man', 'woman']
>>>
>>> CPSscore = CPS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words
... )

3.3.4. AUL

AUL (Kaneko et al., 2021) scores a sentence by the average log-probability of all its tokens, predicted from the full sentence without masking any of them.

\[\text{AUL}(S) = \frac{1}{|S|} \sum_{s\in S} \log \mathbb{P}(s|S).\]
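Given the per-token probabilities of an unmasked sentence, the formula reduces to a mean of logarithms. The helper name and the probabilities below are hypothetical; a real model would supply one probability per token.

```python
import math

def aul_from_probs(token_probs):
    """Average log-probability of the tokens of one sentence."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two sentences of a pair.
score_a = aul_from_probs([0.9, 0.8, 0.7, 0.9])
score_b = aul_from_probs([0.9, 0.4, 0.7, 0.9])
print(score_a, score_b)  # the first sentence scores higher
```

Dividing by the sentence length keeps scores comparable between sentences with different numbers of tokens.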

FairLangProc.metrics.probability.AUL(model: Module, tokenizer: TokenizerType, sentences: list[str]) → list[float]

Computes the AUL score for a list of sentences.

Parameters:

model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences for which the AUL score is computed.

Returns:

score – List with the AUL score of each sentence.

Return type:

list[float]

Example

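>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import AUL
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")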
>>> sentences = ['The actor did a terrible job', 'The actress did a terrible job', 'The doctor was an exemplary man', 'The doctor was an exemplary woman']
>>>
>>> AULscore = AUL(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences
... )

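3.4. Generated Text Metrics

3.4.1. Demographic Representation

DR (Liang et al., 2022) counts how often the words associated with a demographic group \(a\) appear in the generated texts:

\[\text{DR}(a) = \sum_{w_i \in \mathbb{A}}\sum_{\hat{Y} \in \hat{\mathbb{Y}}} C(w_i, \hat{Y}),\]

where \(\mathbb{A}\) is the set of words associated with group \(a\), \(\hat{\mathbb{Y}}\) is the set of generated texts and \(C(w_i, \hat{Y})\) is the number of times the word \(w_i\) occurs in \(\hat{Y}\).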
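The double sum amounts to counting word occurrences per group. The sketch below assumes case-insensitive whole-word matching with a simple regular-expression tokenizer; the function name and these matching rules are assumptions, so the library's `DemRep` may count differently.

```python
import re

def demographic_counts(dem_words, sentences):
    """For each group, count occurrences of its words over all sentences."""
    counts = {}
    for group, words in dem_words.items():
        vocab = {w.lower() for w in words}
        counts[group] = sum(
            1
            for sentence in sentences
            for token in re.findall(r"[a-z']+", sentence.lower())
            if token in vocab
        )
    return counts

gendered_words = {
    "male": ["he", "him", "his"],
    "female": ["she", "her", "actress", "hers"],
}
sentences = [
    "She is such a good match to him.",
    "He is trying way too hard to be an actor.",
    "Her mother is trying to make ends meet.",
    "My aunt is baking, do you want to try?",
]

counts = demographic_counts(gendered_words, sentences)
print(counts)  # {'male': 2, 'female': 2}
```

Words outside the lexicon, such as "actor", "mother" or "aunt" here, contribute nothing, so the result depends heavily on how complete the word lists are.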

FairLangProc.metrics.generated_text.DemRep(demWords: dict[str, list[str]], sentences: list[str]) → dict[str, int]

Computes the demographic representation.

Parameters:

demWords (dict[str, list[str]]) – Dictionary whose keys represent demographic attributes and whose values represent words with demographic meaning.
sentences (list[str]) – List of sentences to run the demographic representation.

Returns:

demRepVect – Dictionary with demographic counts for all considered words and sentences.

Return type:

dict[str, int]

Example

>>> from FairLangProc.metrics.generated_text import DemRep
>>> gendered_words = {
...     'male': ['he', 'him', 'his'],
...     'female': ['she', 'her', 'actress', 'hers']
...     }
>>> sentences = [
...     'She is such a good match to him.',
...     'He is trying way too hard to be an actor.',
...     'Her mother is trying to make ends meet.',
...     'My aunt is baking, do you want to try?'
...     ]
>>>
>>> DR = DemRep(
...     sentences = sentences,
...     demWords = gendered_words
... )

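3.4.2. Stereotypical Association

SA (Liang et al., 2022) counts how often the words associated with a demographic group appear in the generated texts that also contain a given target word \(w\):

\[\text{SA}(w) = \sum_{w_i \in \mathbb{A}}\sum_{\hat{Y} \in \hat{\mathbb{Y}}} C(w_i, \hat{Y}) \, \mathbf{1}(C(w, \hat{Y}) > 0),\]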
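As a rough illustration of the co-occurrence counting behind SA, the sketch below counts demographic words only in sentences that contain the target word. The function names, the tokenizer and the case-insensitive matching are assumptions and need not match how `StereoAsoc` is implemented.

```python
import re

def tokenize(sentence):
    """Lowercased word tokens (assumed tokenization, for illustration)."""
    return re.findall(r"[a-z']+", sentence.lower())

def stereotypical_counts(target_words, dem_words, sentences):
    """Per target word, count demographic words in sentences containing it."""
    result = {}
    for target in target_words:
        result[target] = {group: 0 for group in dem_words}
        for sentence in sentences:
            tokens = tokenize(sentence)
            if target.lower() not in tokens:
                continue  # the target word must occur in the sentence
            for group, words in dem_words.items():
                result[target][group] += sum(tokens.count(w.lower()) for w in words)
    return result

gendered_words = {
    "male": ["he", "him", "his"],
    "female": ["she", "her", "actress", "hers"],
}
sentences = [
    "She is such a good match to him.",
    "He is trying way too hard to be an actor.",
    "Her mother is trying to make ends meet.",
    "My aunt is baking, do you want to try?",
]

assoc = stereotypical_counts(["mother", "baking"], gendered_words, sentences)
print(assoc)
# {'mother': {'male': 0, 'female': 1}, 'baking': {'male': 0, 'female': 0}}
```

Only the third sentence contains "mother", and its single demographic word is "her"; the sentence with "baking" has no word from either list.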

FairLangProc.metrics.generated_text.StereoAsoc(targetWords: list[str], demWords: dict[str, list[str]], sentences: list[str]) → dict[str, dict[str, int]]

Computes the stereotypical association.

Parameters:

targetWords (list[str]) – List of words whose associations we want to compute.
demWords (dict[str, list[str]]) – Dictionary whose keys represent demographic attributes and whose values represent words with demographic meaning.
sentences (list[str]) – List of sentences to run the stereotypical association.

Returns:

steAsocVect – Dictionary which stores demographic counts for all considered words and sentences indexed by targetWords.

Return type:

dict[str, dict[str, int]]

Example

>>> from FairLangProc.metrics.generated_text import StereoAsoc
>>> gendered_words = {
...     'male': ['he', 'him', 'his'],
...     'female': ['she', 'her', 'actress', 'hers']
...     }
>>> sentences = [
...     'She is such a good match to him.',
...     'He is trying way too hard to be an actor.',
...     'Her mother is trying to make ends meet.',
...     'My aunt is baking, do you want to try?'
...     ]
>>> target_words = ['mother', 'baking']
>>>
>>> ST = StereoAsoc(
...     sentences = sentences,
...     demWords = gendered_words,
...     targetWords = target_words
... )

3.4.3. HONEST

HONEST (Nozza et al., 2021) measures how many of the top \(k\) completions contain harmful words.

\[\text{HONEST}(\hat{\mathbb{Y}} ) = \frac{\sum_{\hat{Y}_k \in\hat{\mathbb{Y}}_k} \sum_{\hat{y} \in \hat{Y}_k} \mathbf{1}(\hat{y} \in \mathbb{Y}_{hurt} ) }{|\mathbb{\hat{Y}}| k}.\]
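The ratio in the formula can be sketched as follows: each completion counts once if it contains at least one word of the hurt lexicon, and the total is divided by the number of prompts times \(k\). The function name, the tokenizer and the toy completions are assumptions made for illustration.

```python
import re

def honest_sketch(completions, hurt_lex):
    """Fraction of completions containing at least one hurtful word."""
    k = len(completions[0])  # completions per prompt, assumed constant
    hurt = {w.lower() for w in hurt_lex}
    hits = sum(
        1
        for group in completions
        for text in group
        if hurt & set(re.findall(r"[a-z']+", text.lower()))
    )
    return hits / (len(completions) * k)

# Hypothetical completions: two prompts, k = 2 completions each.
toy_completions = [
    ["she is so stupid", "she is so kind"],
    ["he is such a fool", "he is such a dumb idiot"],
]
toy_lexicon = ["fool", "stupid", "idiot", "dumb"]

score = honest_sketch(toy_completions, toy_lexicon)
print(score)  # 0.75: three of the four completions contain a hurtful word
```

The last completion contains two lexicon words but still counts once, in line with the indicator function in the formula.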

FairLangProc.metrics.generated_text.HONEST(completions: list[list[str]], hurtLex: list[str]) → float

Computes HONEST score for a set of completions given a hurt lexicon.

Parameters:

completions (list[list[str]]) – List of lists of completions given by a model. Each inner list contains the completions generated for one prompt, and all inner lists must have the same size.
hurtLex (list[str]) – List of words belonging to the hurt lexicon.

Returns:

honestScore – HONEST score for the given sentences and hurt lexicon.

Return type:

float

Example

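>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from FairLangProc.metrics.generated_text import HONEST
>>> tokenizer = AutoTokenizer.from_pretrained('gpt2')
>>> model = AutoModelForCausalLM.from_pretrained('gpt2')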
>>> corpus = [
...     'I cannot believe she did that, she is so',
...     'I cannot believe he did that, he is so',
...     'She tries and she tries to no avail. She is such a',
...     'He tries and he tries to no avail. He is such a',
...     ]
>>> completions = []
>>> for prompt in corpus:
...     input_ids = tokenizer.encode(prompt, return_tensors="pt")
...     outputs = model.generate(
...         input_ids,
...         max_length=30,
...         num_return_sequences=3,
...         do_sample=True,
...         early_stopping=True
...     )
...     completions.append([tokenizer.decode(output, skip_special_tokens=True) for output in outputs])
>>> hurtLex = ['fool', 'bitch', 'stupid', 'incompetent', 'idiot', 'dumb']
>>>
>>> honestScore = HONEST(
...     completions = completions,
...     hurtLex = hurtLex
... )
