3. Metrics

FairLangProc provides comprehensive fairness metrics to measure discrimination in NLP models.

3.1. Supported Metrics

The supported metrics fall into three broad categories:

  • Embedding metrics (WEAT, SEAT): measure bias by examining the model’s hidden representations of input text.

  • Probability metrics (LPBS, CBS, CPS, AUL): measure bias by computing the probabilities of certain tokens or sentences.

  • Generated text metrics (DR, SA, HONEST): measure bias by examining text generated by the model, looking for harmful or stereotypical words.

The following subsections describe the implemented metrics.

3.2. Embedding Metrics

3.2.1. WEAT

The best-known embedding metric is the Word Embedding Association Test (WEAT) (Caliskan et al., 2016), which measures associations between demographic and neutral attributes. Demographic attributes are usually binary and denoted by \(A_1, A_2\), representing two societal groups (male and female, Christians and atheists, …). Neutral attributes are denoted by \(W_1, W_2\) and represent two stereotypes whose demographic association we are interested in. The association \(s\) of a single demographic word \(a\) with the two neutral sets, and the resulting effect size, are:

\[s(a, W_1, W_2) = \frac{1}{|W_1|}\sum_{w_1\in W_1} \cos(a, w_1) - \frac{1}{|W_2|}\sum_{w_2\in W_2} \cos(a, w_2),\]
\[\text{WEAT}(A_1, A_2, W_1, W_2) = \frac{\frac{1}{|A_1|}\sum_{a_1 \in A_1} s(a_1, W_1, W_2) - \frac{1}{|A_2|}\sum_{a_2 \in A_2} s(a_2, W_1, W_2)}{\text{std}_{a\in A_1 \cup A_2}\, s(a, W_1, W_2)}.\]
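As a rough illustration of the two formulas above, the following sketch computes the effect size on toy 2-D vectors in plain Python. It is not the library's implementation: the helper names (`cosine`, `association`, `weat_effect_size`) are hypothetical, and using the population standard deviation is an assumption, since the formula does not say which estimator is meant.

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def association(a, W1, W2):
    """s(a, W1, W2): mean similarity of a to W1 minus mean similarity to W2."""
    return mean(cosine(a, w) for w in W1) - mean(cosine(a, w) for w in W2)

def weat_effect_size(A1, A2, W1, W2):
    """Difference of mean associations, scaled by the std over A1 and A2."""
    s1 = [association(a, W1, W2) for a in A1]
    s2 = [association(a, W1, W2) for a in A2]
    # Assumption: population standard deviation over all associations.
    return (mean(s1) - mean(s2)) / pstdev(s1 + s2)

# Toy embeddings: A1 aligns with W1, A2 aligns with W2.
W1, W2 = [(1.0, 0.0)], [(0.0, 1.0)]
A1, A2 = [(1.0, 0.0)], [(0.0, 1.0)]
effect = weat_effect_size(A1, A2, W1, W2)
```

For these perfectly separated toy vectors the associations are 1 and -1, so the effect size evaluates to 2.0 under the stated convention.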
class FairLangProc.metrics.embedding.WEAT(model: Module, tokenizer: TokenizerType, device: str = 'cuda')

Class for handling WEAT metric with a PyTorch model and tokenizer.

model

PyTorch model (e.g., BERT, GPT from HuggingFace).

Type:

nn.Module

tokenizer

Tokenizer for the model.

Type:

TokenizerType

device

Device to run the WEAT test on.

Type:

str

metric(W1_words, W2_words, A1_words, A2_words, n_perm, pval)

Computes the WEAT effect size between W1, W2 and A1, A2.

_get_embedding(outputs)

Abstract method that subclasses must implement; it computes the embedding from an output given by the model.

__init__(model: Module, tokenizer: TokenizerType, device: str = 'cuda') -> None

Constructor for the WEAT class.

Parameters:
  • model (nn.Module) – PyTorch model (e.g., BERT, GPT from HuggingFace).

  • tokenizer (TokenizerType) – Tokenizer for the model.

  • device (str) – Device to run the WEAT test on.

abstract _get_embedding(outputs)

Abstract method that tells the class how to obtain an embedding from the model outputs.

metric(W1_words: list[str], W2_words: list[str], A1_words: list[str], A2_words: list[str], n_perm: int = 10000, pval: bool = True) -> dict[str, float]

Runs the WEAT test.

Parameters:
  • W1_words (list[str]) – Target concept 1 words/sentences

  • W2_words (list[str]) – Target concept 2 words/sentences

  • A1_words (list[str]) – Attribute 1 words/sentences

  • A2_words (list[str]) – Attribute 2 words/sentences

  • n_perm (int) – Number of permutations used to estimate the p-value

  • pval (bool) – Whether to compute the p-value

Returns:

results – Dictionary with the test results: the mean similarities between W1, W2 and A1, A2, the set sizes, the WEAT effect size and, if requested, the p-value.

Return type:

dict[str, float]
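The `n_perm` and `pval` arguments indicate that the p-value comes from a permutation test. The sketch below shows one common way to run such a test on precomputed association scores. It is an assumption about the general technique, not a description of how `metric` is implemented, and `permutation_pvalue` is a hypothetical helper.

```python
import random

def permutation_pvalue(scores_1, scores_2, n_perm=10000, seed=0):
    """One-sided permutation test on the difference of summed scores.

    scores_1 and scores_2 hold the association scores s(a, W1, W2) of the
    words in A1 and A2. Group labels are reshuffled n_perm times.
    """
    rng = random.Random(seed)
    pooled = list(scores_1) + list(scores_2)
    n1 = len(scores_1)
    observed = sum(scores_1) - sum(scores_2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = sum(pooled[:n1]) - sum(pooled[n1:])
        # Small tolerance so that float rounding does not hide exact ties.
        if stat >= observed - 1e-12:
            hits += 1
    return hits / n_perm

p_value = permutation_pvalue([1.0, 0.9, 0.8], [-1.0, -0.9, -0.8], n_perm=2000)
```

In this toy case only 1 of the 20 possible splits reaches the observed statistic, so the estimate should land near 0.05.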

3.3. Probability Metrics

3.3.1. LPBS

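LPBS (Kurita et al., 2019), the log probability bias score, measures bias for a binary demographic group. For each target word \(i \in \{1, 2\}\), \(p_i\) is the probability that the model assigns to the word at its mask once the secondary mask has been replaced by the fill word, and \(p_{prior, i}\) is the probability of the same word when the secondary mask is left in place: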

\[\text{LPBS} = \log\frac{p_1}{p_{prior, 1}} - \log\frac{p_2}{p_{prior, 2}}.\]
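To make the formula concrete, here is a small worked computation with made-up probabilities. The function `lpbs_from_probs` and the numbers are purely illustrative and are not part of FairLangProc.

```python
import math

def lpbs_from_probs(p1, prior1, p2, prior2):
    """Difference of the prior-normalised log-probabilities of two target words."""
    return math.log(p1 / prior1) - math.log(p2 / prior2)

# Hypothetical values: the context doubles the probability of the first
# target word and leaves the second one unchanged.
score = lpbs_from_probs(p1=0.2, prior1=0.1, p2=0.1, prior2=0.1)
```

A positive score means the context favours the first word over the second; here it equals log 2, roughly 0.693.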
FairLangProc.metrics.probability.LPBS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[tuple[str]], fill_words: list[str], mask_indices: list[int] | None = None) -> Tensor

Computes the LPBS score for a list of pairs (2-tuples) of target words.

Parameters:
  • model (nn.Module) – Language model used to compute probabilities.

  • tokenizer (TokenizerType) – Tokenizer associated with the model.

  • sentences (list[str]) – List of sentences with masks.

  • target_words (list[tuple[str]]) – List containing tuples of words whose probabilities we want to compute.

  • fill_words (list[str]) – List of words which replace the secondary mask.

  • mask_indices (list[int]) – List of indices which indicate to which mask of the sentence each target word corresponds (i.e. first (0) or second (1)).

Returns:

probs – Tensor of LPBS scores.

Return type:

torch.Tensor

Example

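>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import LPBS
>>> model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')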
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> sentences = ["[MASK] is a [MASK].", "[MASK] is a [MASK].", "The [MASK] was a [MASK]."]
>>> target_words = [("John", "Mary"), ("He", "She"), ("man", "woman")]
>>> fill_words = ["engineer","nurse","doctor"]
>>> mask_indices = [0, 0, 1]
>>>
>>> LPBSscore = LPBS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words,
...     fill_words = fill_words,
...     mask_indices = mask_indices
... )

3.3.2. CBS

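CBS (Ahn et al., 2021) generalizes this measurement to non-binary demographic groups by taking the variance of the prior-normalised log-probabilities across all groups \(a \in \mathbb{A}\):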

\[\text{CBS} = \text{Var}_{a\in \mathbb{A}}\log\frac{p_a}{p_{prior, a}}.\]
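The variance in the formula can be checked by hand on a few made-up probabilities. The helper `cbs_from_probs` is hypothetical, and the choice of the population variance is an assumption, since the formula does not fix the estimator.

```python
import math
from statistics import pvariance

def cbs_from_probs(probs, priors):
    """Variance across groups of the prior-normalised log-probabilities."""
    log_ratios = [math.log(p / q) for p, q in zip(probs, priors)]
    # Assumption: population variance; the formula does not fix the estimator.
    return pvariance(log_ratios)

# Hypothetical probabilities for three groups; identical ratios give zero.
unbiased = cbs_from_probs([0.2, 0.1, 0.4], [0.1, 0.05, 0.2])
biased = cbs_from_probs([0.2, 0.1, 0.1], [0.1, 0.1, 0.1])
```

When every group has the same ratio the score vanishes, and it grows as the ratios spread apart.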
FairLangProc.metrics.probability.CBS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[tuple[str]], fill_words: list[str], mask_indices: list[int]) -> Tensor

Computes the CBS score for a list of n-tuples of target words.

Parameters:
  • model (nn.Module) – Language model used to compute probabilities.

  • tokenizer (TokenizerType) – Tokenizer associated with the model.

  • sentences (list[str]) – List of sentences with masks.

  • target_words (list[tuple[str]]) – List containing tuples of words whose probabilities we want to compute.

  • fill_words (list[str]) – List of words which replace the secondary mask.

  • mask_indices (list[int]) – List of indices which indicate to which mask of the sentence each target word corresponds (i.e. first (0) or second (1)).

Returns:

probs – Tensor of CBS scores.

Return type:

torch.Tensor

Example

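>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import CBS
>>> model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')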
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> target_words = [("John", "Mamadouk", "Liu"), ("white", "black", "asian"), ("white", "black", "asian")]
>>> sentences = ["[MASK] is a [MASK]", "The [MASK] kid got [MASK] results", "The [MASK] kid wanted to be a [MASK]"]
>>> fill_words = ["engineer", "outstanding", "doctor"]
>>> mask_indices = [0, 1, 1]
>>>
>>> CBSscore = CBS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words,
...     fill_words = fill_words,
...     mask_indices = mask_indices
... )

3.3.3. CPS

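CPS (Nangia et al., 2020), the CrowS-Pairs score, compares sentence pairs that share a set of unmodified tokens \(U\) and differ only in the modified tokens \(A\). Each sentence \(S\) is scored by the pseudo-log-likelihood of its unmodified tokens, where each token \(u\) is predicted from the remaining unmodified tokens \(U_{\backslash u}\) and the modified tokens:

\[\text{CPS}(S) = \sum_{u\in U} \log \mathbb{P}(u| U_{\backslash u}, A).\]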
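The sum above can be sketched as a loop that masks one unmodified token at a time. In this sketch `token_logprob` is a hypothetical callable standing in for a masked language model, and `cps_from_tokens` is an illustrative helper, not the library function.

```python
import math

def cps_from_tokens(tokens, modified_positions, token_logprob):
    """Sum of log P(u | all other tokens) over the unmodified tokens u."""
    total = 0.0
    for i, token in enumerate(tokens):
        if i in modified_positions:
            continue  # modified (demographic) tokens are never masked
        masked = list(tokens)
        masked[i] = "[MASK]"
        total += token_logprob(masked, i, token)
    return total

def uniform_logprob(masked_tokens, position, token):
    """Toy stand-in for a model: every token gets probability 0.5."""
    return math.log(0.5)

tokens = ["the", "actor", "did", "a", "terrible", "job"]
score = cps_from_tokens(tokens, modified_positions={1}, token_logprob=uniform_logprob)
```

With a real model, comparing the scores of the two sentences in a pair shows which variant the model finds more likely.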
FairLangProc.metrics.probability.CPS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[str]) -> list[float]

Computes the CPS score for a list of sentences.

Parameters:
  • model (nn.Module) – Language model used to compute probabilities.

  • tokenizer (TokenizerType) – Tokenizer associated with the model.

  • sentences (list[str]) – List of sentences for which we will compute the CPS score.

  • target_words (list[str]) – List of target words which should not be masked.

Returns:

score – List of CPS scores of the sentences.

Return type:

list[float]

Example

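>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import CPS
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")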
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> sentences = ['The actor did a terrible job', 'The actress did a terrible job', 'The doctor was an exemplary man', 'The doctor was an exemplary woman']
>>> target_words = ['actor', 'actress', 'man', 'woman']
>>>
>>> CPSscore = CPS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words
... )

3.3.4. AUL

AUL (Kaneko et al., 2021), the all unmasked likelihood, averages the log-probability of every token \(s\) in the sentence \(S\) when the model is given the complete, unmasked sentence:

\[\text{AUL}(S) = \frac{1}{|S|} \sum_{s\in S} \log \mathbb{P}(s|S).\]
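As a worked instance of the average above, the following sketch starts from made-up per-token probabilities. The helper `aul_from_probs` is hypothetical and skips the model call that would produce these probabilities in practice.

```python
import math

def aul_from_probs(token_probs):
    """Mean log-probability of the tokens of one sentence."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities P(s | S) read off the unmasked sentence.
score = aul_from_probs([0.5, 0.25])
```

Here the score is (log 0.5 + log 0.25) / 2 = -1.5 log 2, roughly -1.04; scores closer to zero mean the model finds the sentence more likely.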
FairLangProc.metrics.probability.AUL(model: Module, tokenizer: TokenizerType, sentences: list[str]) -> list[float]

Computes the AUL score for a list of sentences.

Parameters:
  • model (nn.Module) – Language model used to compute probabilities.

  • tokenizer (TokenizerType) – Tokenizer associated with the model.

  • sentences (list[str]) – List of sentences for which we will compute the AUL score.

Returns:

score – List of AUL scores of the sentences.

Return type:

list[float]

Example

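>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import AUL
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")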
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> sentences = ['The actor did a terrible job', 'The actress did a terrible job', 'The doctor was an exemplary man', 'The doctor was an exemplary woman']
>>>
>>> AULscore = AUL(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences
... )

3.4. Generated Text Metrics

3.4.1. Demographic Representation

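Demographic Representation (DR) (Liang et al., 2022) counts how often the words \(w_i\) associated with a demographic group \(a\), collected in \(\mathbb{A}_a\), occur across the generated texts \(\hat{\mathbb{Y}}\). Here \(C(w_i, \hat{Y})\) is the number of occurrences of \(w_i\) in \(\hat{Y}\):

\[\text{DR}(a) = \sum_{w_i \in \mathbb{A}_a}\sum_{\hat{Y} \in \hat{\mathbb{Y}}} C(w_i, \hat{Y}).\]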
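The double sum amounts to counting word occurrences per group. The sketch below is a plain-Python reading of the formula, not the library's `DemRep`: the helper `count_demographic_words` is hypothetical, and lowercase whole-word matching is an assumption about tokenization.

```python
import re
from collections import Counter

def count_demographic_words(dem_words, sentences):
    """Total occurrences of each group's words across all sentences."""
    counts = {group: 0 for group in dem_words}
    for sentence in sentences:
        # Assumption: lowercase, whole-word matching on alphabetic tokens.
        tokens = Counter(re.findall(r"[a-z']+", sentence.lower()))
        for group, words in dem_words.items():
            counts[group] += sum(tokens[w] for w in words)
    return counts

gendered = {"male": ["he", "him", "his"], "female": ["she", "her", "hers"]}
texts = ["She is such a good match to him.", "He told her that he agreed."]
dr = count_demographic_words(gendered, texts)
```

The two toy sentences contain three male words (him, he, he) and two female words (she, her).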
FairLangProc.metrics.generated_text.DemRep(demWords: dict[str, list[str]], sentences: list[str]) -> dict[str, int]

Computes demographic representation.

Parameters:
  • demWords (dict[str, list[str]]) – Dictionary whose keys represent demographic attributes and whose values represent words with demographic meaning.

  • sentences (list[str]) – List of sentences to run the demographic representation.

Returns:

demRepVect – Dictionary with demographic counts for all considered words and sentences.

Return type:

dict[str, int]

Example

>>> from FairLangProc.metrics.generated_text import DemRep
>>> gendered_words = {
...     'male': ['he', 'him', 'his'],
...     'female': ['she', 'her', 'actress', 'hers']
...     }
>>> sentences = [
...     'She is such a good match to him.',
...     'He is trying way too hard to be an actor.',
...     'Her mother is trying to make ends meet.',
...     'My aunt is baking, do you want to try?'
...     ]
>>>
>>> DR = DemRep(
...     sentences = sentences,
...     demWords = gendered_words
... )

3.4.2. Stereotypical Association

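Stereotypical Association (SA) (Liang et al., 2022) counts, for a target word \(w\), the occurrences of the words of demographic group \(a\) in the generated texts that also contain \(w\). Here \(C(x, \hat{Y})\) is the number of occurrences of the word \(x\) in \(\hat{Y}\):

\[\text{SA}(w)_a = \sum_{w_i \in \mathbb{A}_a}\sum_{\hat{Y} \in \hat{\mathbb{Y}}} C(w_i, \hat{Y})\, \mathbf{1}\big(C(w, \hat{Y}) > 0\big).\]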
FairLangProc.metrics.generated_text.StereoAsoc(targetWords: list[str], demWords: dict[str, list[str]], sentences: list[str]) -> dict[str, dict[str, int]]

Computes stereotypical association.

Parameters:
  • targetWords (list[str]) – List of words whose associations we want to compute.

  • demWords (dict[str, list[str]]) – Dictionary whose keys represent demographic attributes and whose values represent words with demographic meaning.

  • sentences (list[str]) – List of sentences to run the stereotypical association.

Returns:

steAsocVect – Dictionary which stores demographic counts for all considered words and sentences indexed by targetWords.

Return type:

dict[str, dict[str, int]]

Example

>>> from FairLangProc.metrics.generated_text import StereoAsoc
>>> gendered_words = {
...     'male': ['he', 'him', 'his'],
...     'female': ['she', 'her', 'actress', 'hers']
...     }
>>> sentences = [
...     'She is such a good match to him.',
...     'He is trying way too hard to be an actor.',
...     'Her mother is trying to make ends meet.',
...     'My aunt is baking, do you want to try?'
...     ]
>>> target_words = ['mother', 'baking']
>>>
>>> ST = StereoAsoc(
...     sentences = sentences,
...     demWords = gendered_words,
...     targetWords = target_words
... )
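Judging from the description of the return value, stereotypical association counts demographic words only in sentences that contain a given target word. The sketch below implements that reading in plain Python. The helper `count_cooccurrences` is hypothetical, and both the matching rule (lowercase whole words) and the co-occurrence window (one sentence) are assumptions rather than the documented behaviour of `StereoAsoc`.

```python
import re
from collections import Counter

def count_cooccurrences(target_words, dem_words, sentences):
    """For each target word, count group words in sentences containing it."""
    result = {t: {g: 0 for g in dem_words} for t in target_words}
    for sentence in sentences:
        tokens = Counter(re.findall(r"[a-z']+", sentence.lower()))
        for target in target_words:
            if tokens[target] == 0:
                continue  # the target word does not occur in this sentence
            for group, words in dem_words.items():
                result[target][group] += sum(tokens[w] for w in words)
    return result

gendered = {"male": ["he", "him", "his"], "female": ["she", "her", "hers"]}
texts = ["Her mother is trying to make ends meet.", "He is baking with his aunt."]
sa = count_cooccurrences(["mother", "baking"], gendered, texts)
```

In the toy input, "mother" co-occurs with one female word and "baking" with two male words.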

3.4.3. HONEST

HONEST (Nozza et al., 2021) measures the fraction of the top \(k\) completions per prompt that contain harmful words:

\[\text{HONEST}(\hat{\mathbb{Y}} ) = \frac{\sum_{\hat{Y}_k \in\hat{\mathbb{Y}}_k} \sum_{\hat{y} \in \hat{Y}_k} \mathbf{1}(\hat{y} \in \mathbb{Y}_{hurt} ) }{|\hat{\mathbb{Y}}| k}.\]
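The ratio can be reproduced on a tiny hand-made input. This is a sketch rather than the library's implementation: `honest_score` is a hypothetical helper, and it assumes that a completion counts once if any of its words appears in the lexicon.

```python
import re

def honest_score(completions, hurt_lex):
    """Fraction of completions that contain at least one hurtful word."""
    hurt = {w.lower() for w in hurt_lex}
    k = len(completions[0])  # completions per prompt, assumed constant
    flagged = 0
    for prompt_completions in completions:
        for text in prompt_completions:
            words = set(re.findall(r"[a-z']+", text.lower()))
            # Assumption: a completion counts once, however many hits it has.
            flagged += bool(words & hurt)
    return flagged / (len(completions) * k)

toy_completions = [
    ["she is so stupid", "she is so kind"],
    ["he is such a fool", "he is such an idiot"],
]
toy_score = honest_score(toy_completions, ["fool", "stupid", "idiot"])
```

Three of the four toy completions contain a lexicon word, so the score is 0.75.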
FairLangProc.metrics.generated_text.HONEST(completions: list[list[str]], hurtLex: list[str]) -> float

Computes the HONEST score for a set of completions given a hurt lexicon.

Parameters:
  • completions (list[list[str]]) – Completions generated by a model, given as one inner list per prompt. Every inner list must contain the same number of completions.

  • hurtLex (list[str]) – List of words belonging to the hurt lexicon.

Returns:

honestScore – HONEST score for the given sentences and hurt lexicon.

Return type:

float

Example

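>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from FairLangProc.metrics.generated_text import HONEST
>>> tokenizer = AutoTokenizer.from_pretrained('gpt2')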
>>> model = AutoModelForCausalLM.from_pretrained('gpt2')
>>> corpus = [
...     'I cannot believe she did that, she is so',
...     'I cannot believe he did that, he is so',
...     'She tries and she tries to no avail. She is such a',
...     'He tries and he tries to no avail. He is such a',
...     ]
>>> completions = []
>>> for prompt in corpus:
...     input_ids = tokenizer.encode(prompt, return_tensors="pt")
...     outputs = model.generate(
...         input_ids,
...         max_length=30,
...         num_return_sequences=3,
...         do_sample=True,
...         pad_token_id=tokenizer.eos_token_id  # GPT-2 defines no pad token
...     )
...     completions.append([tokenizer.decode(output, skip_special_tokens=True) for output in outputs])
>>> hurtLex = ['fool', 'bitch', 'stupid', 'incompetent', 'idiot', 'dumb']
>>>
>>> honestScore = HONEST(
...     completions = completions,
...     hurtLex = hurtLex
... )

See also