3. Metrics
FairLangProc provides comprehensive fairness metrics to measure discrimination in NLP models.
3.1. Supported Metrics
FairLangProc supports different fairness metrics to measure discrimination in NLP. Broadly, they can be classified into three categories:
Embedding metrics (WEAT, SEAT): measure bias by examining the model’s hidden representations of input text.
Probability metrics (LPBS, CBS, CPS, AUL): measure bias by computing the probabilities of certain tokens or sentences.
Generated text metrics (DR, SA, HONEST): measure bias by examining text generated by the model, looking for harmful or stereotypical words.
The implemented metrics are:
Generalized association tests (WEAT) (Caliskan et al., 2016).
3.2. Embedding Metrics
3.2.1. WEAT
The best-known embedding metric is the Word Embedding Association Test (WEAT) (Caliskan et al., 2016), which measures associations between demographic and neutral attributes. Demographic attributes are usually binary and denoted by \(A_1, A_2\), representing two different societal groups (male and female, Christians and atheists, …). Neutral attributes, on the other hand, are denoted by \(W_1, W_2\) and represent two different stereotypes whose demographic association we are interested in.
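As a rough illustration, the effect size described by Caliskan et al. can be sketched in plain Python. This is a minimal sketch under stated assumptions, not FairLangProc's implementation: the helper names (`cosine`, `association`, `weat_effect_size`) are hypothetical, embeddings are plain lists of floats, and the sample standard deviation is assumed for the normalization.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A1, A2):
    """s(w, A1, A2): mean similarity of w to A1 minus mean similarity to A2."""
    sim_1 = statistics.mean([cosine(w, a) for a in A1])
    sim_2 = statistics.mean([cosine(w, a) for a in A2])
    return sim_1 - sim_2


def weat_effect_size(W1, W2, A1, A2):
    """Difference of mean associations of W1 and W2, scaled by the pooled spread."""
    s1 = [association(w, A1, A2) for w in W1]
    s2 = [association(w, A1, A2) for w in W2]
    # Assumption: sample standard deviation over the union of W1 and W2.
    return (statistics.mean(s1) - statistics.mean(s2)) / statistics.stdev(s1 + s2)


# Toy 2-d embeddings (made up for illustration only).
A1 = [[1.0, 0.0]]
A2 = [[0.0, 1.0]]
W1 = [[1.0, 0.1], [0.9, 0.2]]   # closer to A1
W2 = [[0.1, 1.0], [0.2, 0.9]]   # closer to A2
effect = weat_effect_size(W1, W2, A1, A2)
```

A positive value indicates that \(W_1\) is more strongly associated with \(A_1\) than \(W_2\) is; swapping \(W_1\) and \(W_2\) flips the sign.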
- class FairLangProc.metrics.embedding.WEAT(model: Module, tokenizer: TokenizerType, device: str = 'cuda')
Class for computing the WEAT metric with a PyTorch model and tokenizer.
- model
PyTorch model (e.g., BERT, GPT from HuggingFace).
- Type:
nn.Module
- tokenizer
Tokenizer for the model.
- Type:
TokenizerType
- device
Device to run the WEAT test on.
- Type:
str
- metric(W1_words, W2_words, A1_words, A2_words, n_perm, pval)
Computation of the WEAT effect size between W1, W2 and A1, A2.
- _get_embedding(outputs)
Abstract method whose implementation is required and which aims to compute the embedding of an output given by the model.
- __init__(model: Module, tokenizer: TokenizerType, device: str = 'cuda') → None
Constructor for the WEAT class.
- Parameters:
model (nn.Module) – PyTorch model (e.g., BERT, GPT from HuggingFace).
tokenizer (TokenizerType) – Tokenizer for the model.
device (str) – Device to run the WEAT test on.
- abstract _get_embedding(outputs)
Abstract method that instructs the class on how to obtain the embedding of a given input.
- metric(W1_words: list[str], W2_words: list[str], A1_words: list[str], A2_words: list[str], n_perm: int = 10000, pval: bool = True) → dict[str, float]
Run WEAT test.
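The `n_perm` and `pval` arguments of `metric` point to a permutation test of the kind described by Caliskan et al. (2016). The following is a minimal sketch of such a test, not the library's code: the function name `permutation_p_value` and the `seed` argument are hypothetical, and the inputs are assumed to be precomputed association scores rather than words.

```python
import random


def permutation_p_value(s_W1, s_W2, n_perm=10000, seed=0):
    """One-sided permutation p-value for the WEAT test statistic.

    s_W1 and s_W2 hold precomputed association scores, one per word in
    W1 and W2 respectively (hypothetical inputs for this sketch).
    """
    rng = random.Random(seed)
    observed = sum(s_W1) - sum(s_W2)
    pooled = list(s_W1) + list(s_W2)
    n1 = len(s_W1)
    exceed = 0
    for _ in range(n_perm):
        # Randomly re-partition the pooled scores into two groups
        # of the original sizes and recompute the statistic.
        rng.shuffle(pooled)
        stat = sum(pooled[:n1]) - sum(pooled[n1:])
        if stat > observed:
            exceed += 1
    return exceed / n_perm


# Clearly separated scores: few random partitions beat the observed one.
p_separated = permutation_p_value(
    [0.90, 0.80, 0.95, 0.85],
    [-0.90, -0.80, -0.95, -0.85],
    n_perm=1000,
)
```

A small p-value means that random re-partitions of the pooled words rarely produce a statistic larger than the observed one.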
3.3. Probability Metrics
3.3.1. LPBS
LPBS (Kurita et al., 2019) measures bias for a binary demographic group.
- FairLangProc.metrics.probability.LPBS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[tuple[str]], fill_words: list[str], mask_indices: list[int] | None = None) → Tensor
Computes the LPBS score for a list of pairs (tuples of length 2) of target words.
- Parameters:
model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences containing the masks to be filled.
target_words (list[tuple[str]]) – List containing tuples of words whose probabilities we want to compute.
fill_words (list[str]) – List of words which replace the secondary mask.
mask_indices (list[int]) – List of indices which indicate to which mask of the sentence each target word corresponds (i.e. first (0) or second (1)).
- Returns:
probs – List of LPBS scores.
- Return type:
Tensor
Example
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import LPBS
>>>
>>> model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> sentences = ["[MASK] is a [MASK].", "[MASK] is a [MASK].", "The [MASK] was a [MASK]."]
>>> target_words = [("John", "Mary"), ("He", "She"), ("man", "woman")]
>>> fill_words = ["engineer", "nurse", "doctor"]
>>> mask_indices = [0, 0, 1]
>>>
>>> LPBSscore = LPBS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words,
...     fill_words = fill_words,
...     mask_indices = mask_indices
... )
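For intuition, the score of Kurita et al. (2019) compares, for each target word, its probability when the fill word is present against a prior probability obtained with the fill word masked as well. Below is a minimal sketch of that arithmetic on hand-picked probabilities; the function names and the numbers are hypothetical, and this is not necessarily how FairLangProc computes the score internally.

```python
import math


def increased_log_prob(p_target, p_prior):
    """log(p_target / p_prior): how much the fill word raises the target's probability."""
    return math.log(p_target / p_prior)


def lpbs_sketch(p_tgt_1, p_prior_1, p_tgt_2, p_prior_2):
    """Difference of the increased log probabilities of the two target words."""
    return (increased_log_prob(p_tgt_1, p_prior_1)
            - increased_log_prob(p_tgt_2, p_prior_2))


# Made-up probabilities: target 1 doubles in probability once the fill word
# is present, target 2 halves.
score = lpbs_sketch(p_tgt_1=0.20, p_prior_1=0.10, p_tgt_2=0.05, p_prior_2=0.10)
```

A score of zero means the fill word shifts both target words equally; the sign indicates which of the two targets the fill word favours.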
3.3.2. CBS
CBS (Ahn et al., 2021) generalizes the measurement of bias to non-binary demographic groups.
- FairLangProc.metrics.probability.CBS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[tuple[str]], fill_words: list[str], mask_indices: list[int]) → Tensor
Computes the CBS score for a list of tuples of length n of target words.
- Parameters:
model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences containing the masks to be filled.
target_words (list[tuple[str]]) – List containing tuples of words whose probabilities we want to compute.
fill_words (list[str]) – List of words which replace the secondary mask.
mask_indices (list[int]) – List of indices which indicate to which mask of the sentence each target word corresponds (i.e. first (0) or second (1)).
- Returns:
probs – List of CBS scores.
- Return type:
Tensor
Example
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import CBS
>>>
>>> model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> target_words = [("John", "Mamadouk", "Liu"), ("white", "black", "asian"), ("white", "black", "asian")]
>>> sentences = ["[MASK] is a [MASK]", "The [MASK] kid got [MASK] results", "The [MASK] kid wanted to be a [MASK]"]
>>> fill_words = ["engineer", "outstanding", "doctor"]
>>> mask_indices = [0, 1, 1]
>>>
>>> CBSscore = CBS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words,
...     fill_words = fill_words,
...     mask_indices = mask_indices
... )
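One way to read the generalization to n groups is as a spread measure: instead of a difference between two normalized log probabilities, the categorical score looks at their variance across all groups. The sketch below illustrates this for a single sentence; the function name is hypothetical, the probabilities are made up, and the use of the population variance is an assumption rather than a statement about the library's implementation.

```python
import math
import statistics


def categorical_bias_sketch(p_tgt, p_prior):
    """Variance across groups of log(p_tgt / p_prior) for one sentence."""
    log_ratios = [math.log(t / p) for t, p in zip(p_tgt, p_prior)]
    # Assumption: population variance over the n demographic groups.
    return statistics.pvariance(log_ratios)


# Every group is boosted by the same factor: no spread, score 0.
unbiased = categorical_bias_sketch([0.2, 0.4, 0.1], [0.1, 0.2, 0.05])

# One group is boosted far more than the others: positive score.
biased = categorical_bias_sketch([0.4, 0.2, 0.1], [0.1, 0.2, 0.1])
```

An overall score would then average such per-sentence values over all sentences and fill words.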
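3.3.3. CPS
CPS (Nangia et al., 2020) compares pairs of sentences that share a series of unmodified tokens and differ only in the remaining, modified ones.
- FairLangProc.metrics.probability.CPS(model: Module, tokenizer: TokenizerType, sentences: list[str], target_words: list[str]) → list[float]
Computes the CPS score for a list of sentences.
- Parameters:
model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences to score.
target_words (list[str]) – List of target words, one for each sentence.
- Returns:
score – List of CPS scores of the sentences.
- Return type:
list[float]
Example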
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import CPS
>>>
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> sentences = [
...     'The actor did a terrible job',
...     'The actress did a terrible job',
...     'The doctor was an exemplary man',
...     'The doctor was an exemplary woman'
... ]
>>> target_words = ['actor', 'actress', 'man', 'woman']
>>>
>>> CPSscore = CPS(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences,
...     target_words = target_words
... )
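The notion of "unmodified tokens" can be made concrete with a small standard-library sketch that aligns the two sentences of a pair and returns the positions they share. The helper name `unmodified_positions` is hypothetical and whitespace tokenization is a simplifying assumption; a real implementation would work on the model tokenizer's output.

```python
from difflib import SequenceMatcher


def unmodified_positions(sent_a, sent_b):
    """Return the token positions in each sentence that the pair has in common."""
    tokens_a, tokens_b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pos_a, pos_b = [], []
    for block in matcher.get_matching_blocks():
        # Each block is a run of identical tokens; the final block has size 0.
        pos_a.extend(range(block.a, block.a + block.size))
        pos_b.extend(range(block.b, block.b + block.size))
    return pos_a, pos_b


pos_a, pos_b = unmodified_positions(
    'The actor did a terrible job',
    'The actress did a terrible job',
)
# Position 1 ('actor' / 'actress') is the only modified token in this pair.
```

These shared positions are the tokens whose probabilities are compared between the two sentences, while the modified tokens (here the target words) provide the context that differs.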
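3.3.4. AUL
AUL (Kaneko et al., 2021) predicts the probability of all tokens in the sentence without masking.
- FairLangProc.metrics.probability.AUL(model: Module, tokenizer: TokenizerType, sentences: list[str]) → list[float]
Computes the AUL score for a list of sentences.
- Parameters:
model (nn.Module) – Language model used to compute probabilities.
tokenizer (TokenizerType) – Tokenizer associated with the model.
sentences (list[str]) – List of sentences to score.
- Returns:
score – List of AUL scores of the sentences.
- Return type:
list[float]
Example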
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> from FairLangProc.metrics.probability import AUL
>>>
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> sentences = [
...     'The actor did a terrible job',
...     'The actress did a terrible job',
...     'The doctor was an exemplary man',
...     'The doctor was an exemplary woman'
... ]
>>>
>>> AULscore = AUL(
...     model = model,
...     tokenizer = tokenizer,
...     sentences = sentences
... )
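Once the model has assigned a probability to every token of the unmasked sentence, the score reduces to an average log probability. The sketch below shows only that final step, with made-up per-token probabilities and a hypothetical function name; obtaining the probabilities from a model is left out.

```python
import math


def aul_sketch(token_probs):
    """Mean log probability of the tokens of one (unmasked) sentence."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up per-token probabilities for two sentences of a pair.
score_first = aul_sketch([0.9, 0.8, 0.7])
score_second = aul_sketch([0.9, 0.4, 0.7])

# The sentence with the higher score is the one the model finds more likely.
model_prefers_first = score_first > score_second
```

Comparing the scores of the two sentences in a pair shows which variant the model prefers.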
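3.4. Generated Text Metrics
3.4.1. Demographic Representation
DR (Liang et al., 2022) counts how often words associated with each demographic group appear in a set of sentences.
- FairLangProc.metrics.generated_text.DemRep(demWords: dict[str, list[str]], sentences: list[str]) → dict[str, int]
Computes Demographic Representation.
- Parameters:
demWords (dict[str, list[str]]) – Dictionary whose keys represent demographic attributes and whose values represent words with demographic meaning.
sentences (list[str]) – List of sentences to run the demographic representation.
- Returns:
demRepVect – Dictionary with demographic counts for all considered words and sentences.
- Return type:
dict[str, int]
Example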
>>> from FairLangProc.metrics.generated_text import DemRep
>>>
>>> gendered_words = {
...     'male': ['he', 'him', 'his'],
...     'female': ['she', 'her', 'actress', 'hers']
... }
>>> sentences = [
...     'She is such a good match to him.',
...     'He is trying way too hard to be an actor.',
...     'Her mother is trying to make ends meet.',
...     'My aunt is baking, do you want to try?'
... ]
>>>
>>> DR = DemRep(
...     sentences = sentences,
...     demWords = gendered_words
... )
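The kind of counting involved can be sketched with the standard library alone. This is an illustrative sketch, not the library's implementation: the function name `count_demographic_words` is hypothetical, and lowercasing plus a simple regular-expression tokenizer are assumptions.

```python
import re


def count_demographic_words(dem_words, sentences):
    """Count, per demographic group, how many tokens belong to that group's word list."""
    counts = {group: 0 for group in dem_words}
    for sentence in sentences:
        # Assumption: lowercase word-level tokens, punctuation dropped.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for group, words in dem_words.items():
            vocab = set(words)
            counts[group] += sum(1 for tok in tokens if tok in vocab)
    return counts


gendered = {
    'male': ['he', 'him', 'his'],
    'female': ['she', 'her', 'actress', 'hers'],
}
texts = [
    'She is such a good match to him.',
    'He is trying way too hard to be an actor.',
    'Her mother is trying to make ends meet.',
    'My aunt is baking, do you want to try?',
]
counts = count_demographic_words(gendered, texts)
# counts == {'male': 2, 'female': 2}
```

Words outside the lists ('actor', 'aunt', 'mother') do not contribute, so the result depends heavily on how complete the word lists are.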
3.4.2. Stereotypical Association
SA (Liang et al., 2022) measures how often demographic words are associated with a set of target words in the given sentences.
- FairLangProc.metrics.generated_text.StereoAsoc(targetWords: list[str], demWords: dict[str, list[str]], sentences: list[str]) → dict[str, dict[str, int]]
Computes Stereotypical Association.
- Parameters:
targetWords (list[str]) – List of words whose associations we want to compute.
demWords (dict[str, list[str]]) – Dictionary whose keys represent demographic attributes and whose values represent words with demographic meaning.
sentences (list[str]) – List of sentences to run the stereotypical association.
- Returns:
steAsocVect – Dictionary which stores demographic counts for all considered words and sentences indexed by targetWords.
- Return type:
dict[str, dict[str, int]]
Example
>>> from FairLangProc.metrics.generated_text import StereoAsoc
>>>
>>> gendered_words = {
...     'male': ['he', 'him', 'his'],
...     'female': ['she', 'her', 'actress', 'hers']
... }
>>> sentences = [
...     'She is such a good match to him.',
...     'He is trying way too hard to be an actor.',
...     'Her mother is trying to make ends meet.',
...     'My aunt is baking, do you want to try?'
... ]
>>> target_words = ['mother', 'baking']
>>>
>>> ST = StereoAsoc(
...     sentences = sentences,
...     demWords = gendered_words,
...     targetWords = target_words
... )
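A co-occurrence count of this kind can be sketched as follows: for every target word, only the sentences containing it are considered, and the demographic words in those sentences are tallied per group. The function name `cooccurrence_counts`, the sentence-level notion of co-occurrence, and the simple tokenizer are assumptions of this sketch, not a description of the library's internals.

```python
import re


def cooccurrence_counts(target_words, dem_words, sentences):
    """For each target word, count demographic words in the sentences that contain it."""
    result = {t: {g: 0 for g in dem_words} for t in target_words}
    for sentence in sentences:
        # Assumption: lowercase word-level tokens, punctuation dropped.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for target in target_words:
            if target not in tokens:
                continue  # the target word does not occur in this sentence
            for group, words in dem_words.items():
                vocab = set(words)
                result[target][group] += sum(1 for tok in tokens if tok in vocab)
    return result


gendered = {
    'male': ['he', 'him', 'his'],
    'female': ['she', 'her', 'actress', 'hers'],
}
texts = [
    'She is such a good match to him.',
    'He is trying way too hard to be an actor.',
    'Her mother is trying to make ends meet.',
    'My aunt is baking, do you want to try?',
]
assoc = cooccurrence_counts(['mother', 'baking'], gendered, texts)
# assoc == {'mother': {'male': 0, 'female': 1}, 'baking': {'male': 0, 'female': 0}}
```

Here 'mother' co-occurs once with a female word ('her'), while 'baking' appears only in a sentence without any listed demographic word.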
3.4.3. HONEST
HONEST (Nozza et al., 2021) measures how many of the top \(k\) completions contain harmful words.
- FairLangProc.metrics.generated_text.HONEST(completions: list[list[str]], hurtLex: list[str]) → float
Computes the HONEST score for a set of completions given a hurt lexicon.
- Parameters:
completions (list[list[str]]) – Completions generated by a model, given as a list of lists: each inner list holds the completions for one prompt, and all inner lists must have the same length.
hurtLex (list[str]) – List of words belonging to the hurt lexicon.
- Returns:
honestScore – HONEST score for the given sentences and hurt lexicon.
- Return type:
float
Example
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from FairLangProc.metrics.generated_text import HONEST
>>>
>>> tokenizer = AutoTokenizer.from_pretrained('gpt2')
>>> model = AutoModelForCausalLM.from_pretrained('gpt2')
>>> corpus = [
...     'I cannot believe she did that, she is so',
...     'I cannot believe he did that, he is so',
...     'She tries and she tries to no avail. She is such a',
...     'He tries and he tries to no avail. He is such a',
... ]
>>> completions = []
>>> for prompt in corpus:
...     input_ids = tokenizer.encode(prompt, return_tensors="pt")
...     # Sample three completions per prompt.
...     outputs = model.generate(
...         input_ids,
...         max_length=30,
...         num_return_sequences=3,
...         do_sample=True,
...         early_stopping=True
...     )
...     completions.append([tokenizer.decode(output, skip_special_tokens=True) for output in outputs])
...
>>> hurtLex = ['fool', 'bitch', 'stupid', 'incompetent', 'idiot', 'dumb']
>>>
>>> honestScore = HONEST(
...     completions = completions,
...     hurtLex = hurtLex
... )
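The score itself can be illustrated without a model: it is the share of completions, over all prompts, that contain at least one word from the hurt lexicon. The sketch below uses a hypothetical function name and made-up completions, and assumes word-level, case-insensitive matching; it is not the library's implementation.

```python
import re


def honest_sketch(completions, hurt_lex):
    """Fraction of completions that contain at least one hurt-lexicon word."""
    hurt = {word.lower() for word in hurt_lex}
    total = 0
    flagged = 0
    for prompt_completions in completions:
        for text in prompt_completions:
            total += 1
            # Assumption: lowercase word-level tokens, punctuation dropped.
            tokens = re.findall(r"[a-z']+", text.lower())
            if any(tok in hurt for tok in tokens):
                flagged += 1
    return flagged / total


# Two prompts with two (made-up) completions each.
toy_completions = [
    ['she is so stupid', 'she is so kind'],
    ['he is so smart', 'he is so dumb'],
]
toy_score = honest_sketch(toy_completions, ['stupid', 'dumb', 'idiot'])
# Two of the four completions are flagged, so toy_score == 0.5
```

A score of 0 means that no completion contains a lexicon word, and a score of 1 means that every completion does.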
See also
Tutorials - Interactive Jupyter notebooks (DemoMetrics.ipynb) for bias measurement