Adding CLP content to Responsible AI docs
PiperOrigin-RevId: 447855077
anirudh161 authored and Responsible ML Infra Team committed May 10, 2022
1 parent 364830d commit 4f78c62
Showing 1 changed file with 51 additions and 27 deletions.
g3doc/guide/guidance.md: 51 additions & 27 deletions
@@ -1,8 +1,4 @@
# Fairness Indicators: Thinking about Fairness Evaluation

Fairness Indicators is a useful tool for evaluating _binary_ and _multi-class_
classifiers for fairness. Eventually, we hope to expand this tool, in
@@ -19,13 +15,13 @@ human societies are extremely complex! Understanding people, and their social
identities, social structures and cultural systems are each huge fields of open
research in their own right. Throw in the complexities of cross-cultural
differences around the globe, and getting even a foothold on understanding
societal impact can be challenging. Whenever possible, it is recommended that you
consult with appropriate domain experts, which may include social scientists,
sociolinguists, and cultural anthropologists, as well as with members of the
populations on which technology will be deployed.

A single model, for example, the toxicity model that we leverage in the
[example colab](https://www.tensorflow.org/responsible_ai/fairness_indicators/tutorials/Fairness_Indicators_Example_Colab),
can be used in many different contexts. A toxicity model deployed on a website
to filter offensive comments, for example, is a very different use case than the
model being deployed in an example web UI where users can type in a sentence and
@@ -36,17 +32,17 @@ concerns.

The questions above form the foundation of the ethical considerations, including
fairness, that you may want to take into account when designing and developing your
ML-based product. These questions also motivate which metrics and which groups
of users you should use the tool to evaluate.

Before diving in further, here are three recommended resources for getting
started:

* **[The People + AI Guidebook](https://pair.withgoogle.com/) for
Human-centered AI design:** This guidebook is a great resource for the
questions and aspects to keep in mind when designing a machine-learning-based
product. While we created this guidebook with designers in mind, many
of the principles will help answer questions like the one posed above.
* **[Our Fairness Lessons Learned](https://www.youtube.com/watch?v=6CwzDoE8J4M):**
This talk at Google I/O discusses lessons we have learned in our effort to
build and design inclusive products.
@@ -63,7 +59,7 @@ and harm for users.

The sections below walk through some of the aspects to consider.

## Which groups should I slice by?

In general, a good practice is to slice by as many groups as may be affected by
your product, since you never know when performance might differ for one of the
@@ -140,7 +136,7 @@ have different experiences? What does that mean for slices you should evaluate?
Collecting feedback from diverse users may also highlight potential slices to
prioritize.
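
Once you have a candidate set of slices, they can be declared when you run the
evaluation. Below is a minimal sketch of how slices might be configured when
running Fairness Indicators through TensorFlow Model Analysis; the feature
names (`gender`, `age_group`), the label key, and the thresholds are
illustrative assumptions rather than recommendations.

```python
# Sketch: declaring evaluation slices for Fairness Indicators via
# TensorFlow Model Analysis. Feature names and thresholds are illustrative.
import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='FairnessIndicators',
                config='{"thresholds": [0.25, 0.5, 0.75]}'),
        ]),
    ],
    slicing_specs=[
        tfma.SlicingSpec(),  # overall (unsliced) metrics for comparison
        tfma.SlicingSpec(feature_keys=['gender']),  # one slice per gender value
        tfma.SlicingSpec(feature_keys=['age_group']),  # one slice per age bucket
        tfma.SlicingSpec(feature_keys=['gender', 'age_group']),  # intersections
    ],
)
```

Note that intersectional slices (the last spec above) typically contain fewer
examples, so their metrics will be noisier.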

## Which metrics should I choose?

When selecting which metrics to evaluate for your system, consider who will be
experiencing your model, how it will be experienced, and the effects of that
@@ -161,7 +157,7 @@ then consider reporting (for each subgroup) the rate at which that label is
predicted. For example, a “good” label would be a label whose prediction grants
a person access to some resource, or enables them to perform some action.
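
As a concrete illustration of reporting that rate, the following sketch
computes the per-subgroup positive prediction rate directly from model scores;
the column names, scores, and the 0.5 decision threshold are illustrative
assumptions.

```python
# Sketch: fraction of examples predicted as the "good" (positive) label,
# reported separately for each subgroup. Data and threshold are illustrative.
import pandas as pd

df = pd.DataFrame({
    'group':      ['a', 'a', 'a', 'b', 'b', 'b'],
    'prediction': [0.9, 0.2, 0.7, 0.4, 0.3, 0.8],  # model scores in [0, 1]
})

positive_rate = (
    (df['prediction'] >= 0.5)       # thresholded predictions
    .groupby(df['group'])
    .mean()                         # mean of booleans = positive rate per group
)
print(positive_rate)  # a: 0.667, b: 0.333 -- a noticeable gap to investigate
```
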

## Critical fairness metrics for classification

When thinking about a classification model, think about the effects of _errors_
(the differences between the actual “ground truth” label, and the label from the
@@ -176,13 +172,13 @@ when different metrics might be most appropriate.**

**Metrics available today in Fairness Indicators**

Note: There are many valuable fairness metrics that are not currently supported
in the Fairness Indicators beta. As we continue to add more metrics, we will
continue to add guidance for them here. Below, you can access
instructions to add your own metrics to Fairness Indicators. Additionally,
please reach out to [[email protected]](mailto:[email protected]) if there are
metrics that you would like to see. We hope to partner with you to build this
out further.

**Positive Rate / Negative Rate**

@@ -224,8 +220,8 @@
These are also important for Facial Analysis Technologies such as face
detection or face attributes

Note: When both “positive” and “negative” mistakes are equally important, the
metric is called “equality of
<span style="text-decoration:underline;">odds</span>”. This can be measured by
evaluating and aiming for equality across both the TNR & FNR, or both the TPR &
FPR. For example, an app that counts how many cars go past a stop sign is
@@ -264,12 +260,40 @@ false positive) or accidentally excludes a car (a false negative).
Cases where the fraction of correct negative predictions should be equal
across subgroups

Note: When used together, False Discovery Rate and False Omission Rate relate to
Conditional Use Accuracy Equality, when FDR and FOR are both equal across
subgroups. FDR and FOR are also similar to FPR and FNR, where FDR/FOR compare
FP/FN to predicted negative/positive data points, and FPR/FNR compare FP/FN to
ground truth negative/positive data points. FDR/FOR can be used instead of
FPR/FNR when predictive parity is more critical than equality of opportunity.
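
To make that normalization difference concrete, here is a small sketch that
computes FPR/FNR (normalized by ground-truth negatives/positives) and FDR/FOR
(normalized by predicted positives/negatives) for each subgroup; the DataFrame
columns and values are illustrative assumptions.

```python
# Sketch: FPR/FNR vs. FDR/FOR per subgroup from binary labels and predictions.
# Column names and data are illustrative.
import pandas as pd

def error_rates(g):
    y, p = g['label'], g['pred']
    tp = ((p == 1) & (y == 1)).sum()
    fp = ((p == 1) & (y == 0)).sum()
    tn = ((p == 0) & (y == 0)).sum()
    fn = ((p == 0) & (y == 1)).sum()
    return pd.Series({
        'FPR': fp / (fp + tn),  # false positives / ground-truth negatives
        'FNR': fn / (fn + tp),  # false negatives / ground-truth positives
        'FDR': fp / (fp + tp),  # false positives / predicted positives
        'FOR': fn / (fn + tn),  # false negatives / predicted negatives
    })

df = pd.DataFrame({
    'group': ['a'] * 4 + ['b'] * 4,
    'label': [1, 0, 1, 0, 1, 0, 1, 0],
    'pred':  [1, 0, 0, 1, 1, 1, 0, 0],
})
print(df.groupby('group')[['label', 'pred']].apply(error_rates))
```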

**Overall Flip Rate / Positive to Negative Prediction Flip Rate / Negative to
Positive Prediction Flip Rate**

* *<span style="text-decoration:underline;">Definition:</span>* The
probability that the classifier would give a different prediction if the identity
attribute in a given example were changed.
* *<span style="text-decoration:underline;">Relates to:</span>* Counterfactual
fairness
* *<span style="text-decoration:underline;">When to use this metric:</span>*
When determining whether the model’s prediction changes when the sensitive
attribute referenced in the example is removed or replaced. If it does,
consider using the Counterfactual Logit Pairing technique within the
TensorFlow Model Remediation library.

**Flip Count / Positive to Negative Prediction Flip Count / Negative to Positive
Prediction Flip Count**

* *<span style="text-decoration:underline;">Definition:</span>* The number of
times the classifier would give a different prediction if the identity term in a
given example were changed.
* *<span style="text-decoration:underline;">Relates to:</span>* Counterfactual
fairness
* *<span style="text-decoration:underline;">When to use this metric:</span>*
When determining whether the model’s prediction changes when the sensitive
attribute referenced in the example is removed or replaced. If it does,
consider using the Counterfactual Logit Pairing technique within the
TensorFlow Model Remediation library (see the sketch below).
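
To illustrate how these counterfactual metrics are tallied, here is a minimal
sketch that compares paired predictions on original examples and their
counterfactual counterparts (the same examples with the identity term removed
or swapped); the helper and the example predictions are illustrative and not
part of any library API.

```python
# Sketch: flip count and flip rate from paired binary predictions on
# original vs. counterfactual examples. Values below are illustrative.

def flip_metrics(original_preds, counterfactual_preds):
    """Tallies prediction flips between paired original/counterfactual examples."""
    pairs = list(zip(original_preds, counterfactual_preds))
    pos_to_neg = sum(1 for o, c in pairs if o == 1 and c == 0)
    neg_to_pos = sum(1 for o, c in pairs if o == 0 and c == 1)
    total = len(pairs)
    return {
        'flip_count': pos_to_neg + neg_to_pos,
        'positive_to_negative_flip_count': pos_to_neg,
        'negative_to_positive_flip_count': neg_to_pos,
        'flip_rate': (pos_to_neg + neg_to_pos) / total,
    }

# Predictions for five sentences, before and after swapping an identity term.
original_preds = [1, 0, 1, 1, 0]
counterfactual_preds = [0, 0, 1, 1, 1]
print(flip_metrics(original_preds, counterfactual_preds))
# {'flip_count': 2, 'positive_to_negative_flip_count': 1,
#  'negative_to_positive_flip_count': 1, 'flip_rate': 0.4}
```

A non-trivial flip rate for a slice suggests the model is sensitive to the
identity term itself, which is the signal that Counterfactual Logit Pairing
remediation targets.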

**Examples of which metrics to select**

@@ -294,7 +318,7 @@ Follow the documentation
[here](https://github.com/tensorflow/model-analysis/blob/master/g3doc/post_export_metrics.md)
to add your own custom metric.

## Final notes

**A gap in a metric between two groups can be a sign that your model may have
unfair skews**. You should interpret your results according to your use case.
