Composite evaluator descriptions #3619

Draft · wants to merge 2 commits into main
@@ -0,0 +1,7 @@
| | |
| -- | -- |
| Score range | Integer [0-7], where 0 is the least harmful and 7 the most harmful. A text label is also provided. |
| What is this metric? | Comprehensively measures the severity of content harm in a response across four categories: violence, sexual content, self-harm, and hate and unfairness. |
| How does it work? | The Content Safety evaluator runs the AI-assisted evaluators `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, and `HateUnfairnessEvaluator`, each using a language model as a judge on the response to a user query. See the [definitions and severity scale](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=severity#risk-and-safety-evaluators) for these AI-assisted evaluators. |
| When to use it? | Use it when assessing your model's generated responses for potential content harms in real-world applications. |
| What does it need as input? | Query, Response |
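
For illustration, below is a minimal sketch of calling this composite evaluator through the `azure-ai-evaluation` Python package, assuming `ContentSafetyEvaluator` is constructed with a credential and an Azure AI project dictionary; all identifiers are placeholders, not values from this PR.

```python
# Hypothetical usage sketch; the project identifiers below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ContentSafetyEvaluator

# Points the evaluator at an Azure AI project, which hosts the safety backend.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

content_safety = ContentSafetyEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)

# Evaluates one query/response pair across the four harm categories;
# the result carries a severity score and a text label per category.
result = content_safety(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(result)
```

Per the table above, only `query` and `response` are needed at call time; the project and credential are used to reach the safety evaluation service.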
@@ -0,0 +1,7 @@
| | |
| -- | -- |
| Score range | Float [0-1] for the F1 score evaluator: the higher the score, the more similar the response is to the ground truth. Integer [1-5] for the AI-assisted quality evaluators in question-and-answering (QA) scenarios, where 1 is the lowest quality and 5 the highest. |
| What is this metric? | Comprehensively measures the groundedness, relevance, coherence, and fluency of a response in QA scenarios, as well as the textual similarity between the response and its ground truth. |
| How does it work? | The QA evaluator runs prompt-based, AI-assisted evaluators that use a language model as a judge on the response to a user query: `GroundednessEvaluator` (requires input `context`), `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, and `SimilarityEvaluator` (requires input `ground_truth`). It also includes the Natural Language Processing (NLP) metric `F1ScoreEvaluator`, which computes the F1 score over tokens shared between the response and its ground truth. See the [definitions and scoring rubrics](https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#generation-quality-metrics) for these AI-assisted evaluators and the F1 score evaluator. |
| When to use it? | Use it when assessing the overall quality of your model's generated responses to questions in real-world applications. |
| What does it need as input? | Query, Response, Context, Ground Truth |
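
Likewise, a minimal sketch of invoking this composite evaluator, assuming `QAEvaluator` is constructed with an Azure OpenAI model configuration for the judge model; the endpoint, key, and deployment values are placeholders.

```python
# Hypothetical usage sketch; the model configuration values are placeholders
# for the Azure OpenAI deployment that serves as the LLM judge.
from azure.ai.evaluation import QAEvaluator

model_config = {
    "azure_endpoint": "<azure-openai-endpoint>",
    "api_key": "<api-key>",
    "azure_deployment": "<deployment-name>",
}

qa_evaluator = QAEvaluator(model_config=model_config)

# Fans out to the AI-assisted sub-evaluators (groundedness, relevance,
# coherence, fluency, similarity) and the NLP F1 score evaluator,
# returning one score per sub-evaluator.
result = qa_evaluator(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
    context="France's capital city is Paris.",
    ground_truth="Paris",
)
print(result)
```

Per the table above, `context` feeds only the groundedness sub-evaluator and `ground_truth` only the similarity and F1 score sub-evaluators.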