Simple and fun evaluation for LLM tasks
eyeball_pp is a framework for evaluating tasks that use LLMs.
LLMs make it very easy to build intelligent systems, but those systems are probabilistic. You can't say "my system does X"; rather, you say "my system tries to do X and succeeds Y% of the time."
This framework helps you answer questions like: "How's my system performing as I change it?", "Which LLM is best for my specific task?" or "This prompt change looks good on the example I just tested, but does it work on everything else I've tested?"
Your task can be arbitrary; the framework only cares about inputs and outputs.
If you've been eyeballing how well your changes are working, this framework should fit right in and help you evaluate your task in a more methodical manner WITHOUT you having to become methodical.
eyeball_pp is a Python library which can be installed via pip:
pip install eyeball_pp
For an example, see this notebook or check out more examples in the examples folder.
eyeball_pp has 3 simple concepts -- record, evaluate and rerun.
eyeball_pp records the inputs and outputs of your task as you run it and saves them as checkpoints. You can record these locally while developing, or to a db from a production system. You can optionally record human feedback on the task output too.
You can record your task using the record_task decorator. This will record every run of the function as a Checkpoint for future comparison. input_names specifies which inputs to record, and the function's return value is saved as the output.
import eyeball_pp

@eyeball_pp.record_task(input_names=['input_a', 'input_b'])
def your_task_function(input_a, input_b):
    # Your task can run arbitrary code
    ...
    return task_output
If you want to record additional inputs from within your function, just call eyeball_pp.record_input('variable_name', value) inside it.
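For example (a sketch; answer_question and look_up_context are illustrative names, not part of the library):

import eyeball_pp

def look_up_context(question):
    # Hypothetical retrieval step, stubbed out for illustration.
    return "some retrieved context"

@eyeball_pp.record_task(input_names=['question'])
def answer_question(question):
    # `context` is computed inside the task rather than passed in,
    # so record it explicitly onto the current checkpoint.
    context = look_up_context(question)
    eyeball_pp.record_input('context', context)
    return f"answer based on: {context}"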
If your system is more complicated, you can also use the record_input and record_output functions with the start_recording_session context manager:
import eyeball_pp

with eyeball_pp.start_recording_session(task_name="your_task", checkpoint_id="some_custom_unique_id"):
    eyeball_pp.record_input('input_a', input_a_value)
    eyeball_pp.record_input('input_b', input_b_value)
    # your task code
    eyeball_pp.record_output(output)
Or without the context manager:
eyeball_pp.record_input(task_name="your_task", checkpoint_id="some_custom_unique_id", variable_name="input_a", value=input_a_value)
..
eyeball_pp.record_output(task_name="your_task", checkpoint_id="some_custom_unique_id", variable_name='output', output=output)
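For instance, a production handler might tie these calls together by using its request id as the checkpoint_id (a sketch; handle_request and run_your_task are illustrative names, not part of the library):

import eyeball_pp

def run_your_task(input_a_value, input_b_value):
    # Placeholder for your actual task logic.
    return f"output for {input_a_value}, {input_b_value}"

def handle_request(request_id, input_a_value, input_b_value):
    # Use the request id as a stable, unique checkpoint id for this run.
    eyeball_pp.record_input(task_name="your_task", checkpoint_id=request_id, variable_name="input_a", value=input_a_value)
    eyeball_pp.record_input(task_name="your_task", checkpoint_id=request_id, variable_name="input_b", value=input_b_value)
    output = run_your_task(input_a_value, input_b_value)
    eyeball_pp.record_output(task_name="your_task", checkpoint_id=request_id, variable_name='output', output=output)
    return output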
By default, data gets recorded as yaml files in your repo so you can always inspect, change or delete it. If you want to log data from a production system, you can record it in your own db or get an API key from https://eyeball.tark.ai.
Once you have recorded a few runs, you can then evaluate your system.
eyeball_pp tells you how your system is performing by looking at how many runs were successful. If the output of your task can be evaluated objectively, you can supply a custom output_grader; if not, you can just use the built-in model-graded eval, which uses a model to figure out whether your task output meets the objective you set. And if you've been recording human feedback for your task runs, that feedback is used to fine-tune the evaluator model.
# The example below uses the built in model graded eval
eyeball_pp.evaluate_system(task_objective="The task should answer questions based on the context provided and also show the sources")
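If your output can be checked programmatically, you can pass a custom output_grader instead of relying on the model-graded eval. The grader interface below is an assumption, not the library's documented API (check the docs for the real signature); the sketch assumes a callable that receives the task output and returns whether it passed:

# Sketch of a custom grader. The `output_grader` parameter name comes from the
# text above, but the callable's exact signature here is an assumption.
def contains_sources(output) -> bool:
    # Hypothetical objective check: the answer must cite at least one source.
    return "Sources:" in str(output)

eyeball_pp.evaluate_system(output_grader=contains_sources)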
Example output of evaluate_system would look something like this:
Date | Results | Stats
---|---|---
09 Aug | 85.7% success (6.0/7) | 7 datapoints, 3 unique inputs

Run | Results | Stats | Params | Output Variance (higher value ⇒ unpredictable system)
---|---|---|---|---
Rerun on 09 Aug 05:39 | 66.7% success (2/3) | 3 datapoints, 3 unique inputs | temperature=0.2 |
09 Aug 05:30 - 05:39 | 100.0% success (4/4) | 4 datapoints, 3 unique inputs | | 0.0
The evaluation will also write system_health_by_date.md, system_health_by_run_history.md and per_input_breakdown.md files to your repo with the latest system health updates.
You can then re-run these pre-recorded examples as you make changes. Only each unique set of input variables is re-run, e.g. if you recorded a run of your_task_function(1, 2) 5 times and your_task_function(3, 4) 2 times, the re-run would only execute 2 examples -- (1, 2) and (3, 4). Each of these re-runs is saved as a new checkpoint for comparison later.
from eyeball_pp import rerun_recorded_examples

for input_vars in rerun_recorded_examples(input_names=['input_a', 'input_b']):
    your_task_function(input_vars['input_a'], input_vars['input_b'])
You can also re-run the examples with a set of parameters you want to evaluate (e.g. model, temperature etc.) -- these params will show up in the comparison later and help you decide which params result in the best performance on your task.
import eyeball_pp
from eyeball_pp import get_eval_param, rerun_recorded_examples

for vars in rerun_recorded_examples(input_names=['input_a', 'input_b'], eval_params_list=[{'model': 'gpt-4', 'temperature': 0.7}, {'model': 'mpt-30b-chat'}]):
    your_task_function(vars['input_a'], vars['input_b'])

# You can access these eval params from anywhere in your code
@eyeball_pp.record_task(input_names=['input_a', 'input_b'])
def your_task_function(input_a, input_b):
    ...
    model = get_eval_param('model') or 'gpt-3.5-turbo'
    temperature = get_eval_param('temperature') or 0.5
For the record_task decorator you need to ensure that the inputs to the function and its output are JSON serializable. If those variables are custom classes you can define a to_json method on them, e.g.
@eyeball_pp.record_task(args_to_skip=["input_a"])
def your_task_function(input_a, input_b: SomeComplexType) -> SomeComplexOutput:
    ...
    return task_output

class SomeComplexType:
    def to_json(self) -> str:
        ...

class SomeComplexOutput:
    def to_json(self) -> str:
        ...
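For example, a minimal to_json implementation might just serialize the object's fields (the dataclass and field names below are illustrative):

import json
from dataclasses import asdict, dataclass

@dataclass
class SomeComplexType:
    # Illustrative fields -- use whatever your type actually holds.
    query: str
    top_k: int

    def to_json(self) -> str:
        # Return a JSON string so eyeball_pp can record the value.
        return json.dumps(asdict(self))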
You can set the sample rate for recording; by default it's 1.0.
# Set separate config for dev and production
if is_dev_build():
    eyeball_pp.set_config(sample_rate=1.0, dir_path="/path/to/store/recorded/examples")
else:
    # More details on the api_key integration below
    eyeball_pp.set_config(sample_rate=0.1, api_key=<eyeball_pp_apikey>)
By default a unique example_id is created based on the values of the input variables, but that might not always work. In that case you can specify which argument to use as the example id:
@eyeball_pp.record_task(example_id_arg_name='request_id', args_to_skip=['request_id'])
def your_task_function(request_id, input_a, input_b):
    ...
    return task_output
If you record human feedback in your production system, you can log it via the library:
import eyeball_pp
from eyeball_pp import ResponseFeedback
eyeball_pp.record_human_feedback(checkpoint_id, response_feedback=ResponseFeedback.POSITIVE, feedback_details="I liked the response as it selected the sources correctly")
But you can also rate examples yourself
eyeball_pp.rate_examples()
This will let you compare and rate examples yourself via the command line. Example output of this command would look like:
Consider Example(input_a=1, input_b=2):
Does the output `4` fulfil the objective of the task? (Y/n)
Does the output `3` fulfil the objective of the task? (Y/n)
Which response is better?
A) `3` is better than `4`
B) `4` is better than `3`
C) Both are equivalent
eyeball_pp works out of the box locally, but if you want to record information from production you can get an API key from https://eyeball.tark.ai.