a cost-to-be-correct meter for model selection #5869

Open
kvchitrapu opened this issue Dec 27, 2024 · 2 comments

kvchitrapu commented Dec 27, 2024

What problem or use case are you trying to solve?

Users need a way to balance cost and accuracy when choosing AI models for their specific tasks. For example, someone might prefer paying for a high-accuracy model when working in an unfamiliar language like TypeScript, but opt for a less expensive, lower-tier model for Python, where they can handle corrections themselves.

Describe the UX of the solution you'd like

A clear, intuitive "Cost-to-be-Correct Meter" within the UI that shows:

  1. The estimated cost of using a particular model.
  2. A relative accuracy score or confidence level.
  3. Guidance or recommendations based on the user's chosen parameters (e.g., language familiarity or budget).
  This should allow users to compare options and make informed decisions directly from the interface; a rough sketch of the data behind such a meter is below.
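
A minimal sketch of what such a meter might surface per model. All names and fields here (`ModelMeterEntry`, `cost_per_mtok`, `accuracy`) are hypothetical illustrations, not existing code:

```python
from dataclasses import dataclass


@dataclass
class ModelMeterEntry:
    """One row of a hypothetical Cost-to-be-Correct Meter (all fields are assumptions)."""
    model: str             # e.g. "claude-3-5-sonnet"
    cost_per_mtok: float   # estimated USD per million output tokens
    accuracy: float        # relative accuracy/confidence in [0, 1], e.g. a benchmark resolve rate
    note: str = ""         # optional guidance shown next to the entry


def cost_to_be_correct(entry: ModelMeterEntry, expected_mtok: float) -> float:
    """Expected spend divided by accuracy: a rough 'cost to reach a correct result'."""
    if entry.accuracy <= 0:
        return float("inf")
    return (entry.cost_per_mtok * expected_mtok) / entry.accuracy
```

The ratio collapses cost and accuracy into a single sortable number, so the UI could rank models by expected cost per correct result rather than raw price.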

Do you have thoughts on the technical implementation?

  1. Integrate the meter into the model selection workflow.
  2. Pull cost and accuracy metrics dynamically from model data, perhaps from existing SWE-bench results for each model?
  3. Allow users to input parameters like budget and level of familiarity with a language to generate personalized recommendations.
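
To make point 3 concrete, here is a rough sketch of a personalized recommendation, reusing the hypothetical `ModelMeterEntry` above; the budget and familiarity weighting is purely illustrative, not a proposed final formula:

```python
from typing import Optional


def recommend(
    entries: list[ModelMeterEntry],
    budget_per_mtok: float,
    familiarity: float,
) -> Optional[ModelMeterEntry]:
    """Pick the cheapest model whose accuracy clears a familiarity-based bar.

    familiarity: 0.0 = unfamiliar language (user wants high accuracy)
                 1.0 = very familiar (user can fix mistakes themselves)
    """
    # The less familiar the user is, the higher the required accuracy.
    min_accuracy = 0.9 - 0.4 * max(0.0, min(1.0, familiarity))
    affordable = [
        e for e in entries
        if e.cost_per_mtok <= budget_per_mtok and e.accuracy >= min_accuracy
    ]
    if not affordable:
        return None  # nothing qualifies; fall back to showing the full meter
    return min(affordable, key=lambda e: e.cost_per_mtok)
```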

Describe alternatives you've considered

  1. Hardcoding static recommendations based on language or task.

This feature could enhance the user experience by reducing friction in selecting the right model while aligning costs with needs. It could also serve as a unique value proposition for the project.

@kvchitrapu kvchitrapu added the enhancement New feature or request label Dec 27, 2024
@neubig neubig self-assigned this Dec 28, 2024

neubig commented Dec 28, 2024

I'll try to work on this (based on @xingyaoww 's spreadsheet) unless someone else would like to take it.


BradKML commented Dec 29, 2024

Okay, so how do we determine which models take priority to be put on the board? Does it include or exclude fine-tunes and specialized merges (see Sakana AI) that are not on conventional providers like OpenRouter? If it does, is it possible to include multiple benchmarks so people can see how consistent a model is (in case of benchmark hacking)?
Also, as a secondary question: would OpenRouter be treated as a universal standard, or should HuggingFace models loaded on a rented GPU be monitored instead?
