LLM Output Evaluation
This cluster discusses challenges in verifying and evaluating the correctness of Large Language Model (LLM) outputs, including tools and frameworks such as Giskard and DeepEval, and methods such as LLM-as-judge (a minimal sketch of that pattern is included at the end of the sample comments below).
Activity Over Time
Top Contributors
Keywords
Sample Comments
Hey HN! We've built this platform that allows you to evaluate how well your LLM implementation is performing, whether that be using open-source tools such as LangChain, LlamaIndex, or even your own internal framework. The idea is that you would use our open-source package (https://github.com/confident-ai/deepeval) to evaluate LLM outputs using criteria such as factual consistency, relevancy, bias, etc.
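For context on what a criteria-based check like this looks like, here is a minimal, self-contained sketch (not part of the quoted comment, and not the deepeval API itself, which is documented in the linked repo). It scores an output for factual consistency against a reference context using a crude word-overlap heuristic; real tools typically use NLI models or an LLM judge instead.

```python
import re

# Minimal sketch of a criteria-based LLM output check.
# NOT the deepeval API -- just an illustration of the idea. Real
# implementations use NLI models or LLM judges rather than word overlap.

def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def factual_consistency_score(output: str, context: str) -> float:
    """Crude proxy: fraction of output words that also appear in the context."""
    out_words, ctx_words = _words(output), _words(context)
    if not out_words:
        return 0.0
    return len(out_words & ctx_words) / len(out_words)

def evaluate(output: str, context: str, threshold: float = 0.6) -> bool:
    score = factual_consistency_score(output, context)
    print(f"factual consistency: {score:.2f} (threshold {threshold})")
    return score >= threshold

if __name__ == "__main__":
    context = "The Eiffel Tower is 330 metres tall and located in Paris."
    output = "The Eiffel Tower, located in Paris, is about 330 metres tall."
    assert evaluate(output, context)
```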
How do you verify your LLM's output?
LLMs can't evaluate their own output. LLMs suggest possibilities, but can't evaluate them. Imagine an insane man who is rambling something smart but doesn't self-reflect. The evaluation is done against some framework of values that are considered true: the rules of a board game, the language syntax, or something else. LLMs also can't fabricate evaluation, because the latter is a rather rigid and precise model, unlike natural language. Otherwise you could set up two LLMs ques…
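The comment's point about evaluating against a rigid external framework (game rules, language syntax) can be made concrete: instead of asking the model to grade itself, you check its output with a validator that encodes the rules. A small sketch, assuming the task is generating Python code, so the "framework of values" is the language grammar:

```python
import ast

def is_valid_python(generated_code: str) -> bool:
    """Evaluate LLM output against a rigid external model: Python's grammar.
    The check is done by the parser, not by the LLM itself."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

# Hypothetical outputs an LLM might produce for "write a function that doubles x"
candidates = [
    "def double(x):\n    return x * 2",
    "def double(x) return x * 2",   # missing colon: rejected by the parser
]
for code in candidates:
    print(is_valid_python(code), repr(code[:30]))
```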
How do you verify that the output is correct in those areas where the LLM dwarfs your knowledge?
and each LLM can invent some ridiculous surprise. Who is going to check if it did the right thing?
Curious how you can even tell the LLM result is correct when you are apparently unable to validate it with other methods.
Evals do help to account for correctness when it comes to LLMs
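As a concrete illustration of what "evals" means here (not part of the quoted comment): a tiny harness that runs a model over labelled cases and reports accuracy. `call_llm` is a hypothetical stand-in for whatever model client you actually use.

```python
# Tiny eval harness sketch. `call_llm` is a hypothetical placeholder for
# a real model client (API call, local model, etc.).

def call_llm(prompt: str) -> str:
    # Canned answers so the sketch runs end to end; replace with a real call.
    canned = {
        "What is 2 + 2? Answer with a number only.": "4",
        "Capital of France? One word.": "Paris",
    }
    return canned.get(prompt, "")

EVAL_CASES = [
    {"prompt": "What is 2 + 2? Answer with a number only.", "expected": "4"},
    {"prompt": "Capital of France? One word.", "expected": "Paris"},
]

def run_evals(cases=EVAL_CASES) -> float:
    correct = 0
    for case in cases:
        answer = call_llm(case["prompt"]).strip()
        if answer.lower() == case["expected"].lower():
            correct += 1
    accuracy = correct / len(cases)
    print(f"accuracy: {accuracy:.0%} ({correct}/{len(cases)})")
    return accuracy

if __name__ == "__main__":
    run_evals()
```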
You evaluate whether they critically review the LLM's answer or just take it as truth.
Yes, the LLM will give you an answer. Are you verifying that what the LLM tells you is correct? How would you even know?
Thanks! LLM testing is a specific challenge, and we're interested in your feedback on our alpha version. Here's a notebook to try it: https://docs.giskard.ai/en/latest/reference/notebooks/llm_co...
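Finally, since the cluster description mentions LLM-as-judge: the pattern is simply to prompt a second (usually stronger) model to grade the first model's answer against a rubric. A hedged sketch, with `call_judge_model` as a hypothetical placeholder rather than any particular vendor's API:

```python
# LLM-as-judge sketch. `call_judge_model` is a hypothetical placeholder
# for whatever chat-completion client you use for the judge model.

JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def call_judge_model(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion call.
    # Returns a canned grade so the sketch runs end to end.
    return "4"

def judge(question: str, reference: str, candidate: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # lowest score if the reply can't be parsed

if __name__ == "__main__":
    score = judge("What is the boiling point of water at sea level?",
                  reference="100 degrees Celsius",
                  candidate="Roughly 100 °C at standard pressure.")
    print("judge score:", score)  # 4 with the canned placeholder
```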