LLM Evaluation: From Experimentation to Production
By Piergiacomo De Marchi
July 23, 2024
It seems like everybody in tech is working on a new AI project. But how many of these generative AI (GenAI) initiatives have made it into production?
A recent Gartner report found that while tech executives are full of enthusiasm for AI initiatives, actual production deployment rates remain below 10%. An Intel article supports this, explaining that a lack of reliability is a major obstacle to adopting large language model (LLM) platforms.
Consistent and accurate results are crucial for building scalable LLM-powered applications, making rigorous LLM evaluation essential.
LLM Evaluation
LLM Evaluation is the process of ensuring that the outputs of language models and LLM-powered applications align with human intentions, meeting desired quality, performance, safety, and ethical standards.
- LLM model evaluation looks at the overall performance of the foundational model across a range of different general tasks using well-known benchmarks.
- LLM system evaluation, or LLM task evaluation, examines the performance of the entire application in specific real-world use cases. An LLM system can be composed of multiple components like function calling (for agents), retrieval systems (in RAG), response caching, and multiple model calls, where benchmarks alone are insufficient.
The term “LLM evals” is often used loosely to cover both, but understanding the distinction is crucial. AI engineers should not rely solely on LLM leaderboards to assess the quality of LLM systems built for real-world applications, as these benchmarks are inadequate proxies for human preferences on a specific task.
LLM as a Judge
The concept of LLM-as-a-judge (Zheng et al., 2024) involves using an LLM to evaluate the outputs of another LLM system, providing both a score and an explanation.
Generating human labels is expensive, making them often scarce or unavailable. Using an LLM as a judge addresses this issue by providing an automated, scalable solution.
While it might seem risky to use a probabilistic LLM to evaluate another LLM system, it's comparable to how we use human intelligence to evaluate human performance in exams or job interviews. This automated approach has proven effective (Zheng et al., 2024), streamlining the evaluation process while maintaining high standards.
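As a rough illustration (not the exact setup from Zheng et al., 2024), here is a minimal LLM-as-a-judge sketch in Python. It assumes the openai package is installed and an OPENAI_API_KEY is set; the rubric and scoring scale are purely illustrative.

```python
# Minimal LLM-as-a-judge sketch: one model call returns a score and an explanation.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment; the rubric is illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Rate the ASSISTANT ANSWER to the USER QUESTION
on a scale from 1 (poor) to 5 (excellent) and briefly explain your reasoning.
Respond as JSON: {{"score": <int>, "explanation": "<string>"}}

USER QUESTION: {question}
ASSISTANT ANSWER: {answer}"""

def judge(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the judge as deterministic as possible
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)

# judge("What is RAG?", "Retrieval-Augmented Generation...")  ->  {"score": 4, "explanation": "..."}
```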
LLM System Evaluation Metrics
The choice of the LLM evaluation metric to use is crucial and heavily depends on the application and task you want to evaluate.
- Summarization: Does the summary accurately, cohesively, and relevantly represent the input content?
- Retrieval-Augmented Generation (RAG): Are the retrieved documents and the final generated answer relevant to the user's query?
- Question Answering: How effectively does the system answer the user's question?
- Named Entity Recognition (NER): How accurately does the model identify and classify entities in the text?
- Text-to-SQL: How correctly and efficiently does the model translate natural language queries into SQL commands?
A few examples of these metrics are listed in the table below.
Metric | Task | Details |
---|---|---|
BERTScore | Summarization | Computes a similarity score between each token in the candidate (generated) text and each token in the reference (ground truth) text. |
Answer Correctness | RAG | Evaluates the accuracy of the generated answer compared to the reference (ground truth) answer. |
Context Precision | RAG | Measures whether relevant context chunks are ranked higher than irrelevant ones in the retrieved list, with respect to a given reference (ground truth). |
Toxicity | Any | Checks whether responses are racist, biased, or toxic. |
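To make one of these metrics concrete, here is a small example computing BERTScore with the open-source bert-score package (assuming `pip install bert-score`); the candidate and reference texts are invented.

```python
# Computing BERTScore for a generated text against a reference (pip install bert-score).
# The example texts are made up; F1 is the score most commonly reported.
from bert_score import score

candidates = ["The report says production deployment of GenAI remains below ten percent."]
references = ["According to the report, fewer than 10% of GenAI projects reach production."]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```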
LLM System Evaluation Stages: Offline and Online
Bringing an LLM system from experimentation to production involves a multistep approach.
There are three stages in the LLM evaluation and monitoring cycle: Development, Continuous Integration/Continuous Evaluation/Continuous Deployment (CI/CE/CD), and Production.
- Development: The developer codes a new LLM feature on their laptop. Before pushing the changes to GitHub, they conduct preliminary evaluations (tests) on a small sample of data entries (5 to 10) to ensure basic functionality.
- CI/CE/CD: Satisfied with the initial tests, the developer pushes the feature to GitHub. This triggers a fully automated evaluation pipeline that runs comprehensive tests on a larger dataset covering the known edge cases.
- Production: If all tests pass during the Continuous Integration pipeline, the new code changes are deployed to production. However, evaluation continues post-deployment. Real end-users can introduce unexpected inputs and models can drift, so the LLM system must be constantly monitored, and its outputs evaluated before returning them to users. Continuous LLM Application Performance Monitoring (LLM-APM) is critical.
Example: RAG Evaluation (Answer Correctness)
Let's walk through the three stages of LLM evaluation with an example. We are going to use the Lynxius platform to evaluate a Retrieval-Augmented Generation (RAG) system, one of the most popular techniques for building applications with LLMs.
RAG leverages vector search to retrieve relevant documents based on the input query. These retrieved documents are then provided as context to the LLM, enhancing its ability to generate more informed and accurate responses. Instead of relying solely on patterns learned during training, RAG incorporates these relevant documents to produce a more contextually accurate and precise response.
In this example, we aim to calculate Answer Correctness to evaluate how accurately the RAG system's generated answer matches a reference (ground-truth) value. Our Answer Correctness Evaluator uses input entries from a test set consisting of question/reference pairs.
For simplicity, let's assume we have already built our RAG system, which uses OpenAI GPT-4o, and that we can query it with a simple get_answer call.
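The snippet below is one possible sketch of such a get_answer helper, not the actual implementation: it assumes an OpenAI API key in the environment and uses a placeholder retrieve() function standing in for whatever vector-search backend you use.

```python
# Hypothetical sketch of the RAG system behind get_answer: retrieve context, then ask GPT-4o.
# retrieve() is a placeholder for your vector-search backend.
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the top-k document chunks from your vector store."""
    raise NotImplementedError

def get_answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```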
Development Stage
Before pushing any changes to GitHub, we should conduct preliminary evaluations on a small test set to ensure basic functionality.
For this example, we'll hardcode the dataset directly into the source code rather than pulling it from the Lynxius platform.
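A hardcoded test set of question/reference pairs might look like the sketch below; the entries are invented for illustration.

```python
# A tiny hardcoded test set of question/reference pairs (contents are illustrative).
test_set = [
    {
        "question": "What is the capital of France?",
        "reference": "The capital of France is Paris.",
    },
    {
        "question": "Who wrote 'Pride and Prejudice'?",
        "reference": "'Pride and Prejudice' was written by Jane Austen.",
    },
]
```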
With the dataset ready, we can run our evaluation.
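Since the exact Lynxius evaluator calls aren't shown in this excerpt, the loop below is a platform-agnostic sketch of the Development-stage run: it reuses the hypothetical get_answer helper and OpenAI client from the earlier sketches and scores each answer against its reference with an LLM judge. Adapt it to the evaluator API you actually use.

```python
# Platform-agnostic sketch: generate an answer for each test entry and score
# Answer Correctness against the reference with an LLM judge.
import json

CORRECTNESS_PROMPT = """Compare the CANDIDATE answer to the REFERENCE answer for the QUESTION.
Return JSON: {{"score": <float between 0 and 1>, "explanation": "<string>"}}

QUESTION: {question}
REFERENCE: {reference}
CANDIDATE: {candidate}"""

def answer_correctness(question: str, reference: str, candidate: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": CORRECTNESS_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return json.loads(response.choices[0].message.content)

results = []
for entry in test_set:
    answer = get_answer(entry["question"])
    results.append(answer_correctness(entry["question"], entry["reference"], answer))

avg = sum(r["score"] for r in results) / len(results)
print(f"Average Answer Correctness: {avg:.2f}")
```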
The results on the Lynxius platform show that our RAG system correctly answered all the input queries, indicating a successful experiment.
Continuous Evaluation Stage
Satisfied with the Development Stage tests, we push the code changes to GitHub. This triggers our testing pipeline, which runs an evaluation script like the previous one, but with two key differences:
- The test set is larger and includes multiple edge cases.
- The evaluation results are compared with the Main Branch Baseline, which reflects the latest status of your system after the most recent code merge.
The logs show that the average Answer Correctness score of the latest PR has improved compared to the baseline, so we can merge this code into the main branch.
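As a rough illustration of what that comparison could look like inside a CI job, here is a minimal, hypothetical gate in Python; the baseline and tolerance values are invented.

```python
# Hedged sketch of a CI gate: fail the pipeline if the PR's average Answer Correctness
# drops more than a small tolerance below the Main Branch Baseline.
def check_against_baseline(pr_average: float, baseline: float, tolerance: float = 0.02) -> None:
    if pr_average < baseline - tolerance:
        raise AssertionError(
            f"Answer Correctness regressed: {pr_average:.2f} vs baseline {baseline:.2f}"
        )

# In the pipeline, pr_average would come from evaluating the large edge-case test set
# and baseline from the stored Main Branch Baseline; these numbers are illustrative.
check_against_baseline(pr_average=0.91, baseline=0.87)
```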
Production Stage
Knowing that our RAG system performs well on our large test set, we can confidently deploy it to production.
Once the system is live and used by real users, continuous monitoring becomes essential. Lynxius Tracing makes it easy to observe and evaluate production systems by simply using the @lynxius_observe decorator. It is important to note that reference (ground-truth) values are not available in production, so evaluations happen without labels.
With the @lynxius_observe decorator in place, real user queries are evaluated in real time, enabling the system to promptly handle failures by generating a new response, warning the user about potential inaccuracies, or taking other corrective actions.
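The internals of @lynxius_observe aren't shown here, so the decorator below is a generic, hypothetical stand-in that only illustrates the pattern: wrap the production entry point, run a reference-free check on each response, log the result, and take corrective action on low scores.

```python
# Hypothetical stand-in for an observe/trace decorator (not the actual @lynxius_observe
# implementation): it logs each call and runs a reference-free evaluator on every response.
import functools, logging, time

logging.basicConfig(level=logging.INFO)

def observe(reference_free_eval):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(query: str) -> str:
            start = time.perf_counter()
            answer = func(query)
            verdict = reference_free_eval(query, answer)  # e.g. toxicity or LLM-judge score
            logging.info("query=%r latency=%.2fs verdict=%s",
                         query, time.perf_counter() - start, verdict)
            if verdict.get("score", 1.0) < 0.5:
                # Corrective action: regenerate, warn the user, or escalate.
                return "I'm not confident in my answer; please rephrase or contact support."
            return answer
        return wrapper
    return decorator

@observe(reference_free_eval=lambda q, a: {"score": 1.0})  # plug in a real evaluator here
def get_answer_production(query: str) -> str:
    return get_answer(query)  # the RAG call sketched earlier
```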
Tracing is ideal for tracking production KPIs, so you always know the real-time status of your system and can compare your development branches against it when needed.
Conclusion
Transitioning an LLM system from experimentation to production requires meticulous evaluation, rigorous testing, and continuous monitoring. By leveraging the Lynxius platform, you can streamline this process, ensuring your system maintains high performance and reliability standards under real-world conditions. Start using Lynxius for free today to speed up your LLM evaluation and monitoring practices.