Effectively Evaluate My LLM App

Evaluating the performance of an LLM app can be a long process, and it is difficult to do correctly without suitable tools. This page gathers some advice for evaluating your LLM app and provides practical guidance on using Wikit Semantics' evaluation tool.

🧑‍🔬 Some tips for "properly" testing

When to conduct tests?

Testing is an important phase, usually carried out before the chatbot is deployed. However, even after deployment, a chatbot always benefits from regular tests to improve its performance or to make sure it remains stable. For example, it can be useful to run a few tests:

  • After adding/modifying numerous documents and/or data sources
  • After a significant modification to the prompt used
  • After changing the models used

This list is not exhaustive. The important thing is to keep in mind that any significant change to the many building blocks that make up the chatbot can influence its behavior. A few quick tests are therefore often useful to verify that the quality of the generated responses has not degraded. Prevention is better than cure!

How to test?

By asking the bot questions and checking the quality of the responses it produces! While this may seem obvious, you can draw misleading conclusions if testing is not done properly. Here are some tips:

  • Ask questions that are close to the bot's real-world usage: put yourself in your users' shoes. What questions will they ask? How will they phrase them? Keywords or complete sentences?
  • Ask questions covering the chatbot's entire scope: do not fixate on a single question that you keep rephrasing, do not get stuck on the first hallucination you encounter, and do not ignore an entire domain of competence the chatbot is supposed to cover. A short illustration follows this list.
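
For example, the same information need can usually be phrased in several ways, and it is worth testing more than one of them. The phrasings below are purely hypothetical:

    Complete sentence:  "How do I request access to the expense reporting tool?"
    Shorter question:   "Can I get access to the expense tool?"
    Keywords only:      "expense tool access"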

🧪 Leverage Wikit Semantics' evaluation feature

To accelerate this process, you can use the automatic evaluation feature available in the Wikit Semantics console!

Accessing the evaluation

  • Select "LLM Apps" in the sidebar
  • Click on the 3 dots to open the menu of the LLM app you want to evaluate, then click on "Manage LLM app"
  • Select the "Evaluation" tab

Process to follow to access the evaluation of an LLM app

Create your evaluation dataset

To create a dataset, start by clicking on "create a dataset".

The following screen allows you to compose your own questions, as well as the associated expected answers.

Screen for creating your evaluation dataset

Some recommendations regarding the questions:

  • Ask questions relevant to the chatbot's real-world usage: what will your users ask?
  • Ensure that the answers to the questions are present in the data sources to which the LLM app has access.
  • Ensure that the questions cover a fairly wide range of requests (avoid asking the same question multiple times).

💬 Some recommendations regarding the answers:

  • Ideally and when possible, paraphrase the document containing the answer.
  • Create complete and informative answers, in a format close to what the LLM should produce. Avoid answers that are too short (for example: "In 2015." or "Contact Mr. Dupont"). A hypothetical example is given below.
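
To make these recommendations concrete, here is what a dataset entry could look like. The scenario and wording are purely illustrative; in practice you type the questions and expected answers directly into the console:

    Question:         How many days of remote work am I allowed per week?
    Expected answer:  Employees may work remotely up to two days per week, subject to
                      their manager's approval. Requests must be submitted through the
                      HR portal at least one week in advance.

Note that the expected answer is a full, self-contained statement rather than a bare figure such as "Two days.": this is the kind of response the LLM is expected to produce, and it gives the evaluation something substantial to compare against.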

Once your dataset is created, you just need to click on "Launch evaluation" to evaluate your LLM app 🧪!

Interpreting your results

This section helps you understand the results obtained from the evaluation.

View of the results screen.

On the upper part of the screen, you can see the overall score attributed to your LLM app. This score, between 0 and 100%, reflects the overall quality of the responses generated by the app. This score goes further than simply evaluating whether the answer is right or wrong: it reflects the proximity of the generated response to the expected response in content (factual accuracy) AND in form (length, structure)!

On the lower part, you can view the generated responses and compare them with your expected responses. You can thus directly verify that the model omits nothing. A small indicator also tells you if the response can be considered right or wrong. Keep in mind that since this evaluation is automatic, it can sometimes make mistakes 😉.

What if my results are not up to my expectations?

Keep in mind that designing a high-performing chatbot and evaluating it is generally an iterative process. While modifying the prompt is the most direct lever for influencing the chatbot's responses, several things should be checked before pulling it:

  • Observe the generated responses and compare them with the expected responses. Shouldn't some expected responses be replaced by the generated responses, which are sometimes more complete and informative?
  • Verify that the expected response is indeed present in the documents the LLM app has access to. The issue is sometimes a simple oversight: a data source that was not activated, or a document that was never uploaded to it.
  • Check the quality of the source documents and associated fragments. A chatbot, before being an AI, is first and foremost a well-structured data source!
  • Modify the prompt so that the structure of the responses better matches what is expected (length, structure, ...); a hypothetical example is given after this list.
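
For instance, on that last point, instructions added to the prompt could look like the following. These lines are purely illustrative and should be adapted to your own use case:

    Answer in 2 to 4 complete sentences.
    When the sources contain contact details (name, email address, phone number), include them in the answer.
    Never answer with a single word or a bare date.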

To go further

This section details how a response's score is calculated.

A response's score is based on the resemblance of the generated response to the expected response in content AND in form. To obtain a good score, the generated response must exhibit both.

The following figure shows an example:

Example of results that can be expected for 3 generated responses.

Automatic evaluation of response quality relies mainly on 2 tools:

  • recognition and comparison of entities (names, phone numbers, email addresses, links, ...) between the expected and generated responses;
  • semantic proximity between the expected and generated responses (does the generated response address the same subject as the expected response? Does it use the same terminology?).

Aggregating these scores gives an overall picture of the quality of the response with respect to what is expected.
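
The exact formula used by Wikit Semantics is not detailed here, but a minimal sketch of this kind of aggregation, in Python, could look like the following. The regular expressions, the embedding model and the equal weighting of the two signals are assumptions made for illustration, not the actual implementation:

    import re
    from sentence_transformers import SentenceTransformer, util

    # Illustrative patterns for a few of the entity types mentioned above
    # (email addresses, phone numbers, links); the real tool may cover more.
    ENTITY_PATTERNS = [
        r"[\w.+-]+@[\w-]+\.[\w.-]+",   # email addresses
        r"\+?\d[\d .-]{7,}\d",         # phone numbers
        r"https?://\S+",               # links
    ]

    # Any sentence-embedding model would do; this one is an assumption.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def extract_entities(text: str) -> set:
        return {m for pattern in ENTITY_PATTERNS for m in re.findall(pattern, text)}

    def entity_score(expected: str, generated: str) -> float:
        """Share of the entities in the expected response that also appear in the generated one."""
        expected_entities = extract_entities(expected)
        if not expected_entities:
            return 1.0  # nothing specific to check
        found = expected_entities & extract_entities(generated)
        return len(found) / len(expected_entities)

    def semantic_score(expected: str, generated: str) -> float:
        """Cosine similarity between the two responses' embeddings, clipped to [0, 1]."""
        embeddings = model.encode([expected, generated], convert_to_tensor=True)
        return max(0.0, util.cos_sim(embeddings[0], embeddings[1]).item())

    def response_score(expected: str, generated: str) -> float:
        """Aggregate the two signals; the 50/50 weighting is an assumption."""
        return 0.5 * entity_score(expected, generated) + 0.5 * semantic_score(expected, generated)

With a per-response score of this kind, the overall score shown at the top of the evaluation screen could then simply correspond to an average over the whole dataset.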