
Author: Nomic and Patronus

Evaluating LLM Hallucination Benchmarks with Embeddings

As machine learning practitioners and researchers know, the first step to building any machine learning model is looking at your data. For their latest model release, Patronus AI worked with Nomic to validate the quality of their new LLM hallucination benchmark dataset, HaluBench! HaluBench is a benchmark for measuring a language model's propensity to hallucinate.

Each datapoint in HaluBench consists of a context passage of text, a question about the context, an answer, and a label signifying whether the answer is a hallucination relative to the context.
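Concretely, a single row can be pictured as follows. The field names and values here are hypothetical, shown only to illustrate the (context, question, answer) triplet plus label shape; see the dataset itself for the exact schema:

```python
# Hypothetical HaluBench-style row: field names and values are illustrative.
example_row = {
    "id": "example-0001",                                          # unique row identifier
    "passage": "The Eiffel Tower, completed in 1889, stands in Paris.",  # context
    "question": "When was the Eiffel Tower completed?",
    "answer": "It was completed in 1889.",
    "label": "PASS",  # e.g. PASS = faithful to the passage, FAIL = hallucinated
}
```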

Below is all of Patronus' HaluBench loaded into an Atlas dataset with an Atlas map built over text embeddings of the answer column.

Each point on the Atlas map represents a row in the HaluBench dataset. Points that are close together have answers that are semantically similar in the embedding space of Nomic Embed Text.

Floating topic labels give a bird's-eye view of the topical distribution of the answer column of the dataset. For example, all the rows that have an answer that is topically about medicine reside in the bottom right cluster.

Explore HaluBench yourself in the Nomic Atlas embedding space map below. Use the filter tools on the left pane to search, semantically lasso, and filter HaluBench, and the aesthetic tools on the right to slice and dice it along its attributes.

When working with classification datasets in Atlas, it's often useful to color the dataset by the target label and analyze the data distribution relative to the class label. This can often reveal regions of data that are either easy or difficult to classify.

Doing so on HaluBench shows a surprising view of the data: points with similar class labels cluster together.

Remember, the Atlas map organizes HaluBench by answers that are semantically similar in the Nomic Embed text latent space.

This suggests that you can build a very good hallucination classifier on HaluBench just by looking at the textual content of the answer column, ignoring the passage and question entirely!
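To make that concern concrete, here is a minimal sketch of an answer-only classifier. It uses a bag-of-words count vector as a crude stand-in for a real text embedding (such as Nomic Embed) and 1-nearest-neighbor as the classifier; the labeled answers below are invented for illustration, not drawn from HaluBench:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a text embedding: a bag-of-words count vector."""
    return Counter(text.lower().strip(".").split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented labeled answers standing in for HaluBench rows.
train = [
    ("The Eiffel Tower was completed in 1889.", "PASS"),
    ("The Eiffel Tower was completed in 1750.", "FAIL"),
    ("Water boils at 100 degrees Celsius at sea level.", "PASS"),
    ("Water boils at 50 degrees Celsius.", "FAIL"),
]

def predict(answer):
    # 1-nearest-neighbor over the answer text ONLY: no passage, no question.
    vec = embed(answer)
    return max(train, key=lambda t: cosine(vec, embed(t[0])))[1]

print(predict("The Eiffel Tower was finished in 1889."))  # -> PASS
```

If a classifier this simple, with no access to the context, performs well on a hallucination benchmark, the benchmark is leaking the label through the answer text itself.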

On the surface, this is a troubling sign: the text of the answer alone appears to be highly correlated with whether the (context, question, answer) triplet is a hallucination!

You can ignore the context and still do well on the benchmark! Does that mean HaluBench is broken?

Let's investigate further.

After a bit of exploration, you'll notice the HaluBench dataset is made up of several subsets containing hallucination data triplets from various text domains. More precisely, HaluBench consists of HaluEval, RAGTruth, and novel datasets created by Patronus AI.

Filtering down and inspecting specific dataset subsets, you'll notice that only the HaluEval subset contains the above spurious correlation between answer semantics and class label.

HaluEval is a hallucination benchmark released in 2023 by a team not affiliated with Patronus, and is included as a subset of HaluBench. Since HaluEval has been the main hallucination benchmark until now, the Patronus team decided to include it as a building block while adding their own novel data.

Inspecting the HaluBench subsets created by Patronus, you'll notice no comparable correlation between the semantics of the answer and the class label, indicating that Patronus' benchmark data does not suffer from the same failure points as other benchmarks! That makes HaluBench a more comprehensive and rigorous benchmark than HaluEval.

With the power of embeddings and Atlas, the Patronus team was able to quickly identify low-quality benchmark datapoints and in turn increase the quality of their flagship LLM hallucination benchmark for their customers!

Trusting your evaluation provider means trusting the data their evaluations are built on.

We're excited to enable Patronus with observability into their research team's unstructured datasets with Atlas!

Map HaluBench Yourself

You can load HaluBench into Atlas yourself by creating a Nomic account and running the following code snippet:

from nomic import atlas
from datasets import load_dataset

# Download HaluBench from the Hugging Face Hub
hf_dataset = load_dataset("PatronusAI/halubench")

# Build an Atlas map over embeddings of the `answer` column
dataset = atlas.map_data(
    identifier='halubench',
    data=hf_dataset['test'].to_pandas(),
    is_public=True,
    id_field='id',
    indexed_field='answer'
)