The MTEB benchmark is dead


It has been for a while; we ended up building our own test set to evaluate embedding models on our domain.

What we realized after doing this is that MTEB has always been a poor indicator, because embedding model performance varies wildly in-domain compared to out-of-domain. You'll get decent performance (let's say 70%) with most models, but eking out gains beyond that is domain-dependent more than it is model-dependent.

Personally I recommend NV-Embed because it's easy to deploy and to get the other performance measurements (e.g. speed) up to spec. You can then simply enrich the data itself, e.g. by using an LLM to create standardized artifacts that point back to the original text, kind of like an "embedding symlink."
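
A minimal sketch of that enrichment step, assuming a generic prompt-in/text-out LLM callable; the names here (EmbeddingArtifact, build_artifacts, the prompt template) are just placeholders, not part of any library:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class EmbeddingArtifact:
        source_id: str          # the "symlink": points back to the original record
        standardized_text: str  # LLM-normalized text that actually gets embedded

    PROMPT = ("Rewrite the following record as a short, standardized description "
              "using our domain vocabulary:\n\n{text}")

    def build_artifacts(records: dict[str, str],
                        llm: Callable[[str], str]) -> list[EmbeddingArtifact]:
        # `llm` is any prompt -> completion function; pass in your own client here.
        return [EmbeddingArtifact(source_id=doc_id,
                                  standardized_text=llm(PROMPT.format(text=text)))
                for doc_id, text in records.items()]

    # Embed artifact.standardized_text for retrieval, then follow source_id back to
    # the raw document, so the original text never has to be embedding-friendly.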

Our general observation has been that after standardizing the data, the top-n models mostly perform the same.

Unfortunately it requires commercial licensing. I spoke with them a while ago about pricing and it was awfully expensive for something that is just one part of a larger product. We have been trying other common open-source models, and the results have been comparable when using them for retrieval on our domain-specific data.

Datasets need to stop shipping with any training sets at all! And their licenses should forbid anyone from using the test set to update the parameters of any model.

We did this with ObjectNet (https://objectnet.dev/) years ago. It's only a test set, no training set provided at all. Back then it was very controversial and we were given a hard time for it initially. Now it's more accepted. Time to make this idea mainstream.

No more training sets. Everything should be out of domain.

I don't know how this is possible with LLM tests. The closed-source models will get access to at least the questions when you send them over the fence via API.

This gives closed-source models an enormous advantage over open-source models.

The FrontierMath dataset has this same problem[1].

It's a shame because creating these benchmarks is time consuming and expensive.

I don't know of a way to fix this except perhaps partially by using reward models to evaluate results on random questions instead of using datasets, but there would be a lot of reproducibility problems with that.
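
Roughly what I mean, as a sketch: the question generator and judge below are stand-ins for whatever models you would trust in those roles, and the seed only buys partial reproducibility.

    import random
    from typing import Callable

    def judged_eval(model: Callable[[str], str],
                    gen_question: Callable[[random.Random], str],
                    judge: Callable[[str, str], float],
                    n: int = 100, seed: int | None = None) -> float:
        # Sample fresh questions instead of reading from a fixed test set,
        # then let a judge/reward model score each answer in [0, 1].
        rng = random.Random(seed)
        scores = []
        for _ in range(n):
            question = gen_question(rng)
            answer = model(question)
            scores.append(judge(question, answer))
        return sum(scores) / len(scores)

    # Caveat: even with a fixed seed, the generator and judge are themselves
    # models, so reruns won't be exactly comparable across versions.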

Still -- not sure how to overcome this.

[1]: https://news.ycombinator.com/item?id=42494217

It's possible.

I'm not worried about cheaters. We just need to lay out clear rules. You cannot look at the inputs or outputs in any way. You cannot log them. You cannot record them for future use, either manually or in an automated way.

If someone cheats, they will be found out. Their contribution won't stand the test of time, no one will replicate those results with their method. And their performance on datasets that they cheated on will be astronomical compared to everything else.

FrontierMath is a great example of a failure in this space. By going closed, instead of using a license, they've created massive confusion. At first they told us that the benchmark was incredibly hard, and they showed reviewers subsets that were hard. Now they're telling us that, actually, 25% of the questions are easy and 50% of the questions are pretty hard, and only a small fraction are what the reviewers saw.

Closed datasets aren't the answer. They're just unscientific nonsense. I refuse to even consider running on them.

We need test sets that are open for scrutiny. With licenses that prevent abuse. We can be very creative about the license. Like, you can only evaluate on this dataset once, and must preregister your evaluations.

I would like to agree with you, but I doubt the honor system will work here. We are talking about companies that have blatantly trampled (or are willing to risk a judicial confrontation about trampling) copyright. It would be unreasonable to assume they would not engage in the same behavior about benchmarks and test sets, especially with the amount of money on the line for the winners.

I understand the idea but I don't think that it is beneficial in the end.

Access to the dataset is needed to understand why we get a given result: first from a transparency point of view, to check whether the results make sense and why one model is favored over another.

But it is also needed to understand why a model performs badly on some aspect, so we can determine how to improve it.

The MTEB benchmark was never that great, since embeddings are used for domain-specific tasks (e.g. search/clustering) that can't really be represented well in a generalized test, even more so than LLM next-token-prediction benchmarks, which aren't great either.

As with all LLMs and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.

> As with all LLMs and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.

This is excellent advice. Sadly, very few people/organizations implement their own evaluation suites.

It doesn't make much sense to put data infrastructure in production without first evaluating its performance (IOPS, uptime, scalability, etc.) on internal workloads; it is no different for embedding models or models in general for that matter.
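
A bare-bones version of such an eval for retrieval, assuming you have labeled query-to-relevant-document pairs from your own data; sentence-transformers and the model name at the bottom are only examples, swap in whatever you actually deploy:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def recall_at_k(model_name, docs, queries, relevant, k=10):
        """docs: {doc_id: text}, queries: {query_id: text},
        relevant: {query_id: set of relevant doc_ids}."""
        model = SentenceTransformer(model_name)
        doc_ids = list(docs)
        doc_vecs = model.encode([docs[d] for d in doc_ids], normalize_embeddings=True)
        query_vecs = model.encode([queries[q] for q in queries], normalize_embeddings=True)

        hits = 0
        for qi, q in enumerate(queries):
            scores = doc_vecs @ query_vecs[qi]           # cosine similarity (vectors are normalized)
            top_k = [doc_ids[i] for i in np.argsort(-scores)[:k]]
            hits += bool(relevant[q] & set(top_k))       # did any relevant doc land in the top k?
        return hits / len(queries)

    # recall_at_k("all-MiniLM-L6-v2", docs, queries, relevant, k=10)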

I’m not closely familiar with this benchmark, but data leakage in machine learning can be way too easy to accidentally introduce even under the best of intentions. It really does require diligence at every stage of experiment and model design to strictly firewall all test data from any and all training influence. So, not surprising when leakage breaks highly publicized benchmarks.

> data leakage in machine learning can be way too easy to accidentally introduce even under the best of intentions

And lots of people in this space definitely don’t have the best of intentions.

I feel this is common throughout all of training, even on public data. Every time we talk about something specific at length, that becomes part of the training data and that influences the models. For example, ask a problem about a butterfly flapping its wings causing a tornado and all modern LLMs immediately recognize the classic example of chaos theory, but change the entities and suddenly it's not so smart. Same thing for the current fixation on the number of Rs in strawberry.

There was recently a post showing how an LLM could actively try to deceive the user to hide its conflicting alignment, and how, with a chain-of-thought-style prompt, it did this very deliberately. However, the thought process it produced and the wording sounded exactly like every example of this theoretical alignment problem. Given that an LLM chooses the most probable tokens based on what it has seen in training, could it be that we unintentionally trained it to respond this way?