> We still do not know, for instance, what makes a good tokeniser (Gowda and May, 2020; Cognetta et al., 2024): which characteristics should its produced subwords `s` have to be a good starting point for language modelling? If we knew this, then we could define an objective function which we could evaluate tokenisers with.
I don't see how the authors get past this true, general statement from the first paragraph of their own introduction. Finding a good tokenizer is not just NP-hard; we have no idea how hard it might be, because we don't have theoretical agreement on what "good" means.
In order to have something to prove, the authors decide (somewhat arbitrarily):
> Specifically, we focus on finding tokenisers that maximise the compression of a text. Given this objective, we then define the tokenisation problem as the task of finding a tokeniser which compresses a dataset to at most δ symbols.
Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.
I'm also not sure how realistic the limitation to "at most δ symbols" is. I mean, that limit is undeniably useful to make the proof of NP-completeness go through, because it's a similar mechanism to the minimum number of satisfied clauses in the MAX-2-SAT definition. But why not just keep adding tokens as needed, rather than imposing any preordained limit? IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings. When that tokenizer was being designed, I don't imagine they worried much about whether the final number ended up at 60k or even 100k. How could you possibly choose a meaningful δ from first principles?
To put that point a different way, imagine the authors had proven NP-completeness by reduction from the Knapsack Problem, where the knapsack you're packing has some maximum capacity. If you can easily swap your knapsack out for a larger knapsack whenever it gets (close to) full, then the problem becomes trivial.
If the authors had managed to prove that any arbitrary objective function would lead to an NP-hard tokenizer optimization problem, then their result would be more general. If the paper proves that somewhere, I missed it.
I suppose this paper suggests "here be dragons" in an interesting if incomplete way, but I would also say there's no need to hurt yourself with an expensive optimization problem when you're not even sure it delivers the results you want.
Can anyone explain to a layman like myself why it wouldn't be better if we just used bytes as tokens?
There are two problems:
1) the sequence length increases too much. Idk what the average token length is for Llama, but imagine it's something like 5+ bytes. Using individual bytes as tokens immediately makes the context ~5x longer, which is super bad for inference speed and memory requirements, since attention inference is quadratic in the length of the sequence (see the rough sketch after this list).
2) individual bytes have essentially no meaning, so byte embeddings are harder to learn. Subword tokens aren't a perfect solution, but they definitely often have some standalone meaning where embeddings make sense.
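A rough back-of-the-envelope for point 1, assuming the `tiktoken` package as a stand-in for a GPT-4-style BPE vocabulary (any subword tokenizer would show the same trend; the exact numbers depend on the text):

```python
# Compare sequence length under byte-level vs subword tokenization.
# Assumes `pip install tiktoken`.
import tiktoken

text = "Tokenisation is the process of splitting text into subword units."

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-style BPE vocabulary
subword_len = len(enc.encode(text))
byte_len = len(text.encode("utf-8"))        # the "1 byte = 1 token" baseline

print(f"subword tokens: {subword_len}")
print(f"byte tokens:    {byte_len}")
print(f"blowup:         {byte_len / subword_len:.1f}x")
# Attention cost grows quadratically with length, so a ~4-5x longer
# sequence costs roughly 16-25x more attention compute.
```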
I'll give another example from a recent paper that tries to eliminate tokenizers (this is a popular research direction) [1].
Figure 4 is a really good example of why byte-level models waste computation. Once part of a word is generated, most of the remaining bytes are assigned basically probability 1. But a byte-level model would still have to spend time decoding them. With a subword-level model, most of these easy-to-decode bytes would be packed together into a single token, so you don't have to decode them individually.
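To make that concrete (toy numbers, not taken from the paper):

```python
# Decode steps needed to emit one already-predictable word.
word = "tokeniser"
byte_steps = len(word.encode("utf-8"))  # 9 forward passes, one per byte,
                                        # even if each byte has probability ~1
subword_steps = 1                       # if "tokeniser" happens to be a single token
print(byte_steps, subword_steps)
```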
When model APIs bill by the token, this is an important consideration.
It's not optimal, but if you want to play it safe you can use 1 byte = 1 token, plus a few control tokens like stop, user turn, model turn, etc.
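A minimal sketch of what that vocabulary could look like (the control-token names here are just placeholders):

```python
# Byte-level vocabulary: ids 0-255 are raw bytes, plus a few control tokens.
SPECIALS = ["<stop>", "<user_turn>", "<model_turn>"]
vocab_size = 256 + len(SPECIALS)                     # 259 entries
special_ids = {tok: 256 + i for i, tok in enumerate(SPECIALS)}

def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))                # every byte is its own token id

def decode(ids: list[int]) -> str:
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

print(vocab_size)            # 259
print(encode("hi"))          # [104, 105]
print(decode(encode("hi")))  # "hi"
```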
I don't know if this is related, but I work on a project with a language parser (and so a tokenizer), and one issue we have is that the parser/tokenizer was not designed to work on incomplete code, so it isn't useful for autocomplete.
How well the design supports writing and debugging code (e.g. producing useful error messages) seems like yet another metric that could be used.
I also don't know if tokenizers can be divorced from parsers. Some languages have constructs that require context that comes from the parser.
NP is a class of decision problems - problems with yes/no answers. Saying that it's NP-complete to find the tokenizer that produces the fewest symbols is meaningless. You have to convert it to the form "is there a tokenizer that produces fewer than N symbols?" before it even makes sense to ask whether it's NP-complete.
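For what it's worth, here is a toy brute-force version of that decision question (exponential, for illustration only; it uses greedy longest-match segmentation as a stand-in for whatever tokenisation function the paper actually assumes):

```python
# "Is there a vocabulary of at most K extra substrings that tokenizes
#  `text` into at most `delta` symbols?"  Brute force over candidate vocabularies.
from itertools import combinations

def greedy_tokenize(text, vocab):
    out, i = [], 0
    by_len = sorted(vocab, key=len, reverse=True)
    while i < len(text):
        match = next((w for w in by_len if text.startswith(w, i)), text[i])
        out.append(match)
        i += len(match)
    return out

def tokenizer_exists(text, K, delta):
    base = set(text)  # single characters are always available
    candidates = sorted({text[i:j] for i in range(len(text))
                         for j in range(i + 2, len(text) + 1)})
    for r in range(K + 1):
        for extra in combinations(candidates, r):
            if len(greedy_tokenize(text, base | set(extra))) <= delta:
                return True
    return False

print(tokenizer_exists("abab", K=1, delta=2))  # True: adding "ab" gives ["ab", "ab"]
print(tokenizer_exists("abab", K=0, delta=3))  # False: characters alone need 4 symbols
```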
I fully agree with your final statement, but needing to constrain the problem in an artificial way to prove it's NP-complete doesn't mean the constraint was justified or realistic, because then you've only proved the constrained version of the decision problem is NP-hard.
There might be plenty of perfectly "good" tokenizers (whatever that ends up meaning) that can be found or generated without formulating their design as an NP-complete decision problem. Claiming "tokenization is NP-complete" (paper title) in general seems like an overstatement.
If it's NP-hard to even know whether the answer is bigger or smaller than a certain number, then it's obvious that finding the exact answer is at least as hard as NP-hard, whatever that means.
Hi, I'm Cognetta from the above Cognetta et al. I can't answer all of your questions (and I can't speak for the authors of this paper ofc), but I will try to answer some.
> Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.
Compression isn't necessarily the best metric for language modeling quality [1][2][3], but there are some papers that find a correlation between it and quality [4] and also it has one important benefit: it reduces inference time by making the input sequences shorter (this is particularly important for transformers, because the runtime is quadratic in the sequence length).
If you imagine that with enough data, basically any reasonable tokenization algorithm would be ok (I think this is mostly true; there are definitely bad and "better" tokenizers and you see this very clearly in small data settings, but once you get into the trillions-of-tokens and 10s-of-billion-of-parameters setting, other things are going to matter more), then optimizing the tokenizer for compression is a good choice as it will provide tangible, practical benefits in the sense of reduced inference time.
> I'm also not sure how realistic the limitation to "at most δ symbols" is. [...] But why not just keep adding tokens as needed, rather than imposing any preordained limit?
This is a pretty realistic limitation imo. Of course you can arbitrarily increase the vocabulary size, but there is a tradeoff between modeling quality, parameter count, and inference time. If you increase the vocabulary a bunch, your inference speed will probably improve (although now you have a much larger softmax at the end of your model, which isn't usually a bottleneck anymore, but still not great), parameter count will increase (due to the larger embedding table), and your modeling quality will go down (in that you have tokens which are so rare in the corpus that they are massively undertrained; this can cause big problems [5]).
So by constraining it to δ, you are basically setting a parameter budget for the vocabulary, and this is a pretty reasonable thing to do.
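To put rough numbers on the "parameter budget" point (d_model = 4096 here is just an assumed hidden size, in the ballpark of a ~7B-parameter model):

```python
# Embedding + output-softmax parameters as a function of vocabulary size.
d_model = 4096                                   # assumed hidden size
for vocab_size in (32_000, 128_000, 256_000):
    embed = vocab_size * d_model                 # input embedding table
    unembed = vocab_size * d_model               # output projection (if untied)
    print(f"vocab {vocab_size:>7,}: ~{(embed + unembed) / 1e9:.2f}B params")
```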
> IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings.
Yeah, the size of the vocabulary varies a lot across models, but it isn't unusual to see significantly larger vocabularies these days (e.g., gemma has ~256k). However, these are still finite and very small compared to the corpus size.
> How could you possibly choose a meaningful δ from first principles?
This is a really great question, and something that we don't know how to answer. A lot of work has tried to answer it [6][7], but it is very much an open question.
[1]: https://arxiv.org/abs/2310.08754
[2]: https://aclanthology.org/2023.acl-long.284/
[3]: https://aclanthology.org/2024.emnlp-main.40/
[4]: https://arxiv.org/abs/2403.06265
[5]: https://aclanthology.org/2024.emnlp-main.649/
NB: Can't edit my original reply.
Sorry, I actually misread part of your comment in relation to the paper and confused δ with another parameter, K.
To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.
So, if you are asking why they would limit _K_, then my answer still applies (after swapping δ for K). But if you really mean "why do they pick some arbitrary δ as the limit on the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".
Thanks for these detailed replies! Now I really want to read your paper.
Thanks!
Our paper [1] is kind of a goofy adversarial thing where we thought "here's this cool metric, how can we break it?". The tokenizers we propose are definitely not tokenizers you should use in practice.
The original paper that proposes the metric is, imo, much more interesting theoretically [2].
Is this not kind of trivial, at least the way they've defined it? It seems very obviously NP-complete to me. Any of these text problems where you're trying to optimize over a whole corpus is not hard to see to be NP-complete, similar to longest common subsequence.
Longest common substring is linear time.
You may have misread the parent comment. Longest common substring is not the same type of problem as longest common subsequence.
For those who were wondering what this means:
common substring: contiguous
common subsequence: not necessarily contiguous but in order
Longest common subsequence can be solved in at most quadratic time via dynamic programming.
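For reference, the textbook two-sequence DP, which runs in O(len(a) * len(b)) time:

```python
def lcs_length(a: str, b: str) -> int:
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(lcs_length("tokeniser", "tokenizer"))  # 8 ("tokenier")
```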
Longest common subsequence for an arbitrary number of input sequences is NP-hard: https://en.wikipedia.org/wiki/Longest_common_subsequence
Some narrow versions of it could be optimally solved with DP, but when the constraints are lifted, you will probably have to pay for it with either exponential memory or time.
Same applies to the Knapsack problem. You can solve some variants of it with DP, but it won't generalize.
I wouldn't call the case with two sequences "some narrow version". It is by far the most common instance of the problem, and honestly the first thing that pops into my mind when I hear LCS. That's why I was very surprised to hear it was NP-hard, but it does make sense for an arbitrary number of sequences.
For a second I thought this was related to DeFi... and then I read the abstract.
I thought it was about compiler parsers. That would be surprising.