Meanings Are Tiktokens In Space

The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and explain why, ideally, someone out there finds a way to delete this stage entirely.
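To make the encode()/decode() pair concrete, here is a minimal sketch of byte-level Byte Pair Encoding on a toy string. The function names and the tiny corpus are my own for illustration; the GPT tokenizers follow the same idea with far larger training sets and vocabularies.

```python
# Minimal Byte Pair Encoding sketch: train merges on raw bytes, then
# encode() strings to token ids and decode() ids back to strings.
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of the adjacent `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))      # start from raw bytes (ids 0..255)
    merges = {}                           # (a, b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges

def encode(text, merges):
    """Apply the learned merges, in training order, to a new string."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Recursively expand merged tokens back to bytes, then to a string."""
    expand = {v: k for k, v in merges.items()}
    def flatten(i):
        if i in expand:
            a, b = expand[i]
            return flatten(a) + flatten(b)
        return bytes([i])
    return b"".join(flatten(i) for i in ids).decode("utf-8")
```

Because decode() simply inverts the merge table, decode(encode(s)) round-trips any string whose merges were learned this way, no matter how few merges were trained.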

Let’s build the GPT Tokenizer

It’s pretty wild how little discussion there’s been about this core component of these models. It’s as if this aspect of their development were a solved problem. Basically all NLP publications today take these BPE tokens as a starting point, and if tokenizers are mentioned at all, it’s in passing.

Since this data compression technique has no linguistic basis, and since the resulting tokens are mapped onto a latent space the way images are, it is at least as accurate to describe these models as large vision models as it is to describe them as large language models. I’d be hard pressed to come up with a behavior of these models that couldn’t be traced back to this foundational representation.

I think some kind of representation is necessary, so I’m not sure how you could eliminate this stage entirely, but I think the following items are worth considering: typed syntactic categories and atomic primitives, not directions in space; an expressive formalism, not cosine distance; cross-modal mappings grounded in reality, not whatever works best for knowledge panels; domain adaptation, not annotating nouns with bounding boxes; meanings as proofs, not clusters of semantic attributes in space.

I hope I’m wrong but all signs point to another ten years of word2vec for short-form video content (tiktokens!) and the Apple Vision Pro in rose gold.