Large Vision Models

The next big thing after LLMs ought to be LVMs or Large Vision Models. … Visual data is an order of magnitude larger than textual data. I can only imagine the resources that will be needed to capture, transmit, process and store this data and train these models. This will be a wave bigger than even the one created by LLMs. Maybe OpenAI is already on it and that’s what the trillions of dollars are needed for.


LLMs already are LVMs. That’s why they don’t work.

Part of language is parsing a speech signal into something meaningful with your ears. The other part is producing things that other people can parse. Text is incidental: it may be a transcription, or it may be carefully prepared.

These models rarely produce ungrammatical sentences, which shouldn’t be surprising given the excessive amounts of data they’ve seen, but they cannot be said to understand language. You cannot bootstrap a semantics this way. They do not, as some recent commentators have suggested, demonstrate linguistic competence.

Of course the next big architectural achievement in machine learning will do wonders with visual data and short-form video content. The Apple Vision Pro will be available for $5,499 in rose gold. What else is there to do with all this hardware?

Hopefully it’s clear by now that more data and more parameters aren’t going to work. At least not for language.