Anonymous
The next big thing after LLMs ought to be LVMs or Large Vision Models. … Visual data is an order of magnitude larger than textual data. I can only imagine the resources that will be needed to capture, transmit, process and store this data and train these models. This will be a wave bigger than even the one created by LLMs. Maybe OpenAI is already on it and that’s what the trillions of dollars are needed for.
LLMs already are LVMs. That’s why they don’t work.
Part of language is about parsing a speech signal into something meaningful with your ears. The other part is about creating things that other people can parse. Text is incidental. It may be a transcription or carefully prepared. You can’t annotate some nouns, map the representation into an embedding space of visual data and call it a day. I mean you can and if you’re an information retrieval company you might want to but c’mon.