Simply augmenting the text with bounding box information via additive positional encoding may not capture the intricate relationships between text semantics and spatial layout, especially for visually rich documents.
[DocLLM: A layout-aware generative language model for multimodal document understanding]
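For concreteness, the additive recipe being critiqued looks roughly like the sketch below. This is my own illustration of the generic pattern, not DocLLM’s code; the module and its argument names are made up.

```python
# A minimal sketch of the naive additive recipe: project each token's bounding
# box into the embedding dimension and simply add it to the token embedding,
# alongside the usual 1-D positional encoding. Names are illustrative only.
import torch
import torch.nn as nn

class NaiveLayoutEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.bbox = nn.Linear(4, d_model)  # (x0, y0, x1, y1), normalized to [0, 1]

    def forward(self, token_ids: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); bboxes: (batch, seq, 4)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Text, sequence position, and layout are all collapsed into one sum,
        # which is exactly the entanglement the quote is worried about.
        return self.tok(token_ids) + self.pos(positions) + self.bbox(bboxes)
```

Everything about where a token sits on the page gets folded into the same vector as what the token says, and the model is left to disentangle the two.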
Since it’s possible to have language (and spatial concepts) without having eyes, it’s a bit strange to assume that a model for reasoning about images should also work as a model of language. Maybe you can extend a model of language to reason about visual inputs, giving it a visual language, but I don’t think pushing in the other direction makes sense. So I implore you, in your cross-modal quest, to start from a system that is inherently multimodal, such as articulatory perception, before extending it to other modalities or claiming to have built a multimodal system. Surely RGB matrices are not a prerequisite for multimodal learning. Sight is a visual encoding into language, just as braille is a tactile one.
Let’s annotate the nouns with bounding boxes.
Sounds good.
Additionally, I don’t think the part-of-speech-as-bounding-box formalism works as a cross-modal interface. Why? Because it assumes that word embeddings work and are a good idea. While it’s true that many features of language are inherently spatial, e.g., motion predicates like run or spatial prepositions like above, this never leads to a place (bonus spatial language) where directional quantities make sense as a starting point for representing the primitives of language.
You can reliably put image embeddings to work for visuospatial reasoning. Directional, numeric magnitudes are an intuitive choice for describing optical systems. Modeling language in a latent space is a suitable choice for certain kinds of analyses and for describing some of the physical properties of a speech signal. But as a choice of primitives for language it is odd and merely convenient: it’s not intuitive and it doesn’t work. The assumption that you can somehow go directly from character spans to a semantics seems misguided.
This becomes painfully (or humorously) clear when one attempts to do things with the representation that it is not equipped to do, like pulling meaning out of it. Inferring semantic attributes from character spans that have been mapped to directional magnitudes in a latent space of arbitrary dimensionality, by virtue of their nearness to other directional magnitudes under a notion of similarity that is semantically vacuous, is not adequate for natural language inference (and natural language inference is more than entailment). I’m going to go out on a limb and suggest that, for language, types are a more appropriate formalism than directional magnitudes borrowed from physics for handling semantic content, and that proofs are a more constructive way to pull meaning out of something than “semantic similarity”, whatever that is.
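To make that contrast concrete, here is a toy sketch of the kind of thing I mean, a Montague-style fragment in which meanings are typed functions and a sentence’s truth value falls out of function application. The tiny domain, the predicates, and the names are all illustrative, not a proposal.

```python
# A toy typed fragment: meanings are functions with types, and composition is
# function application, not vector proximity. Purely illustrative.
from typing import Callable, Set

Entity = str
Pred = Callable[[Entity], bool]   # noun / intransitive-verb meanings
NP = Callable[[Pred], bool]       # quantified noun phrases take a predicate

DOMAIN: Set[Entity] = {"felix", "rex", "tweety"}
CATS: Set[Entity] = {"felix"}
SLEEPERS: Set[Entity] = {"felix", "rex"}

def cat(x: Entity) -> bool:
    return x in CATS

def sleeps(x: Entity) -> bool:
    return x in SLEEPERS

def every(restrictor: Pred) -> NP:
    # "every N" denotes the set of properties holding of every N in the domain
    return lambda scope: all(scope(x) for x in DOMAIN if restrictor(x))

# "Every cat sleeps": apply the subject meaning to the predicate meaning.
assert every(cat)(sleeps)
```

Nothing here is measured for nearness to anything else; whether “every cat sleeps” holds is something you check, and an entailment (from “every cat sleeps” and “felix is a cat” to “felix sleeps”) is something you can derive.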
You must be aware of one problem with Semantic Caches: sentences with opposite meanings might have high semantic textual similarity.
Anonymous
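You can watch this failure mode happen in a few lines. The sketch below assumes the sentence-transformers library and its all-MiniLM-L6-v2 checkpoint, which are my choices for illustration, not anything the quote endorses.

```python
# Sketch: measure cosine similarity between a sentence and its negation.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "The refund was approved."
b = "The refund was not approved."

emb = model.encode([a, b], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.3f}")
# Negation pairs like this tend to score high enough to collide in a
# similarity-keyed cache, even though they license opposite answers.
```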
The potential these models have for generating well-formed sentences and creating images from text can be impressive and in some cases may exceed what you thought was possible. However, it’s a bit of a parlor trick: the latent space masquerading as a semantics creates the illusion that you have been understood, and understanding you is not something it is equipped to do. With only directional magnitudes to stand on, it can only take shots in the dark. If it can’t find what you asked for, it will grab whatever is nearby and give you that. If it can find what you asked for but someone doesn’t want you to know what it found, it may come up empty-handed.
Why limit yourself to nouns and bounding boxes? Throw in some relative clauses and go for the fill tool!