Multimodal LLMs
Discussion centers on whether LLMs are inherently multimodal, how they process images via vision encoders feeding into the transformer, and how multimodal models differ from purely text-based ones.
Sample Comments
You don’t really feed images to the LLM itself; rather, to a vision model within the multimodal LLM.
Multimodal models do it already.
The LLM isn't multimodal; an LLM can only process textual tokens. What would those tokens be for pictures? The LLM gets fed a textual representation of what was optically recognized by another process. That's my understanding.
Most modern LLMs are multimodal.
Multimodal models can do more than predict next words.
Multimodal LLMs are absolutely LLMs; the language is just not human language.
Why not? We have multimodal models as well, not just pure text.
Pawel, this looks promising! Is it for text-based models only at this time (i.e. no vision/image training)?
Most SOTA LLMs are multimodal transformers.
Is there really a difference between a multimodal LLM vs. a text LLM + Stable Diffusion?
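Several of the comments above disagree about how images actually reach the language model. In the common design (a LLaVA-style architecture is assumed in the sketch below), a vision encoder turns the image into patch embeddings and a small projection layer maps those into the LLM's token-embedding space, so the transformer attends over image and text tokens in one sequence; no OCR or textual description is involved. All module names, sizes, and configuration values here are illustrative assumptions, not any particular model's real setup.

```python
# Minimal sketch of a multimodal LLM forward pass (LLaVA-style projection).
# All dimensions and modules are toy-scale assumptions for illustration.
import torch
import torch.nn as nn

VISION_DIM = 256   # assumed vision-encoder output width
LLM_DIM = 512      # assumed LLM embedding width
VOCAB = 1000       # assumed toy vocabulary size

class ToyVisionEncoder(nn.Module):
    """Stand-in vision encoder: image -> sequence of patch embeddings."""
    def __init__(self):
        super().__init__()
        # 16x16 patches over a 3-channel image, one embedding per patch
        self.patchify = nn.Conv2d(3, VISION_DIM, kernel_size=16, stride=16)

    def forward(self, image):                       # image: (B, 3, H, W)
        patches = self.patchify(image)              # (B, VISION_DIM, H/16, W/16)
        return patches.flatten(2).transpose(1, 2)   # (B, num_patches, VISION_DIM)

class ToyMultimodalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = ToyVisionEncoder()
        # The projector maps patch embeddings into the LLM's token-embedding
        # space, so image patches become "soft tokens" the LLM can attend to.
        self.projector = nn.Linear(VISION_DIM, LLM_DIM)
        self.tok_embed = nn.Embedding(VOCAB, LLM_DIM)
        # Bidirectional encoder used as a stand-in "LLM" for brevity;
        # a real LLM would be a causal decoder.
        layer = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8,
                                           batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, image, token_ids):
        img_tokens = self.projector(self.vision(image))  # (B, P, LLM_DIM)
        txt_tokens = self.tok_embed(token_ids)           # (B, T, LLM_DIM)
        # One sequence: image "tokens" followed by text tokens. The
        # transformer never sees pixels or OCR output, only embeddings.
        seq = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(seq))               # next-token logits

model = ToyMultimodalLM()
image = torch.randn(1, 3, 64, 64)                # dummy image
token_ids = torch.randint(0, VOCAB, (1, 12))     # dummy text prompt
logits = model(image, token_ids)
print(logits.shape)  # torch.Size([1, 28, 1000]): 16 patch tokens + 12 text tokens
```

This also bears on the last question above: in a text LLM + Stable Diffusion pipeline the two models communicate only through a text prompt, whereas a natively multimodal model shares a single embedding space end to end, so image content can influence attention directly rather than being squeezed through a caption.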