Multimodal LLMs
Discussion centers on whether LLMs are inherently multimodal, how they process images via vision encoders feeding into the transformer, and how multimodal models differ from purely text-based ones.
Sample Comments
You don’t really feed images to the LLM itself; rather, to a vision model within the multimodal LLM.
Multimodal models do it already.
The LLM isn't multimodal; an LLM can only process textual tokens. What would those tokens be for pictures? The LLM gets fed a textual representation of what was optically recognized by another process. That's my understanding.
Most modern LLMs are multimodal.
Multimodal models can do more than predict next words.
Multimodal LLMs are absolutely LLMs; the language is just not human language.
Why not? We have multimodal models as well, not just pure text.
Pawel, this looks promising! Is it for text-based models only at this time (i.e. no vision/image training)?
Most SOTA LLMs are multimodal transformers.
Is there really a difference between a multimodal LLM vs. a text LLM + Stable Diffusion?
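Several of the comments above disagree about how images actually reach the language model. In the common design (a LLaVA-style architecture is assumed in the sketch below), a vision encoder turns the image into patch embeddings and a small projection layer maps those into the LLM's token-embedding space, so the transformer attends over image and text tokens in one sequence; no OCR or textual description is involved. All module names, sizes, and configuration values here are illustrative assumptions, not any particular model's real setup.

```python
# Minimal sketch of a multimodal LLM forward pass (LLaVA-style projection).
# All dimensions and modules are toy-scale assumptions for illustration.
import torch
import torch.nn as nn

VISION_DIM = 256   # assumed vision-encoder output width
LLM_DIM = 512      # assumed LLM embedding width
VOCAB = 1000       # assumed toy vocabulary size

class ToyVisionEncoder(nn.Module):
    """Stand-in vision encoder: image -> sequence of patch embeddings."""
    def __init__(self):
        super().__init__()
        # 16x16 patches over a 3-channel image, one embedding per patch
        self.patchify = nn.Conv2d(3, VISION_DIM, kernel_size=16, stride=16)

    def forward(self, image):                       # image: (B, 3, H, W)
        patches = self.patchify(image)              # (B, VISION_DIM, H/16, W/16)
        return patches.flatten(2).transpose(1, 2)   # (B, num_patches, VISION_DIM)

class ToyMultimodalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = ToyVisionEncoder()
        # The projector maps patch embeddings into the LLM's token-embedding
        # space, so image patches become "soft tokens" the LLM can attend to.
        self.projector = nn.Linear(VISION_DIM, LLM_DIM)
        self.tok_embed = nn.Embedding(VOCAB, LLM_DIM)
        # Bidirectional encoder used as a stand-in "LLM" for brevity;
        # a real LLM would be a causal decoder.
        layer = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8,
                                           batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, image, token_ids):
        img_tokens = self.projector(self.vision(image))  # (B, P, LLM_DIM)
        txt_tokens = self.tok_embed(token_ids)           # (B, T, LLM_DIM)
        # One sequence: image "tokens" followed by text tokens. The
        # transformer never sees pixels or OCR output, only embeddings.
        seq = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.llm(seq))               # next-token logits

model = ToyMultimodalLM()
image = torch.randn(1, 3, 64, 64)                # dummy image
token_ids = torch.randint(0, VOCAB, (1, 12))     # dummy text prompt
logits = model(image, token_ids)
print(logits.shape)  # torch.Size([1, 28, 1000]): 16 patch tokens + 12 text tokens
```

This also bears on the last question above: in a text LLM + Stable Diffusion pipeline the two models communicate only through a text prompt, whereas a natively multimodal model shares a single embedding space end to end, so image content can influence attention directly rather than being squeezed through a caption.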