Multimodal LLMs

Discussions center on whether LLMs are multimodal, how they process images via vision models or transformers, and distinctions from pure text-based models.

➡️ Stable 1.0x · AI & Machine Learning

Comments: 2,588
Years Active: 14
Top Authors: 5
Topic ID: #209

Activity Over Time

2010: 1
2014: 7
2015: 4
2016: 12
2017: 15
2018: 16
2019: 11
2020: 28
2021: 70
2022: 214
2023: 693
2024: 649
2025: 842
2026: 26

Keywords

LLM, COCO, DeepSeek, CLIP, ControlNets, HiDream, multimodal LLM, image models, text LLMs, diffusion model, images

Sample Comments

m3kw9 Feb 7, 2025

You don’t really feed images to LLMs, rather to a vision model within the multi modal llm

nothrowaways Sep 25, 2025

Multimodal models do it already.

2-3-7-43-1807 Feb 8, 2025

the llm isn't multimodal. an llm can only process textual tokens. what should those tokens be for pictures. the llm gets fed a textual representation of what was optically recognized by another process. that's my understanding.

kadushka Mar 15, 2025

Most modern LLMs are multimodal.

hnuser123456 Oct 18, 2024

The multimodal models can do more than predict next words.

jnwatson May 16, 2025

Multimodal LLMs are absolutely LLMs, the language is just not human language.

visarga Dec 27, 2023

Why not? We have multi-modal models as well. Not pure text.

ctbellmar Oct 14, 2025

Pawel, this looks promising! Is it for text based models only at this time (i.e. no vision/image training)?

torginus Oct 28, 2025

Most SOTA LLMs are multimodal transformers.

mupuff1234 Jun 26, 2023

Is there really a difference between a multimodal vs a text LLM + stable diffusion?
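The comments above disagree on whether images are "fed to the LLM" or to a separate vision model. A common resolution (used by LLaVA-style architectures, for example) is that a vision encoder such as CLIP turns image patches into embeddings, a learned projection maps those into the LLM's embedding space, and the transformer then attends over image and text embeddings in one sequence. The sketch below illustrates only the data flow with toy dimensions and random weights; the shapes and names are hypothetical, not any specific model's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real models use e.g. 1024-d vision
# features and 4096-d LLM embeddings).
num_patches, vision_dim, llm_dim = 16, 64, 128

# 1. A vision encoder (e.g. CLIP's ViT) turns an image into one
#    embedding per patch. Stand-in: random features.
patch_embeddings = rng.normal(size=(num_patches, vision_dim))

# 2. A learned linear projection maps patch features into the LLM's
#    embedding space, so each patch becomes a "soft token".
projection = rng.normal(size=(vision_dim, llm_dim))
image_tokens = patch_embeddings @ projection        # shape (16, 128)

# 3. The text prompt is tokenized and embedded as usual.
#    Stand-in: 5 random text-token embeddings.
text_tokens = rng.normal(size=(5, llm_dim))

# 4. The transformer consumes the concatenated sequence. It never sees
#    pixels or "textual descriptions" of the image, only vectors that
#    live in the same space as its text embeddings.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)
```

In this view, the comment that "an LLM can only process textual tokens" is half right: the transformer processes embedding vectors, and the vision encoder's job is to produce vectors it can treat like tokens. This is also what separates a multimodal LLM from a text LLM paired with a diffusion model: in the former, image and text share one attention context rather than communicating through generated text.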