KOSMOS-2: Grounding Multimodal Large Language Models to the World

KOSMOS-2 extends the KOSMOS-1 architecture by training on grounded image-text pairs, in which discrete location tokens are linked to text spans. This anchors text to specific image regions, improving multimodal understanding and reference accuracy.
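
To make the idea of location tokens linked to text spans concrete, here is a minimal sketch of one way a bounding box could be turned into discrete tokens and attached to a phrase. It assumes the image is divided into a square grid and that a box is encoded by the grid cells containing its top-left and bottom-right corners. The grid size, the `<loc_N>`, `<p>`, and `<box>` token spellings, and the function names are illustrative assumptions, not details confirmed by this page.

```python
def box_to_location_tokens(box, grid_size=32):
    """Quantize a normalized (x0, y0, x1, y1) box into two discrete tokens.

    Assumption: coordinates lie in [0, 1] and the image is split into a
    grid_size x grid_size grid whose cells are numbered row by row.
    """
    x0, y0, x1, y1 = box

    def cell_index(x, y):
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        return row * grid_size + col

    top_left = cell_index(x0, y0)
    bottom_right = cell_index(x1, y1)
    # Hypothetical token spelling, chosen for illustration only.
    return f"<loc_{top_left}>", f"<loc_{bottom_right}>"


def ground_span(text_span, box, grid_size=32):
    """Attach location tokens to a text span (hypothetical markup)."""
    top_left, bottom_right = box_to_location_tokens(box, grid_size)
    return f"<p>{text_span}</p><box>{top_left}{bottom_right}</box>"


if __name__ == "__main__":
    # A box covering the lower-middle part of the image on a 4 x 4 grid.
    print(ground_span("a snowman", (0.25, 0.5, 0.75, 1.0), grid_size=4))
```

Because the location tokens are ordinary vocabulary entries, a grounded caption built this way can be fed to a language model as a single token sequence, which is what lets the text span and the image region be learned jointly.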

