KOSMOS-2: Grounding Multimodal Large Language Models to the World
KOSMOS-2 extends the KOSMOS-1 architecture by training on grounded image-text pairs, in which discrete location tokens are linked to text spans. This anchors text to specific image regions, improving multimodal understanding and referential accuracy.
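As a rough sketch of the grounding scheme, a bounding box in normalized image coordinates can be quantized onto a fixed grid, and the resulting cell indices emitted as discrete tokens attached to a text span. The token names (`<loc_…>`, `<p>`, `<box>`) and the 32×32 grid size below are illustrative assumptions, not the exact KOSMOS-2 vocabulary:

```python
def box_to_location_tokens(x1, y1, x2, y2, num_bins=32):
    """Quantize a normalized bounding box (corners in [0, 1]) into two
    discrete location tokens: top-left and bottom-right grid cells.
    Grid size and token format are illustrative assumptions."""
    def cell_index(x, y):
        # Clamp to the last bin so coordinates of exactly 1.0 stay in range.
        col = min(int(x * num_bins), num_bins - 1)
        row = min(int(y * num_bins), num_bins - 1)
        return row * num_bins + col

    return f"<loc_{cell_index(x1, y1)}>", f"<loc_{cell_index(x2, y2)}>"


# Anchor the phrase "a dog" to the lower-left quadrant of the image.
tl, br = box_to_location_tokens(0.0, 0.5, 0.5, 1.0)
grounded = f"<p>a dog</p><box>{tl}{br}</box>"
print(grounded)  # → <p>a dog</p><box><loc_512><loc_1008></box>
```

During training, sequences like `grounded` are interleaved with ordinary text so the model learns to predict location tokens alongside words, tying spans to regions.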