Project Awesome

KOSMOS-2: Grounding Multimodal Large Language Models to the World

KOSMOS-2 extends the KOSMOS-1 architecture by training on grounded image-text pairs, representing image regions as discrete location tokens linked to text spans. This anchors text to specific regions of the image, improving multimodal understanding and referring accuracy.
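To make the location-token idea concrete, here is a minimal sketch of the discretization step: a normalized bounding box is mapped onto a fixed grid of image patches, and the top-left and bottom-right corner patches become two discrete tokens that can sit inline in the text sequence. The grid size and the `<patch_index_NNNN>` token naming are assumptions for illustration, not the exact official vocabulary.

```python
def box_to_location_tokens(x1, y1, x2, y2, bins=32):
    """Encode a normalized bounding box as two discrete location tokens.

    Illustrative sketch of the KOSMOS-2-style scheme: the image is split
    into a bins x bins grid, and a box is represented by the patch indices
    of its top-left and bottom-right corners. Coordinates are in [0, 1].
    Token format is a hypothetical example.
    """
    def to_index(x, y):
        # Clamp so a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        return row * bins + col

    top_left = to_index(x1, y1)
    bottom_right = to_index(x2, y2)
    return f"<patch_index_{top_left:04d}>", f"<patch_index_{bottom_right:04d}>"


# A box covering the whole image maps to the first and last patch tokens.
print(box_to_location_tokens(0.0, 0.0, 1.0, 1.0))
```

Because each token is just an index into a small fixed vocabulary (here 32 x 32 = 1024 entries), the language model can emit region references the same way it emits ordinary words.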
