
VLM Architectures

Vision Language Model architectures.


Contents

LLaVA: Large Language and Vision Assistant - Visual Instruction Tuning

PaLI: A Jointly-Scaled Multilingual Language-Image Model

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

NVLM: Open Frontier-Class Multimodal LLMs

OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

Pixtral 12B: A Cutting-Edge Open Multimodal Language Model

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Llama 3.2-Vision: Enhanced Multimodal Capabilities Built on Llama 3

SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model

SmolVLM

SmolVLM builds upon the architecture of Idefics3, with a similar implementation in transformers but with key differences to enhance efficiency. It replaces the Llama 3.1 8B language backbone with the smaller SmolLM2 1.7B model and applies a more aggressive image compression: a pixel shuffle strategy reduces visual information by a factor of 9 (compared to 4x in Idefics3). This allows it to encode 384x384 image patches, using a shape-optimized SigLIP vision backbone with 14x14 inner patches.

The model's memory usage is markedly lower than that of other VLMs in transformers, enabling efficient on-device inference. For instance, encoding a single image and prompt requires only 1.2k tokens, significantly fewer than models like Qwen2-VL. This efficiency translates to faster prefill and generation throughput.

SmolVLM achieves strong performance on benchmarks such as MMMU, MathVista, MMStar, DocVQA, and TextVQA, and shows promising results in basic video analysis thanks to its long-context capabilities. Training involved extending SmolLM2's context window to 16k tokens using techniques like RoPE base-value adjustment and fine-tuning on a mixture of long- and short-context datasets. The VLM training used a curated dataset largely based on The Cauldron and Docmatix, with checkpoint selection based on a weighted metric across multiple vision-language benchmarks.

The model is integrated with VLMEvalKit for easy evaluation and can be readily used and fine-tuned with the transformers library. TRL integration allows applying Direct Preference Optimization (DPO), and a notebook is provided for fine-tuning on VQAv2, with options for LoRA, QLoRA, or full fine-tuning, even within the constraints of consumer GPUs.
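The 9x token reduction comes from a pixel shuffle (space-to-depth) step: each r x r neighborhood of visual patch embeddings is merged into a single, wider token. A minimal sketch in plain Python; the function name and toy grid sizes are illustrative, not SmolVLM's actual implementation:

```python
def pixel_shuffle(patches, r=3):
    # Space-to-depth shuffle: merge each r x r neighborhood of patch
    # embeddings into one token whose dimension grows by r * r, cutting
    # the token count by r * r (9x for r = 3, as described above).
    h, w = len(patches), len(patches[0])
    assert h % r == 0 and w % r == 0, "grid must be divisible by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(patches[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out

# Toy 6x6 grid of 4-dim patch embeddings (real grids are much larger).
grid = [[[float(i), float(j), 0.0, 1.0] for j in range(6)] for i in range(6)]
shuffled = pixel_shuffle(grid, r=3)
print(len(grid) * len(grid[0]))          # 36 tokens before
print(len(shuffled) * len(shuffled[0]))  # 4 tokens after: 9x fewer
print(len(shuffled[0][0]))               # 36 dims per token: 9x wider
```

The trade-off is fewer, higher-dimensional visual tokens for the LLM to attend over, which is what drives the low per-image token counts quoted above.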

Idefics3-8B: Building and Better Understanding Vision-Language Models

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

DeepSeek-VL: Towards Real-World Vision-Language Understanding

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Qwen2-VL: A Powerful Open-Source Vision-Language Model for Image and Video Understanding

Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series

Moondream-next: Compact Vision-Language Model with Enhanced Capabilities

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

BLIP: Bootstrapping Language-Image Pre-training

Parrot: Multilingual Visual Instruction Tuning

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

GLaMM: Pixel Grounding Large Multimodal Model

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

COSMO

COSMO is distinctive for an architecture that merges a visual encoder, the Vision Transformer (ViT) from Open-CLIP, with a partitioned Large Language Model (LLM). The LLM is divided into segments dedicated to unimodal text processing and to multimodal data handling, streamlining the processing of interleaved data sequences. An additional contrastive loss component improves performance on both classification and generation tasks.

Training combines a language-modeling loss with this contrastive loss, focusing on efficient handling of interleaved text and visual sequences. Optimization uses AdamW with a cosine learning-rate schedule and DeepSpeed fp16 precision, distributed across 128 NVIDIA V100 GPUs. Partitioning the LLM into dedicated components keeps computation efficient over long interleaved sequences.

For alignment, a learnable query attends globally across all tokens, and an additional query feeds the Text Fusion Layers, improving the model's handling of token sets and strengthening image-text alignment through the contrastive loss. Gated cross-attention layers perform multimodal fusion with significantly fewer learnable parameters by introducing bottlenecks in the input and output feature channels; this lightweight fusion integrates visual information for precise next-token prediction.

COSMO's training leverages a diverse array of datasets including CC3M, SBU, LAION400M, DataComp1B, MMC4, WebVid, and Howto-Interlink7M. The introduction of Howto-Interlink7M in particular improves video-language understanding through high-quality annotated captions, with effectiveness demonstrated across 14 diverse downstream tasks.
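The joint objective described above can be sketched as a language-modeling loss plus a weighted symmetric contrastive (InfoNCE) term over matched image-text embedding pairs. This is a minimal plain-Python sketch; the function names, weight, and temperature are illustrative assumptions, not COSMO's exact formulation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Symmetric InfoNCE: each image should score highest against its own
    # caption (and vice versa) among all pairs in the batch.
    n = len(image_embs)
    total = 0.0
    for embs_a, embs_b in ((image_embs, text_embs), (text_embs, image_embs)):
        for i in range(n):
            logits = [dot(embs_a[i], b) / temperature for b in embs_b]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax of the matched pair
    return total / (2 * n)

def joint_loss(lm_loss, image_embs, text_embs, weight=1.0):
    # Total objective: next-token language-modeling loss plus a weighted
    # contrastive alignment term.
    return lm_loss + weight * contrastive_loss(image_embs, text_embs)

aligned = joint_loss(2.0, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = joint_loss(2.0, [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True: aligned pairs incur a lower total loss
```

The contrastive term pushes matched image-text pairs together and mismatched pairs apart, which is what the learnable queries and Text Fusion Layers above are optimizing against.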

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

MoE-LLaVA

MoE-LLaVA integrates a Mixture of Experts (MoE) into a large vision-language model. The design is sparse: learnable routers direct each token toward a selection of experts, and only the top-k experts are activated for any given token. This improves efficiency while letting specialized processing paths handle different types of information.

The architecture comprises a vision encoder, a visual-projection MLP layer, word embedding layers, multi-head self-attention blocks, feed-forward networks, and the MoE blocks themselves, integrated through layer normalization and residual connections into a robust framework for deep multimodal understanding.

Training is structured in three stages: first, adapting image tokens; second, training all LLM parameters except the vision encoder; and third, training the MoE layers, initialized with weights from the previous stages. For alignment and fusion, the learnable routers dynamically allocate tokens to the most suitable experts, whose outputs are processed through the combination of LLM and MoE blocks, yielding a nuanced understanding of multimodal inputs.

The datasets employed across the training stages, from LLaVA-PT for pretraining to Hybrid-FT for multimodal instruction tuning and LLaVA-FT for fine-tuning the MoE layers, tune the model's capabilities across a broad spectrum of multimodal tasks.
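The sparse top-k routing described above can be sketched in a few lines: a learned linear router scores every expert, only the k best run, and their outputs are mixed by softmax-renormalized gate values. A minimal plain-Python sketch with toy experts; the names and shapes are illustrative, not MoE-LLaVA's implementation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_forward(token, router_weights, experts, k=2):
    # Sparse MoE dispatch: score every expert with a learned linear router,
    # keep only the top-k, renormalize their scores with a softmax, and mix
    # the selected experts' outputs by those gate values.
    logits = [dot(token, w) for w in router_weights]
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i] - max(logits)) for i in topk]
    z = sum(exps)
    gates = [e / z for e in exps]
    out = [0.0] * len(token)
    for gate, idx in zip(gates, topk):
        expert_out = experts[idx](token)  # only the top-k experts run
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, topk

# Four toy "experts" that just scale the token; a real expert is an FFN.
experts = [lambda t, s=s: [x * s for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[2.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
out, chosen = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print(chosen)  # [2, 0]: experts 2 and 0 have the highest router scores
```

Because only k of the experts execute per token, total parameters can grow with the number of experts while per-token compute stays roughly constant, which is the efficiency argument made above.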

BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding

Ferret: Refer and Ground Anything Anywhere at Any Granularity

OtterHD: A High-Resolution Multi-modality Model

GLIP: Grounded Language-Image Pre-training

GLIP

GLIP unifies object detection and phrase grounding by reformulating detection as a phrase-grounding problem. This reformulation lets the model exploit extensive image-text paired datasets for pre-training, equipping it with object-level precision, language awareness, and semantically rich visual representations, and deepens its understanding of complex visual scenes in conjunction with textual prompts.

The architecture comprises a visual encoder, either a Convolutional Neural Network (CNN) or a Transformer, that extracts features from regions or bounding boxes within images; a language encoder for processing text prompts; and prediction heads (a box classifier and a box regressor) trained with classification and localization losses. A distinctive feature is deep fusion between image and text in the later stages of encoding, merging visual and textual information more comprehensively than traditional late-fusion methods.

Training employs a unified formulation that folds the detection and grounding tasks into a single workflow. The model is trained end-to-end, optimizing losses defined for both detection (localization and classification) and grounding (alignment scores between image regions and the corresponding words in the prompt). This deep integration of visual and language features during training lets the model learn effectively from paired image-text data.

The training datasets (COCO, OpenImages, Objects365, Visual Genome, Flickr30k-entities, LVIS, and PhraseCut) cover a wide array of object classes and scenarios, each serving a purpose from object detection and phrase grounding to instance segmentation and referring-expression segmentation. Through this comprehensive training, GLIP demonstrates advanced capabilities in interpreting and interacting with both visual and textual data.
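The core of "detection as grounding" is that a region's classification logits are its alignment scores against the words of the text prompt, rather than outputs of a fixed class head. A minimal plain-Python sketch with toy features; the function name and vectors are illustrative assumptions, not GLIP's implementation:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def grounding_logits(region_feats, word_feats):
    # Detection as phrase grounding: each candidate region's classification
    # logits are its dot products with every word of the prompt, so the
    # label set is defined by the text rather than a fixed class head.
    return [[dot(r, w) for w in word_feats] for r in region_feats]

# Prompt "person. bicycle." encoded as toy word features; two candidate
# regions whose features happen to align with one word each.
words = ["person", "bicycle"]
word_feats = [[1.0, 0.0], [0.0, 1.0]]
region_feats = [[0.9, 0.1], [0.2, 0.8]]
scores = grounding_logits(region_feats, word_feats)
for region, row in enumerate(scores):
    best = max(range(len(words)), key=lambda j: row[j])
    print(f"region {region} -> {words[best]}")
```

Because the "classes" are just words in the prompt, the same head handles closed-set detection, open-vocabulary detection, and free-form phrase grounding.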

ImageBind: One Embedding Space To Bind Them All

ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale