VLM Architectures

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia

ARIA: An Open Multimodal Native Mixture-of-Experts Model

GitHub 1.1k updated 1y ago

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

EVEv2 368 updated 10mo ago

EVEv2 is an encoder-free vision-language model (VLM) that addresses limitations of earlier encoder-free approaches with a "Divide-and-Conquer" architecture, designed to improve scaling efficiency, reduce inter-modality interference, and achieve strong performance with better data efficiency.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

GitHub 17.7k updated 1y ago

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

LLaVA-CoT 2.1k updated 5mo ago

LLaVA-CoT is a vision-language model (VLM) designed to perform autonomous, multi-stage reasoning. It tackles complex visual question-answering tasks by working through sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

LLM2CLIP 643 updated 3mo ago

Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu

Maya: An Instruction Finetuned Multilingual Multimodal Model

GitHub 125 updated 9mo ago

MiniMax-01: Scaling Foundation Models with Lightning Attention

MiniMax-01 3.4k updated 10mo ago

MiniMax-01 is a series of large foundation models, including MiniMax-Text-01 and MiniMax-VL-01, that achieve performance comparable to top-tier models (like GPT-4o and Claude-3.5-Sonnet) while offering significantly longer context windows (up to 4 million tokens). The series achieves this through an architecture that combines lightning attention (an efficient linear attention variant), Mixture of Experts (MoE), and optimized training and inference frameworks.

NVLM: Open Frontier-Class Multimodal LLMs

GitHub 16.2k updated 28d ago

NVIDIA/Megatron-LM/tree/NVLM-1.0

OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

GitHub 8.0k updated 1mo ago

Pixtral 12B: A Cutting-Edge Open Multimodal Language Model

Mistral AI Science Team

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Sa2VA 1.6k updated 2mo ago

Sa2VA is a unified model for dense grounded understanding of both images and videos, integrating the SAM-2 video segmentation model with the LLaVA vision-language model. It supports a wide array of image and video tasks, like referring segmentation and conversation, by treating all inputs (text, images, videos) as tokens in a shared LLM space, generating instruction tokens that guide SAM-2 for precise mask production.

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

GitHub 532 updated 9mo ago

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

GitHub 10.2k updated 4mo ago

bytedance/UI-TARS

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

VideoChat-Flash 509 updated 6mo ago

VideoChat-Flash is a system for handling long-form video content in multimodal large language models (MLLMs). It introduces a Hierarchical visual token Compression (HiCo) method to reduce computational load while preserving essential details, and uses a multi-stage learning approach with a new long-video dataset (LongVid) to achieve state-of-the-art performance on both long and short video benchmarks.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

GitHub 1.1k updated 9mo ago

Llama 3.2-Vision: Enhanced Multimodal Capabilities Built on Llama 3

GitHub 7.5k updated 3mo ago

SmolVLM: A Small, Efficient, and Open-Source Vision-Language Model

GitHub 3.7k updated 4mo ago

SmolVLM is a 2B parameter vision-language model (VLM) that achieves state-of-the-art performance for its memory footprint, offering a small, fast, and memory-efficient solution for multimodal tasks. It is fully open-source, with all model checkpoints, datasets, training recipes, and tools released under the Apache 2.0 license, enabling local deployment, reduced inference costs, and user customization.

Idefics3-8B: Building and Better Understanding Vision-Language Models

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

InternLM-XComposer-2.5 2.9k updated 1y ago

InternLM-XComposer-2.5 is a versatile Large Vision Language Model (LVLM) designed to handle long-contextual input and output across text-image comprehension and composition tasks. It achieves performance comparable to GPT-4V with a significantly smaller 7B LLM backend.

InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

GitHub 9.9k updated 8mo ago

DeepSeek-VL: Towards Real-World Vision-Language Understanding

DeepSeek-VL 4.1k updated 2y ago

DeepSeek-VL employs a hybrid vision encoder that fuses a SigLIP-L encoder for semantic understanding with a SAM-B encoder for high-resolution detail extraction. This allows efficient processing of 1024x1024 images while capturing both global and fine-grained visual features. A two-layer hybrid MLP adapter then integrates these features with the DeepSeek LLM backbone.

The model is pre-trained on a diverse dataset covering web screenshots, PDFs, OCR, charts, and knowledge-based content from sources like Common Crawl, Web Code, E-books, and arXiv articles. Pretraining is followed by tuning on a curated instruction-tuning dataset based on real user scenarios, organized into a taxonomy covering recognition, conversion, analysis, reasoning, evaluation, and safety tasks. By combining this data with its architecture and fusion strategy, DeepSeek-VL aims to deliver robust performance across a wide range of real-world vision-language applications.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

DeepSeek-VL2 5.3k updated 1y ago

DeepSeek-VL2 is a series of large Mixture-of-Experts (MoE) vision-language models that improves on its predecessor, DeepSeek-VL, by adding a dynamic tiling vision encoding strategy for high-resolution images and by using DeepSeekMoE models with Multi-head Latent Attention for efficient inference. It is trained on a large vision-language dataset and shows strong performance across the tasks it is evaluated on.

MANTIS: Mastering Multi-Image Understanding Through Interleaved Instruction Tuning

Mantis 239 updated 4mo ago

Mantis is a family of open-source large multimodal models that demonstrate state-of-the-art performance on multi-image visual language tasks.

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

GitHub 6.6k updated 1y ago

Qwen2-VL: A Powerful Open-Source Vision-Language Model for Image and Video Understanding

GitHub 18.8k updated 3mo ago

Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series

GitHub 19.1k updated 3mo ago

Qwen Team

Moondream-next: Compact Vision-Language Model with Enhanced Capabilities

GitHub 9.5k updated 6mo ago

Moondream

GitHub 2.8k updated 1y ago

BLIP: Bootstrapping Language-Image Pre-training

BLIP 5.7k (archived)

BLIP introduces a unified approach to vision-language understanding and generation through its Multimodal Mixture of Encoder-Decoder (MED) architecture. The architecture can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder, which lets BLIP handle a wide range of vision-language tasks. MED combines a Visual Transformer to encode images, a BERT-based text encoder, additional cross-attention layers for image-text interaction, and causal self-attention layers for generating text from image inputs. Together these components support three functions: encoding either modality on its own, encoding text grounded in images, and decoding text from images.

BLIP's training jointly optimizes three pre-training objectives: Image-Text Contrastive Learning (ITC), Image-Text Matching (ITM), and Image-Conditioned Language Modeling (LM). These objectives respectively align visual and textual features, learn fine-grained image-text alignment, and enable text generation from images. The model trains on a mix of human-annotated and noisy web-collected image-text pairs, balancing the precision of manual annotation with the scale and diversity of web data.

For alignment and fusion, BLIP uses the ITC and ITM losses to align text and images. The cross-attention layers inject visual information into the text encoder for image-grounded text encoding, while modified self-attention layers in the decoder support text generation. Pre-training draws on COCO, Visual Genome, Conceptual Captions, Conceptual 12M, SBU Captions, and LAION, combining high-quality human-annotated pairs with large web datasets.

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

InstructBLIP 11.2k updated 1y ago

InstructBLIP extends the BLIP-2 framework by applying instruction tuning to its Query Transformer (Q-Former), enabling the model to extract instruction-aware visual features and achieve state-of-the-art zero-shot performance across diverse vision-language tasks.

KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models

KOSMOS-2: Grounding Multimodal Large Language Models to the World

KOSMOS-2 22.1k updated 4mo ago

Extending the KOSMOS-1 architecture, KOSMOS-2 incorporates grounded image-text pairs using discrete location tokens linked to text spans. This anchors text to specific image regions and improves multimodal understanding and reference accuracy.
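The idea of linking text spans to discrete location tokens can be illustrated with a minimal sketch. The grid size of 32, the `<loc_N>` token naming, and the `<phrase>`/`<box>` wrapper tags are assumptions chosen for illustration, not a confirmed description of the KOSMOS-2 vocabulary.

```python
def box_to_location_tokens(box, grid=32):
    """Map a normalized (x0, y0, x1, y1) box to two discrete location tokens.

    The image is divided into a grid x grid lattice (grid size is an
    assumption). The top-left and bottom-right corners are each replaced
    by the index of the cell they fall into.
    """
    x0, y0, x1, y1 = box

    def cell(x, y):
        # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return row * grid + col

    return [f"<loc_{cell(x0, y0)}>", f"<loc_{cell(x1, y1)}>"]


def ground_span(text_span, box, grid=32):
    """Attach location tokens to a text span (tag names are hypothetical)."""
    tokens = "".join(box_to_location_tokens(box, grid))
    return f"<phrase>{text_span}</phrase><box>{tokens}</box>"


print(ground_span("a dog", (0.25, 0.5, 0.75, 1.0)))
# <phrase>a dog</phrase><box><loc_520><loc_1016></box>
```

Because the box is reduced to two token ids, a grounded caption can be fed to the language model as an ordinary token sequence.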

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

GitHub 127 updated 1y ago

Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng

Parrot: Multilingual Visual Instruction Tuning

GitHub 77 updated 11mo ago

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

GitHub 1.3k updated 7mo ago

INF-LLaVA: High-Resolution Image Perception for Multimodal Large Language Models

INF-LLaVA 42 updated 1y ago

VILA²: VILA Augmented VILA

VILA²

VILA² (VILA-augmented-VILA) addresses the limits of data quantity and quality in training Visual Language Models (VLMs). Instead of relying on costly human annotation or distillation from proprietary models, VILA² uses the VLM itself to iteratively refine and augment its pretraining data, leading to significant performance improvements and state-of-the-art results on the MMMU leaderboard among open-source models.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

MiniCPM-V 24.4k updated 29d ago

MiniCPM-V is a series of efficient Multimodal Large Language Models (MLLMs) designed for deployment on end-side devices like mobile phones and personal computers. The iteration described here, MiniCPM-Llama3-V 2.5, achieves performance comparable to GPT-4V, Gemini Pro, and Claude 3 while being significantly smaller and more efficient, demonstrating that powerful MLLMs can be deployed on resource-constrained devices.

MiniCPM-o-2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming

GitHub 24.4k updated 29d ago

OpenBMB

LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-OneVision

LLaVA-OneVision is a family of open large multimodal models (LMMs) built for single-image, multi-image, and video understanding. It consolidates insights from the LLaVA-NeXT blog series on data, models, and visual representations. Notably, LLaVA-OneVision demonstrates strong transfer learning: knowledge learned from image data carries over to video understanding tasks.

VITA: Towards Open-Source Interactive Omni Multimodal LLM

VITA 2.5k updated 1y ago

VITA is presented as the first open-source Multimodal Large Language Model (MLLM) capable of simultaneously processing and analyzing video, image, text, and audio while offering an advanced multimodal interactive experience. It addresses a limitation of existing open-source models, which often excel in either understanding or interaction but rarely both, by combining architectural changes with dedicated training and development strategies.

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

GitHub 936 updated 7mo ago

MULTIINSTRUCT 134 updated 2y ago

MultiInstruct builds on the OFA model, a Transformer-based sequence-to-sequence architecture, and applies instruction tuning on a diverse dataset, aligning text and image tokens within a unified space to improve multi-modal zero-shot learning.

MouSi: Poly-Visual-Expert Vision-Language Models

MouSi 75 updated 2y ago

MouSi incorporates multiple visual experts, such as CLIP and SAM, and uses a poly-expert fusion network to combine their outputs and interface with LLMs like Vicuna, enabling a more comprehensive understanding of visual information.

LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

GitHub 523 updated 2y ago

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

GitHub 1.3k updated 3mo ago

GLaMM: Pixel Grounding Large Multimodal Model

GitHub 950 updated 9mo ago

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

COSMO

COSMO merges a visual encoder, the Vision Transformer (ViT) from Open-CLIP, with a partitioned Large Language Model (LLM). The LLM is divided into segments dedicated to unimodal text processing and to multimodal data handling, with the aim of streamlining the processing of interleaved data sequences. An additional contrastive loss is introduced to improve performance on both classification and generation tasks.

Training combines a language modeling loss with the contrastive loss, focusing on interleaved text and visual sequences. It uses the AdamW optimizer, a cosine learning rate schedule, and DeepSpeed fp16 precision, distributed across 128 NVIDIA V100 GPUs.

The alignment design uses a learnable query that attends globally across all tokens, alongside an additional query for the Text Fusion Layers, to improve the model's understanding of token sets and to strengthen image-text alignment through the contrastive loss. The gated cross-attention layers used for multimodal fusion reduce the number of learnable parameters by introducing bottlenecks in the input and output feature channels. This lightweight fusion integrates visual information for next-token prediction.

COSMO's training data includes CC3M, SBU, LAION400M, DataComp1B, MMC4, WebVid, and Howto-Interlink7M. Howto-Interlink7M, introduced with the model, targets video-language understanding through high-quality annotated captions, and the model is evaluated across 14 diverse downstream tasks.

GitHub 135 updated 1y ago

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

MoE-LLaVA 2.3k updated 10mo ago

MoE-LLaVA introduces a novel approach by incorporating Mixture of Experts (MoE) within a large vision-language model, using learnable routers to selectively activate expert modules for processing specific tokens, thereby enhancing efficiency and enabling nuanced understanding of multimodal inputs.
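A minimal sketch of per-token expert routing follows. It assumes a top-k router with softmax weights renormalized over the selected experts; the value `k=2`, the function names, and the practice of passing router logits in directly (rather than computing them with a learned linear layer) are illustrative assumptions, not details taken from MoE-LLaVA.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(token, router_logits, experts, k=2):
    """Combine the outputs of the selected experts, weighted by the router."""
    weights = route_token(router_logits, k)
    out = [0.0] * len(token)
    for idx, w in weights.items():
        expert_out = experts[idx](token)  # only the selected experts run
        out = [o + w * v for o, v in zip(out, expert_out)]
    return out


# Four toy experts that scale the token by different factors.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
print(route_token([0.0, 2.0, 1.0, -1.0]))
print(moe_forward([1.0, 1.0], [0.0, 2.0, 1.0, -1.0], experts))
```

Only `k` of the experts execute for each token, which is where the efficiency gain of sparse routing comes from.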

BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions

BLIVA 260 updated 2y ago

BLIVA builds upon InstructBLIP, adding a Visual Assistant to improve its handling of text-rich visual contexts. The architecture captures visual details that may be overlooked during query decoding by combining the learned query embeddings from InstructBLIP with directly projected encoded patch embeddings. Its core components are a vision tower, which encodes visual inputs into patch embeddings; a Q-former, which refines query embeddings; and a projection layer that bridges the visual and linguistic domains so the LLM can access the visual information.

Training follows a two-stage scheme: pre-training on image-text pairs from captioning datasets, followed by instruction tuning on Visual Question Answering (VQA) data. The projection layer for patch embeddings is pre-trained first, and then both the Q-former and the projection layer are fine-tuned. The image encoder and LLM remain frozen throughout to prevent catastrophic forgetting.

For alignment and fusion, BLIVA concatenates the learned query embeddings with the encoded patch embeddings from the visual assistant branch and feeds them directly into the LLM, which improves the model's perception of text in images. The datasets used include image captioning datasets for pre-training, instruction-tuning VQA data, and YTTB-VQA (YouTube Thumbnail Visual Question-Answer pairs), which is used to demonstrate BLIVA's proficiency with text-rich images and its suitability for real-world applications.

MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices

mobilevlm 1.3k updated 2y ago

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

MiniGPT-v2 1.2k updated 4mo ago

MiniGPT-v2 is designed to serve as a unified interface for vision-language multi-task learning by connecting a visual backbone to a large language model. The architecture uses a Visual Transformer (ViT) as its visual backbone, which is kept frozen during training, together with a linear projection layer that merges every four neighboring visual tokens into one. The merged tokens are projected into the feature space of LLaMA-2-chat, a 7-billion parameter language model, which makes processing of high-resolution images (448x448 pixels) practical.

Training follows a three-stage strategy. In the first stage, the model is exposed to a mix of weakly-labeled and fine-grained datasets, focusing on broad vision-language understanding. Training then shifts towards more fine-grained data to improve specific tasks. In the final stage, MiniGPT-v2 is trained on multi-modal instruction and language datasets to refine its responses to multi-modal instructions. Task-specific identifier tokens are used during training to reduce ambiguity and sharpen the distinction between tasks.

MiniGPT-v2 draws on a diverse set of datasets, including LAION, CC3M, SBU, GRIT-20M, COCO caption, and several others, each selected for a particular training stage, from broad knowledge acquisition to task-specific improvement and multi-modal instruction handling.
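The token-merging step described for MiniGPT-v2 can be sketched as follows. The sketch assumes that "neighboring" means consecutive in the token sequence, that merging is done by concatenation before the linear projection, and that the ViT uses a patch size of 14; all three are assumptions for illustration, and the function names are hypothetical.

```python
def visual_token_count(image_size=448, patch_size=14, group=4):
    """Number of tokens before and after merging (patch size is assumed)."""
    patches = (image_size // patch_size) ** 2
    return patches, patches // group


def merge_neighbors(tokens, group=4):
    """Concatenate every `group` consecutive token vectors into one vector."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    merged = []
    for start in range(0, len(tokens), group):
        concatenated = []
        for tok in tokens[start:start + group]:
            concatenated.extend(tok)
        merged.append(concatenated)
    return merged


def linear(x, weight):
    """Apply a bias-free linear layer; weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Eight toy tokens of width 2 become two merged tokens of width 8.
tokens = [[float(i), float(i) + 0.5] for i in range(8)]
merged = merge_neighbors(tokens)
weight = [[1.0] * 8 for _ in range(3)]  # projects width 8 down to width 3
projected = [linear(tok, weight) for tok in merged]

print(visual_token_count())        # (1024, 256)
print(len(merged), len(merged[0])) # 2 8
print(projected[0])                # [14.0, 14.0, 14.0]
```

Under these assumptions the language model sees a quarter as many visual tokens, which is what keeps the higher input resolution affordable.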

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

OpenFlamingo 4.1k updated 1y ago

OpenFlamingo is an open-source adaptation of DeepMind's Flamingo. It combines a CLIP ViT-L/14 visual encoder with a 7B parameter language model, fusing the two modalities during decoding through cross-attention modules that are trained while the pretrained vision encoder and language model stay frozen. The result is strong performance on a variety of vision-language tasks.

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

PALI3 146 updated 2mo ago

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

GitHub 25.8k updated 1y ago

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

GitHub 765 updated 2y ago

BakLLaVA

GitHub 719 updated 2y ago

CogVLM: Visual Expert for Pretrained Language Models

CogVLM 6.7k updated 2y ago

CogVLM adds a visual expert module to a pretrained language model, which lets the model deeply fuse vision-language features. The architecture has four key components: a Vision Transformer (ViT) encoder, an MLP adapter, a pretrained large language model akin to GPT, and the visual expert module.

Training covers both pretraining and fine-tuning. During pretraining, the model learns with an image captioning loss and Referring Expression Comprehension (REC) on over 1.5 billion image-text pairs and a visual grounding dataset of 40 million images. Fine-tuning uses a unified instruction-supervised approach across a variety of visual question-answering datasets.

For alignment, CogVLM places a visual expert module in each layer, consisting of a QKV (Query, Key, Value) matrix and an MLP (Multilayer Perceptron), to achieve deep visual-language feature alignment. This integrates image features into the language model's processing layers.

The datasets used include LAION-2B, COYO-700M, a visual grounding dataset of 40 million images, and visual question-answering datasets such as VQAv2, OKVQA, TextVQA, OCRVQA, and ScienceQA. They serve pretraining, instruction alignment, and tasks such as image captioning and referring expression comprehension.
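One way to picture a per-layer visual expert is as a second set of projection weights that is applied only to image tokens, while text tokens keep the language model's own weights. The sketch below shows that routing for a single projection; the function names and the boolean mask are illustrative assumptions, and a real layer would apply the same idea to the Q, K, and V projections and to the MLP.

```python
def linear(x, weight):
    """Apply a bias-free linear layer; weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def visual_expert_projection(tokens, is_image, text_weight, image_weight):
    """Project each token with modality-specific weights.

    tokens:       list of token vectors in sequence order
    is_image:     parallel list of booleans, True for image tokens
    text_weight:  weights of the original language model (kept as-is)
    image_weight: weights of the visual expert (hypothetical values here)
    """
    return [
        linear(tok, image_weight if img else text_weight)
        for tok, img in zip(tokens, is_image)
    ]


tokens = [[1.0, 2.0], [3.0, 4.0]]
is_image = [True, False]
text_weight = [[1.0, 0.0], [0.0, 1.0]]   # identity, stands in for the LLM
image_weight = [[2.0, 0.0], [0.0, 2.0]]  # stands in for the visual expert

print(visual_expert_projection(tokens, is_image, text_weight, image_weight))
# [[2.0, 4.0], [3.0, 4.0]]
```

Because text tokens never touch the expert weights, the language model's behavior on text-only input is unchanged in this sketch.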

CogVLM2: Enhanced Vision-Language Models for Image and Video Understanding

CogVLM2 2.4k updated 1y ago

Ferret: Refer and Ground Anything Anywhere at Any Granularity

GitHub 8.7k updated 1y ago

OtterHD: A High-Resolution Multi-modality Model

OtterHD 3.4k updated 2y ago

OtterHD-8B builds on the Fuyu-8B architecture to interpret high-resolution visual inputs with fine-grained precision. Unlike models limited by fixed-size vision encoders, OtterHD-8B handles flexible input dimensions, which makes it adaptable to a variety of inference requirements. The model feeds pixel-level visual information directly into the language model without a separate vision encoder, using position embeddings to handle varying image sizes and to process high-resolution images up to 1024x1024 pixels.

Instruction tuning is tailored to accommodate various image resolutions. The model is trained on a dataset mixture including LLaVA-Instruct, VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA, COCO-GOI, COCO-Caption, TextQA, RefCOCO, COCO-ITM, ImageNet, and LLaVA-RLHF. Training uses FlashAttention-2 and other fused operators for optimization, built on PyTorch and HuggingFace transformers.

Because image patches are processed directly alongside textual instructions, OtterHD-8B does not rely on conventional fusion of separate vision and text embeddings. The training datasets reflect its focus on a broad set of vision and language tasks, including question answering, object recognition, and text-image alignment.

CLIP: Contrastive Language-Image Pre-training

CLIP 33.4k updated 2mo ago

CLIP uses a contrastive learning approach, training separate image and text encoders on a dataset of 400 million image-text pairs to predict which caption goes with which image. This enables zero-shot transfer to various downstream tasks without task-specific training data.
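The contrastive objective can be sketched as a symmetric cross-entropy over the image-text similarity matrix of a batch, where the matching pair sits on the diagonal. This is a minimal pure-Python sketch: the fixed `temperature=0.07` and the function names are assumptions for illustration, and the embeddings would in practice come from the two encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of each list is the true match."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each row should peak at its own column.
    i2t = sum(cross_entropy(logits[r], r) for r in range(n)) / n
    # Text-to-image: each column should peak at its own row.
    t2i = sum(
        cross_entropy([logits[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return (i2t + t2i) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched pairs apart, which is what makes caption-based zero-shot classification possible.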

MetaCLIP: Demystifying CLIP Data

GitHub 1.8k updated 6mo ago

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

GitHub 869 updated 10mo ago

GLIP: Grounded Language-Image Pre-training

GLIP 2.6k updated 2y ago

ImageBind: One Embedding Space To Bind Them All

GitHub 9.0k updated 6mo ago

ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

GitHub 12.4k updated 2mo ago

LLaVA 1.6: LLaVA-NeXT Improved reasoning, OCR, and world knowledge

LLaVA 1.6

LLaVA-NeXT advances on LLaVA-1.5 by incorporating high-resolution image processing, enhancing visual reasoning and OCR capabilities, while maintaining a data-efficient design through knowledge transfer from its predecessor and a refined training process.

Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo

Flamingo pioneers a Perceiver-based VLM architecture that uses a Perceiver Resampler and gated cross-attention dense layers, enabling it to process interleaved text and visual sequences and achieve strong few-shot learning performance across a variety of multimodal tasks.
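The gating in a gated cross-attention layer can be sketched as a residual update scaled by `tanh(alpha)`, where `alpha` is a learnable scalar. The sketch below is single-head and omits the learned query, key, and value projections and the dense sub-layer; those simplifications and the function names are assumptions made to keep the example short.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gated_cross_attention(text_tokens, visual_tokens, alpha):
    """Add visual context to each text token, scaled by tanh(alpha)."""
    gate = math.tanh(alpha)
    out = []
    for tok in text_tokens:
        attended = attend(tok, visual_tokens, visual_tokens)
        out.append([t + gate * a for t, a in zip(tok, attended)])
    return out


text = [[1.0, 0.0]]
visual = [[0.0, 2.0]]

# With alpha = 0 the gate is closed and the text tokens pass through unchanged.
print(gated_cross_attention(text, visual, alpha=0.0))  # [[1.0, 0.0]]
# With a non-zero alpha, visual information is mixed in.
print(gated_cross_attention(text, visual, alpha=1.0))
```

Starting with the gate closed means the pretrained language model's outputs are initially untouched, and visual information is blended in gradually as `alpha` is learned.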