SmolVLM builds on the Idefics3 architecture and shares much of its implementation in transformers, with several changes aimed at efficiency. It replaces the Llama 3.1 8B language backbone with the smaller SmolLM2 1.7B model. It also compresses images more aggressively: a pixel shuffle strategy reduces the visual information by a factor of 9, compared to 4x in Idefics3. This allows for 384x384 patches, and a shape-optimized SigLIP with 14x14 inner patches serves as the vision backbone.

The model uses less memory than other VLMs in transformers, which enables efficient on-device inference. For instance, encoding a single image and prompt requires only 1.2k tokens, significantly fewer than models like Qwen2-VL. The smaller token count translates into higher prefill and generation throughput.

SmolVLM achieves strong performance on benchmarks such as MMMU, MathVista, MMStar, DocVQA, and TextVQA. It also shows promising results in basic video analysis, where it benefits from its long context.

For training, the context window of SmolLM2 was extended to 16k tokens using techniques such as adjusting the RoPE base value and fine-tuning on a mixture of long- and short-context datasets. The VLM itself was trained on a curated dataset largely based on The Cauldron and Docmatix. Checkpoints were selected using a weighted metric across multiple vision-language benchmarks.

The model is integrated with VLMEvalKit for easy evaluation, and it can be used and fine-tuned with the transformers library. TRL integration makes it possible to apply Direct Preference Optimization (DPO). A notebook is provided for fine-tuning on VQAv2, with options for LoRA, QLoRA, or full fine-tuning, even within the constraints of consumer GPUs.
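
To make the compression figures above concrete, here is a minimal sketch of the token arithmetic for a single image patch. It assumes the vision encoder emits one token per 14x14 inner patch, that leftover pixels along a side are dropped (floor division), and that pixel shuffle merges square groups of neighbouring tokens. The function name `tokens_per_patch` and its parameters are hypothetical and not part of transformers.

```python
import math


def tokens_per_patch(patch_px: int, inner_px: int, shuffle_factor: int) -> int:
    """Visual tokens left for one square image patch after pixel shuffle.

    Illustrative arithmetic only; this is not the transformers implementation.
    """
    # Tokens along one side of the patch (assumption: remainder pixels are dropped).
    side_tokens = patch_px // inner_px

    # A shuffle factor of 9 is assumed to merge 3x3 groups, a factor of 4 to merge 2x2.
    ratio = math.isqrt(shuffle_factor)
    if ratio * ratio != shuffle_factor:
        raise ValueError("shuffle_factor must be a perfect square")
    if side_tokens % ratio != 0:
        raise ValueError(
            f"{side_tokens} tokens per side cannot be split into {ratio}x{ratio} groups"
        )

    # Each side shrinks by `ratio`, so the token count shrinks by `shuffle_factor`.
    return (side_tokens // ratio) ** 2


smolvlm_tokens = tokens_per_patch(patch_px=384, inner_px=14, shuffle_factor=9)
print(smolvlm_tokens)  # 81
```

Under these assumptions a 384x384 patch yields 27 x 27 = 729 encoder tokens, which the 9x shuffle reduces to 81. Calling the function with a factor of 4 raises `ValueError`, because 27 tokens per side cannot be split into 2x2 groups.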