LLaVA-CoT

is a novel Vision-Language Model (VLM) designed to perform autonomous, multi-stage reasoning, enabling it to tackle complex visual question-answering tasks by independently engaging in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation.

Package 2.1k stars GitHub

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

If you like our project, please give us a star ⭐ on GitHub for the latest update.

本项目希望构建开源多模态慢思考推理大模型，由北大深研院袁粒老师课题组研发。

📣 News

[2025/07/01] LLaVA-CoT has been accepted by ICCV 2025!
[2025/04/13] We have updated the inference script that does not rely on VLMEvalKit, based on the suggestion in this issue.
[2025/01/08] We released the full training code.
[2025/01/02] We discovered that when testing with the AI2D benchmark, we were using AI2D_TEST_NO_MASK, while the VLMEvalKit utilizes AI2D_TEST. We previously overlooked the distinction between the two, and we sincerely apologize for this oversight. We will make the necessary corrections.
[2024/11/28] We've released the dataset: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
[2024/11/25] We've released the code for dataset generation: dataset_generation/generate.py
[2024/11/23] We've released the Gradio App: https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot
[2024/11/20] We've released the pretrained weights: https://huggingface.co/Xkev/Llama-3.2V-11B-cot
[2024/11/18] We've released our paper: https://arxiv.org/abs/2411.10440
[2024/11/18] Welcome to watch 👀 this repository for the latest updates.

🔥 Highlights

LLaVA-CoT is a visual language model capable of spontaneous, systematic reasoning.

Our 11B model outperforms Gemini-1.5-pro,GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

🚀 Demos

LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.

Reasoning Problems

Question	Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6
Llama-3.2-11B-Vision-Instruct	To determine the correct answer, we need to follow the instructions provided. The first instruction is to subtract all tiny shiny balls. Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action. The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects. Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.
LLaVA-CoT	*Summary stage: What's the problem? What should I do?* <SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY> *Caption stage: What can I know from the image?* <CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION> *Reasoning stage: How to solve the problem step-by-step?* <REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING> *Conclusion stage: What is the final answer?* <CONCLUSION> B </CONCLUSION>

Question

Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6

Llama-3.2-11B-Vision-Instruct

To determine the correct answer, we need to follow the instructions provided.
The first instruction is to subtract all tiny shiny balls.
Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action.
The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects.
Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.

LLaVA-CoT

Summary stage: What's the problem? What should I do?
<SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY>
Caption stage: What can I know from the image?
<CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION>
Reasoning stage: How to solve the problem step-by-step?
<REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING>
Conclusion stage: What is the final answer?
<CONCLUSION> B </CONCLUSION>

Science Problems

Question	An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force ( P = 2 , \text{kN} ), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis. Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa
Llama-3.2-11B-Vision-Instruct	To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis: ( G = \frac{F}{A \gamma} ) where ( G ) is the shear modulus, ( F ) is the applied force, ( A ) is the area of the block, and ( \gamma ) is the shear strain. Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain ( \gamma ). Since the displacement is horizontal and the original length of the block is 400 mm, the shear strain ( \gamma ) is: [ \gamma = \frac{\Delta x}{L} = \frac{2 , \text{mm}}{400 , \text{mm}} = 0.005 ] Next, we calculate the area ( A ) of the block. The width of the block is given as 100 mm, and assuming the height of the block is also 200 mm (as given in the image analysis), the area ( A ) is: [ A = 100 , \text{mm} \times 200 , \text{mm} = 20000 , \text{mm}^2 ] Now, we can substitute these values into the formula for shear modulus: [ G = \frac{F}{A \gamma} = \frac{2000 , \text{N}}{20000 , \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 , \text{MPa} ] Therefore, the correct answer is C: 20 MPa.
LLaVA-CoT	*Summary stage: What's the problem? What should I do?* <SUMMARY> To solve the problem, I will determine the shear modulus of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will use the formula for shear modulus and calculate the result. </SUMMARY> *Caption stage: What can I know from the image?* <CAPTION> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force ( P = 2 , \text{kN} ) is applied. The block's dimensions are given as 400 mm in length, 200 mm in height, and 100 mm in width. </CAPTION> *Reasoning stage: How to solve the problem step-by-step?* <REASONING> The shear modulus ( G ) of the polymer can be calculated using the formula: [ G = \frac{F}{A \cdot \frac{\Delta x}{L}} ] where: - ( F ) is the applied force (2 kN), - ( A ) is the area of the block's top surface (100 mm x 400 mm = 40,000 mm²), - ( \Delta x ) is the horizontal displacement (2 mm), - ( L ) is the height of the block (200 mm). First, convert the force to Newtons: ( F = 2 , \text{kN} = 2000 , \text{N} ). Next, calculate the shear modulus: [ G = \frac{2000}{40,000 \cdot \frac{2}{200}} ] [ G = \frac{2000}{40,000 \cdot 0.01} ] [ G = \frac{2000}{400} ] [ G = 5 , \text{MPa} ] </REASONING> *Conclusion stage: What is the final answer?* <CONCLUSION> B: 5 MPa </CONCLUSION>

Question

An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force ( P = 2 , \text{kN} ), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis.
Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa

Llama-3.2-11B-Vision-Instruct

To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis:
( G = \frac{F}{A \gamma} )
where ( G ) is the shear modulus, ( F ) is the applied force, ( A ) is the area of the block, and ( \gamma ) is the shear strain.
Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain ( \gamma ). Since the displacement is horizontal and the original length of the block is 400 mm, the shear strain ( \gamma ) is:
[ \gamma = \frac{\Delta x}{L} = \frac{2 , \text{mm}}{400 , \text{mm}} = 0.005 ]
Next, we calculate the area ( A ) of the block. The width of the block is given as 100 mm, and assuming the height of the block is also 200 mm (as given in the image analysis), the area ( A ) is:
[ A = 100 , \text{mm} \times 200 , \text{mm} = 20000 , \text{mm}^2 ]
Now, we can substitute these values into the formula for shear modulus:
[ G = \frac{F}{A \gamma} = \frac{2000 , \text{N}}{20000 , \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 , \text{MPa} ]
Therefore, the correct answer is C: 20 MPa.

LLaVA-CoT

Summary stage: What's the problem? What should I do?
<SUMMARY> To solve the problem, I will determine the shear modulus of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will use the formula for shear modulus and calculate the result. </SUMMARY>
Caption stage: What can I know from the image?
<CAPTION> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force ( P = 2 , \text{kN} ) is applied. The block's dimensions are given as 400 mm in length, 200 mm in height, and 100 mm in width. </CAPTION>
Reasoning stage: How to solve the problem step-by-step?
<REASONING> The shear modulus ( G ) of the polymer can be calculated using the formula:
[ G = \frac{F}{A \cdot \frac{\Delta x}{L}} ]
where:
- ( F ) is the applied force (2 kN),
- ( A ) is the area of the block's top surface (100 mm x 400 mm = 40,000 mm²),
- ( \Delta x ) is the horizontal displacement (2 mm),
- ( L ) is the height of the block (200 mm).
First, convert the force to Newtons: ( F = 2 , \text{kN} = 2000 , \text{N} ).
Next, calculate the shear modulus:
[ G = \frac{2000}{40,000 \cdot \frac{2}{200}} ]
[ G = \frac{2000}{40,000 \cdot 0.01} ]
[ G = \frac{2000}{400} ]
[ G = 5 , \text{MPa} ]
</REASONING>
Conclusion stage: What is the final answer?
<CONCLUSION> B: 5 MPa </CONCLUSION>

🛠️ Usage

Pretrained weights

You can download the pretrained weights from the Huggingface: Xkev/Llama-3.2V-11B-cot.

Dataset

You can download the dataset from the Huggingface: Xkev/LLaVA-CoT-100k.

Inference

You can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference.

If you want to use perform inference time scaling, you can refer to the detailed instructions provided in this file.

Finetuning

You may use any repository that supports Llama-3.2-11B-Vision-Instruct for finetuning.

We recommend using llama-recipes.

To reproduce our results, you can use the following command:

cd train
pip install llama-recipes
torchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 4 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder LLaVA-CoT --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "datasets/cot_dataset.py"  --run_validation False --batching_strategy padding

Remember to modify the data_path and image_base_path in train/cot_dataset.py to your own path (the path to the training dataset).

📝 Citation

If you find this paper useful, please consider staring 🌟 this repo and citing 📑 our paper:

@InProceedings{Xu_2025_ICCV,
    author    = {Xu, Guowei and Jin, Peng and Wu, Ziang and Li, Hao and Song, Yibing and Sun, Lichao and Yuan, Li},
    title     = {LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {2087-2098}
}

🙏 Acknowledgement

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to LLAMA 3.2 COMMUNITY LICENSE AGREEMENT, and Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations.
The template is modified from Chat-Univi and LLaVA.

Back to VLM Architectures