MiniCPM-Llama3-V 2.5

Archived at: 2025-01-13

MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

  • 🔥 Leading Performance. MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max and greatly outperforms other Llama 3-based MLLMs.

  • 💪 Strong OCR Capabilities. MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a 700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.

  • 🏆 Trustworthy Behavior. Leveraging the latest RLAIF-V method (the newest technique in the RLHF-V [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best performance within the open-source community. The alignment data has been released.

  • 🌏 Multilingual Support. Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from VisCPM, MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to over 30 languages, including German, French, Spanish, Italian, and Korean. See the full list of supported languages.

  • 🚀 Efficient Deployment. MiniCPM-Llama3-V 2.5 systematically employs model quantization, CPU optimizations, NPU optimizations and compilation optimizations, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a 150x acceleration in end-side MLLM image encoding and a 3x speedup in language decoding.

  • 💫 Easy Usage. MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) GGUF format quantized models in 16 sizes, (3) efficient LoRA fine-tuning with only 2 V100 GPUs, (4) streaming output, (5) quick local WebUI demo setup with Gradio and Streamlit, and (6) interactive demos on HuggingFace Spaces.
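The OCR bullet above cites an input budget of roughly 1.8 million pixels (e.g., 1344x1344) at any aspect ratio. As a hypothetical illustration (not part of the model's own preprocessing), a client could downscale an oversized image to fit that budget while preserving its aspect ratio:

```python
import math

# ~1.8 million pixels, the budget cited in the OCR bullet above
MAX_PIXELS = 1344 * 1344


def fit_within_budget(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Return (w, h) with the same aspect ratio and w * h <= max_pixels."""
    if width * height <= max_pixels:
        return width, height
    # Uniform scale factor so the scaled area lands at (or just under) the cap
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))
```

For example, a 4000x3000 photo would be scaled down to roughly 1551x1163 before encoding, while a 1344x1344 image passes through unchanged.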

Evaluation <!-- omit in toc -->

<div align="center"> </div> <details> <summary>Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench. </summary> <div align="center"> <table style="margin: 0px auto;"> <thead> <tr> <th align="left">Model</th> <th>Size</th> <th>OCRBench</th> <th>TextVQA val</th> <th>DocVQA test</th> <th>Open-Compass</th> <th>MME</th> <th>MMB test (en)</th> <th>MMB test (cn)</th> <th>MMMU val</th> <th>Math-Vista</th> <th>LLaVA Bench</th> <th>RealWorld QA</th> <th>Object HalBench</th> </tr> </thead> <tbody align="center"> <tr> <td colspan="14" align="left"><strong>Proprietary</strong></td> </tr> <tr> <td nowrap="nowrap" align="left">Gemini Pro</td> <td>-</td> <td>680</td> <td>74.6</td> <td>88.1</td> <td>62.9</td> <td>2148.9</td> <td>73.6</td> <td>74.3</td> <td>48.9</td> <td>45.8</td> <td>79.9</td> <td>60.4</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">GPT-4V (2023.11.06)</td> <td>-</td> <td>645</td> <td>78.0</td> <td>88.4</td> <td>63.5</td> <td>1771.5</td> <td>77.0</td> <td>74.4</td> <td>53.8</td> <td>47.8</td> <td>93.1</td> <td>63.0</td> <td>86.4</td> </tr> <tr> <td colspan="14" align="left"><strong>Open-source</strong></td> </tr> <tr> <td nowrap="nowrap" align="left">Mini-Gemini</td> <td>2.2B</td> <td>-</td> <td>56.2</td> <td>34.2*</td> <td>-</td> <td>1653.0</td> <td>-</td> <td>-</td> <td>31.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">Qwen-VL-Chat</td> <td>9.6B</td> <td>488</td> <td>61.5</td> <td>62.6</td> <td>51.6</td> <td>1860.0</td> <td>61.8</td> <td>56.3</td> <td>37.0</td> <td>33.8</td> <td>67.7</td> <td>49.3</td> <td>56.2</td> </tr> <tr> <td nowrap="nowrap" align="left">DeepSeek-VL-7B</td> <td>7.3B</td> <td>435</td> <td>64.7*</td> <td>47.0*</td> <td>54.6</td> <td>1765.4</td> <td>73.8</td> <td>71.4</td> <td>38.3</td> <td>36.8</td> <td>77.8</td> <td>54.2</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">Yi-VL-34B</td> 
<td>34B</td> <td>290</td> <td>43.4*</td> <td>16.9*</td> <td>52.2</td> <td><strong>2050.2</strong></td> <td>72.4</td> <td>70.7</td> <td>45.1</td> <td>30.7</td> <td>62.3</td> <td>54.8</td> <td>79.3</td> </tr> <tr> <td nowrap="nowrap" align="left">CogVLM-Chat</td> <td>17.4B</td> <td>590</td> <td>70.4</td> <td>33.3*</td> <td>54.2</td> <td>1736.6</td> <td>65.8</td> <td>55.9</td> <td>37.3</td> <td>34.7</td> <td>73.9</td> <td>60.3</td> <td>73.6</td> </tr> <tr> <td nowrap="nowrap" align="left">TextMonkey</td> <td>9.7B</td> <td>558</td> <td>64.3</td> <td>66.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">Idefics2</td> <td>8.0B</td> <td>-</td> <td>73.0</td> <td>74.0</td> <td>57.2</td> <td>1847.6</td> <td>75.7</td> <td>68.6</td> <td>45.2</td> <td>52.2</td> <td>49.1</td> <td>60.7</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">Bunny-LLama-3-8B</td> <td>8.4B</td> <td>-</td> <td>-</td> <td>-</td> <td>54.3</td> <td>1920.3</td> <td>77.0</td> <td>73.9</td> <td>41.3</td> <td>31.5</td> <td>61.2</td> <td>58.8</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">LLaVA-NeXT Llama-3-8B</td> <td>8.4B</td> <td>-</td> <td>-</td> <td>78.2</td> <td>-</td> <td>1971.5</td> <td>-</td> <td>-</td> <td>41.7</td> <td>37.5</td> <td>80.1</td> <td>60.0</td> <td>-</td> </tr> <tr> <td nowrap="nowrap" align="left">Phi-3-vision-128k-instruct</td> <td>4.2B</td> <td>639*</td> <td>70.9</td> <td>-</td> <td>-</td> <td>1537.5*</td> <td>-</td> <td>-</td> <td>40.4</td> <td>44.5</td> <td>64.2*</td> <td>58.8*</td> <td>-</td> </tr> <tr style="background-color: #e6f2ff;"> <td nowrap="nowrap" align="left">MiniCPM-V 1.0</td> <td>2.8B</td> <td>366</td> <td>60.6</td> <td>38.2</td> <td>47.5</td> <td>1650.2</td> <td>64.1</td> <td>62.6</td> <td>38.3</td> <td>28.9</td> <td>51.3</td> <td>51.2</td> <td>78.4</td> </tr> <tr style="background-color: #e6f2ff;"> <td nowrap="nowrap" align="left">MiniCPM-V 2.0</td> 
<td>2.8B</td> <td>605</td> <td>74.1</td> <td>71.9</td> <td>54.5</td> <td>1808.6</td> <td>69.1</td> <td>66.5</td> <td>38.2</td> <td>38.7</td> <td>69.2</td> <td>55.8</td> <td>85.5</td> </tr> <tr style="background-color: #e6f2ff;"> <td nowrap="nowrap" align="left">MiniCPM-Llama3-V 2.5</td> <td>8.5B</td> <td><strong>725</strong></td> <td><strong>76.6</strong></td> <td><strong>84.8</strong></td> <td><strong>65.1</strong></td> <td>2024.6</td> <td><strong>77.2</strong></td> <td><strong>74.2</strong></td> <td><strong>45.8</strong></td> <td><strong>54.3</strong></td> <td><strong>86.7</strong></td> <td><strong>63.5</strong></td> <td><strong>89.7</strong></td> </tr> </tbody> </table> </div> * We evaluate the officially released checkpoint by ourselves. </details> <div align="center">
Evaluation results of multilingual LLaVA Bench
</div>

Examples <!-- omit in toc -->


Model Zoo

| Model | Device | Memory | Description | Download |
|:---|:---:|:---:|:---|:---:|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | 🤗 |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, lower memory usage and faster inference. | 🤗 |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, lower GPU memory usage. | 🤗 |
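For the full-precision GPU checkpoint listed above, local inference goes through Hugging Face transformers with `trust_remote_code`. The sketch below follows the usage pattern shown on the model card; the exact `model.chat()` keyword arguments are assumptions and should be checked against the official repository:

```python
from typing import Dict, List


def build_msgs(question: str) -> List[Dict[str, str]]:
    """Build the single-turn message list expected by model.chat()."""
    return [{"role": "user", "content": question}]


def run_chat(image_path: str, question: str) -> str:
    """Sketch of single-image chat; downloads the ~8B checkpoint on first use."""
    # Heavy imports live inside the function so the module can be read
    # (and build_msgs reused) without pulling in torch or the model.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        "openbmb/MiniCPM-Llama3-V-2_5",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ).to("cuda").eval()
    tokenizer = AutoTokenizer.from_pretrained(
        "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
    )
    image = Image.open(image_path).convert("RGB")
    # chat() signature follows the model card; verify before relying on it
    return model.chat(
        image=image,
        msgs=build_msgs(question),
        tokenizer=tokenizer,
        sampling=True,   # sampled decoding; set False for greedy
        temperature=0.7,
    )
```

The int4 variant substitutes the `openbmb/MiniCPM-Llama3-V-2_5-int4` checkpoint with the same call pattern, and the gguf variant is served through llama.cpp/ollama instead of transformers.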