# Multimodal Support in llama.cpp
This directory provides multimodal capabilities for `llama.cpp`. Initially intended as a showcase for running LLaVA models, its scope has expanded significantly over time to include various other vision-capable models. As a result, LLaVA is no longer the only multimodal architecture supported.
> [!IMPORTANT]
>
> Multimodal support can be viewed as a sub-project within `llama.cpp`. It is under **very heavy development**, and **breaking changes are expected**.

The naming and structure related to multimodal support have evolved, which might cause some confusion. Here's a brief timeline to clarify:
- [#3436](https://github.com/ggml-org/llama.cpp/pull/3436): Initial support for LLaVA 1.5 was added, introducing `llava.cpp` and `clip.cpp`. The `llava-cli` binary was created for model interaction.
- [#4954](https://github.com/ggml-org/llama.cpp/pull/4954): Support for MobileVLM was added, becoming the second vision model supported. This built upon the existing `llava.cpp`, `clip.cpp`, and `llava-cli` infrastructure.
- **Expansion & Fragmentation:** Many new models were subsequently added (e.g., [#7599](https://github.com/ggml-org/llama.cpp/pull/7599), [#10361](https://github.com/ggml-org/llama.cpp/pull/10361), [#12344](https://github.com/ggml-org/llama.cpp/pull/12344), and others). However, `llava-cli` lacked support for the increasingly complex chat templates required by these models. This led to the creation of model-specific binaries like `qwen2vl-cli`, `minicpmv-cli`, and `gemma3-cli`. While functional, this proliferation of command-line tools became confusing for users.
- [#12849](https://github.com/ggml-org/llama.cpp/pull/12849): `libmtmd` was introduced as a replacement for `llava.cpp`. Its goals include providing a single, unified command-line interface, improving the user/developer experience (UX/DX), and supporting both audio and image inputs.
- [#13012](https://github.com/ggml-org/llama.cpp/pull/13012): `mtmd-cli` was added, consolidating the various model-specific CLIs into a single tool powered by `libmtmd`.
## Pre-quantized models
See the list of pre-quantized models [here](../../docs/multimodal.md).
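
For example, one entry from that list can be run directly from Hugging Face. This is a minimal sketch, assuming the `llama-mtmd-cli` binary and the `-hf` flag, which for these pre-quantized repositories is expected to fetch the matching `mmproj` file as well:

```sh
# Sketch: run a pre-quantized multimodal model straight from Hugging Face.
# The -hf flag downloads the model (and, for these repositories, its mmproj file) into the local cache.
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
    --image path/to/image.jpg \
    -p "Describe this image."
```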
## How it works and what is `mmproj`?
Multimodal support in `llama.cpp` works by encoding images into embeddings using a separate model component, and then feeding these embeddings into the language model.

This approach keeps the multimodal components distinct from the core `libllama` library. Separating these allows for faster, independent development cycles. While many modern vision models are based on Vision Transformers (ViTs), their specific pre-processing and projection steps can vary significantly. Integrating this diverse complexity directly into `libllama` is currently challenging.

Consequently, running a multimodal model typically requires two GGUF files:
1. The standard language model file.
2. A corresponding **multimodal projector (`mmproj`)** file, which handles the image encoding and projection.
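
For example, when both files are stored locally they are passed explicitly on the command line (a minimal sketch; the file names below are placeholders):

```sh
# Sketch with placeholder file names: -m selects the language model,
# --mmproj selects the multimodal projector, --image supplies the input image.
llama-mtmd-cli -m model-text.gguf \
    --mmproj mmproj-model.gguf \
    --image input.jpg \
    -p "What is shown in this image?"
```

The same `--mmproj` argument is also accepted by `llama-server`.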
## What is `libmtmd`?
As outlined in the history, `libmtmd` is the modern library designed to replace the original `llava.cpp` implementation for handling multimodal inputs.

Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advantages:
- **Unified Interface:** Aims to consolidate interaction for various multimodal models.
- **Improved UX/DX:** Features a more intuitive API, inspired by the `Processor` class in the Hugging Face `transformers` library.
- **Flexibility:** Designed to support multiple input types (text, audio, images) while respecting the wide variety of chat templates used by different models.
## How to obtain `mmproj`
Multimodal projector (`mmproj`) files are specific to each model architecture.

For the following models, you can use `convert_hf_to_gguf.py` with the `--mmproj` flag to get the `mmproj` file (see the example after the list):
- [Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d); see the guide [here](../../docs/multimodal/gemma3.md) (note: the 1B variant does not have vision support)
- SmolVLM (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- SmolVLM2 (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- [Pixtral 12B](https://huggingface.co/mistral-community/pixtral-12b) - only works with a `transformers`-compatible checkpoint
- Qwen 2 VL and Qwen 2.5 VL (from [Qwen](https://huggingface.co/Qwen))
- [Mistral Small 3.1 24B](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- InternVL 2.5 and InternVL 3 from [OpenGVLab](https://huggingface.co/OpenGVLab) (note: conversion of the `InternVL3-*-hf` models is not supported, only the non-HF version is; the `InternLM2Model` **text** model is also not supported)
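
As a rough sketch of that conversion step (the local model directory and output file names below are placeholders; check `convert_hf_to_gguf.py --help` for the options available in your checkout):

```sh
# Sketch with a placeholder model directory (a local checkout of one of the models above).
# Generate the multimodal projector file:
python convert_hf_to_gguf.py ./path/to/model --mmproj --outfile mmproj-model.gguf

# Convert the language model itself as usual, without --mmproj:
python convert_hf_to_gguf.py ./path/to/model --outfile model-text.gguf --outtype f16
```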
For older models, please refer to the relevant guide for instructions on how to obtain or create them:

NOTE: the conversion scripts are located under `tools/mtmd/legacy-models`
- [LLaVA](../../docs/multimodal/llava.md)
- [MobileVLM](../../docs/multimodal/MobileVLM.md)
- [GLM-Edge](../../docs/multimodal/glmedge.md)
- [MiniCPM-V 2.5](../../docs/multimodal/minicpmv2.5.md)
- [MiniCPM-V 2.6](../../docs/multimodal/minicpmv2.6.md)
- [MiniCPM-o 2.6](../../docs/multimodal/minicpmo2.6.md)
- [IBM Granite Vision](../../docs/multimodal/granitevision.md)