Summary: Efficient Vision-Language Instruction Tuning for LLMs (arxiv.org)
8,028 words - PDF document
One Line
Efficient Vision-Language Instruction Tuning for LLMs proposes Mixture-of-Modality Adaptation (MMA), a lightweight adaptation scheme with a learned routing mechanism for optimizing multimodal LLMs, resulting in superior reasoning ability together with reduced training time and GPU memory usage.
Key Points
- Efficient Vision-Language Instruction Tuning for LLMs proposes a cost-effective solution called Mixture-of-Modality Adaptation (MMA) that allows Large Language Models (LLMs) to shift between single- and multi-modal instructions without sacrificing natural language understanding abilities.
- The proposed solution, LaVIN, is validated through experiments on multimodal science question answering and multimodal dialogue, demonstrating competitive performance and superior training efficiency compared to existing multimodal LLMs.
- LaVIN is a large vision-language instructed model that predicts the next token using a multimodal input. It is parameter-efficient and lightweight, with a simpler architecture than previous works and a visual adapter to transform visual features.
- The Mixture-of-Modality Adaptation (MMA) training scheme optimizes the entire model end-to-end without requiring additional VL pre-training, keeping the number of optimized parameters at a small scale and reducing training time and storage cost (a minimal sketch of the adapter routing follows this list).
- LaVIN achieves strong results on ScienceQA, outperforming existing few-shot LLMs; its gains come from a larger LLM, the MM-Adapter, a stronger image encoder, joint optimization, and the vision modality. It handles instructions for different tasks such as translation, math, coding, image captioning, and visual question answering.
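The key points above describe the Mixture-of-Modality Adapter only at a high level. Below is a hedged PyTorch sketch of the general idea as summarized here: a lightweight adapter that routes hidden states between a text-oriented path and an image-text-oriented path via a temperature-sharpened softmax. All names, dimensions, and the exact router design are illustrative assumptions, not the released LaVIN implementation.

```python
# Minimal sketch of a Mixture-of-Modality Adapter (assumed design, not LaVIN's code).
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Standard bottleneck adapter: down-project, activate, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class MMAdapter(nn.Module):
    """Two adaptation paths plus a router that weights them per input."""
    def __init__(self, dim: int, bottleneck: int = 8, temperature: float = 0.1):
        super().__init__()
        self.text_path = LowRankAdapter(dim, bottleneck)
        self.multimodal_path = LowRankAdapter(dim, bottleneck)
        self.router = nn.Linear(dim, 2)   # one logit per adaptation path
        self.temperature = temperature    # small temperature -> sharp routing weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen LLM block.
        # Routing weights are computed from the mean token feature and sharpened
        # by the temperature, so one path usually dominates for a given modality.
        weights = torch.softmax(self.router(x.mean(dim=1)) / self.temperature, dim=-1)
        out_text = self.text_path(x)
        out_mm = self.multimodal_path(x)
        return weights[:, 0, None, None] * out_text + weights[:, 1, None, None] * out_mm
```

With a small temperature the routing weights become nearly one-hot, which is consistent with the observation reported in the summary that LaVIN's routing weights for text-only versus text-image inputs are very sharp.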
Summaries
232 word summary
Efficient Vision-Language Instruction Tuning for LLMs proposes a solution for vision-language instruction tuning called Mixture-of-Modality Adaptation (MMA), which uses a learned routing mechanism to optimize the multimodal LLM via a small number of parameters, reducing training costs. LaVIN, a vision-language model for instruction-following tasks, demonstrates superior reasoning ability compared to existing multimodal LLMs and achieves overall better responses across multiple tasks while reducing training time and GPU memory usage by up to 80%. The paper compares LaVIN with existing multimodal LLMs in multi-turn conversations and shows that LaVIN achieves the highest GPT-4 score among all compared models, suggesting the significance of MMA in adapting LLMs to multimodal tasks. The paper also lists various studies that have improved vision-language understanding and generation, including prefix-tuning, bootstrapping language-image pre-training, and grounding language models to images, and it mentions tools such as FairScale and Vicuna. The references cover language understanding with advanced large language models and related machine learning topics, including transformer models, language models, visual models, and mathematical reasoning tasks. The cited works discuss enhancing vision-language models, conditional prompt learning, chain-of-thought reasoning in language models, multimodal action, efficient fine-tuning of language models with zero-init attention, benchmarking and analysis platforms for natural language understanding, open and efficient foundation language models, and solving AI tasks with ChatGPT.
513 word summary
Efficient Vision-Language Instruction Tuning for LLMs is a document that focuses on language understanding with advanced large language models. It references various sources related to machine learning, including papers on transformer models, language models, visual models, and mathematical reasoning tasks. The referenced works cover enhancing vision-language models, conditional prompt learning, chain-of-thought reasoning in language models, multimodal action, and efficient fine-tuning of language models with zero-init attention. Additionally, the document touches on benchmarking and analysis platforms for natural language understanding, open and efficient foundation language models, and solving AI tasks with ChatGPT.
The Efficient Vision-Language Instruction Tuning for LLMs paper proposes a solution for vision-language instruction tuning called Mixture-of-Modality Adaptation (MMA), which uses a learned routing mechanism to optimize the multimodal LLM via a small number of parameters, reducing training costs. LaVIN, a vision-language model for instruction-following tasks, demonstrates superior reasoning ability compared to existing multimodal LLMs and achieves overall better responses across multiple tasks while reducing training time and GPU memory usage by up to 80%. The paper compares LaVIN with existing multimodal LLMs in multi-turn conversations and shows that LaVIN achieves the highest GPT-4 score among all compared models, suggesting the significance of MMA in adapting LLMs to multi-modal tasks.
The paper also lists various studies that have improved vision-language understanding and generation, including prefix-tuning, bootstrapping language-image pre-training, and grounding language models to images, and it mentions tools such as FairScale and Vicuna. LaVIN is a new method for efficient vision-language instruction tuning using Mixture-of-Modality Adaptation (MMA) that can adapt to different modalities and preserve the NLP capabilities of Large Language Models (LLMs). The MMA training scheme optimizes the entire model end-to-end without requiring additional VL pre-training, reducing training time and storage cost. LaVIN achieves strong results on ScienceQA and on large-scale instruction datasets such as Alpaca-52k and LLaVA-158k. It outperforms existing approaches with significantly lower training costs and achieves performance comparable to LLaVA. LaVIN handles instructions for different tasks such as translation, math, coding, image captioning, and visual question answering, and its mixture-of-modality adapter improves accuracy for both image-based and text-based questions. The paper releases the source code and pre-trained checkpoints associated with LaVIN. Experimental results show that LaVIN is superior in efficiency and performance compared to existing multimodal chatbots. The paper also discusses the effectiveness of different LLMs, including Flamingo and InstructGPT, and the importance of instruction tuning for improving their performance on downstream tasks. Efficient Vision-Language Instruction Tuning for LLMs presents MMA as a cost-effective solution that allows Large Language Models (LLMs) to switch between single- and multi-modal instructions without sacrificing natural language understanding abilities. MMA uses lightweight adapters to enable joint optimization of image and language parameters. The proposed solution, LaVIN, is validated through experiments on multimodal science question answering and multimodal dialogue, demonstrating competitive performance and superior training efficiency compared to existing multimodal LLMs. MMA is an end-to-end optimization scheme that is cheap to train and efficient at automatically shifting between text-only and image-text instructions. Modular training of LLMs is preferred over an ensemble of LLM and vision models because of the latter's computation and storage overhead.
785 word summary
Efficient Vision-Language Instruction Tuning for LLMs proposes a cost-effective solution called Mixture-of-Modality Adaptation (MMA) that allows Large Language Models (LLMs) to shift between single- and multi-modal instructions without sacrificing natural language understanding abilities. MMA uses lightweight adapters to bridge the gap between LLMs and vision tasks, enabling joint optimization of image and language parameters. The proposed solution, LaVIN, is validated through experiments on multimodal science question answering and multimodal dialogue, demonstrating competitive performance and superior training efficiency compared to existing multimodal LLMs. Existing solutions for vision-language learning are prohibitively expensive and require large-scale pre-training, whereas MMA is an end-to-end optimization scheme that is cheap to train and efficient at automatically shifting between text-only and image-text instructions. Modular training of LLMs is preferred over an ensemble of LLM and vision models because of the latter's computation and storage overhead. The paper proposes a new method called LaVIN for efficient vision-language instruction tuning using Mixture-of-Modality Adaptation (MMA) that can adapt to different modalities and preserve the NLP capabilities of Large Language Models (LLMs). LaVIN can accurately execute various types of human instructions and achieve cheap and quick adaptation on VL tasks without requiring another round of large-scale pre-training. MMA is proposed as an end-to-end optimization regime that connects the image encoder and the LLM with lightweight adapters to dynamically choose the suitable adaptation path for inputs of different modalities. The paper releases the source code and pre-trained checkpoints associated with LaVIN. Experimental results show that LaVIN is superior in efficiency and performance compared to existing multimodal chatbots. The paper validates LaVIN through quantitative experiments on ScienceQA and by applying MMA to a recently proposed LLM called LLaMA. The article also discusses the effectiveness of different LLMs, including Flamingo and InstructGPT, and the importance of instruction tuning for improving their performance on downstream tasks. LaVIN is a large vision-language instructed model that predicts the next token using a multimodal input. It is parameter-efficient and lightweight, with a simpler architecture than previous works and a visual adapter to transform visual features. The Mixture-of-Modality Adaptation (MMA) training scheme optimizes the entire model end-to-end without requiring additional VL pre-training, keeping the number of optimized parameters at a small scale and reducing training time and storage cost. LaVIN achieves strong results on ScienceQA, outperforming existing few-shot LLMs; its gains come from a larger LLM, the MM-Adapter, a stronger image encoder, joint optimization, and the vision modality. It is evaluated on large-scale datasets such as Alpaca-52k and LLaVA-158k, with LaVIN-13B achieving the best results among the tested models. The study compares LaVIN with existing LLMs and conducts ablation studies to confirm the effectiveness of the proposed designs. LaVIN outperforms existing approaches with significantly lower training costs and achieves performance comparable to LLaVA. It handles instructions for different tasks such as translation, math, coding, image captioning, and visual question answering, and its mixture-of-modality adapter improves accuracy for both image-based and text-based questions.
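The paragraph above says LaVIN predicts the next token from a multimodal input and uses a visual adapter to transform visual features, but gives no concrete form. The following is a hedged PyTorch sketch of one plausible reading: project frozen image-encoder features into the LLM embedding space, prepend them to the text embeddings, and apply a standard causal next-token loss. The module names, dimensions, and the `inputs_embeds` interface are assumptions for illustration, not the paper's actual code.

```python
# Sketch of multimodal next-token prediction with a small visual adapter (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    """Maps frozen image-encoder features (e.g. ViT-L/14 patch tokens) to the LLM width."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, hidden: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden), nn.SiLU(), nn.Linear(hidden, llm_dim)
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_feats)

def multimodal_next_token_loss(llm, visual_adapter, image_feats, text_embeds, labels):
    # Prepend adapted visual tokens to the text embeddings to form the multimodal input.
    visual_tokens = visual_adapter(image_feats)
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    # Positions aligned with visual tokens carry no target, so they are ignored (-100).
    ignore = torch.full(visual_tokens.shape[:2], -100,
                        dtype=labels.dtype, device=labels.device)
    targets = torch.cat([ignore, labels], dim=1)
    logits = llm(inputs_embeds=inputs).logits          # (batch, seq, vocab); HF-style call assumed
    # Standard causal shift: position t predicts token t+1.
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           targets[:, 1:].reshape(-1), ignore_index=-100)
```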
The Efficient Vision-Language Instruction Tuning for LLMs paper proposes a solution for vision-language instruction tuning called Mixture-of-Modality Adaptation (MMA), which uses a learned routing mechanism to optimize the multimodal LLM via a small number of parameters, reducing training costs. LaVIN, a vision-language model for instruction-following tasks, demonstrates superior reasoning ability compared to existing multimodal LLMs and achieves overall better responses across multiple tasks while reducing training time and GPU memory usage by up to 80%. In multi-turn conversations, LaVIN outperforms other LLMs in providing high-quality responses. However, LaVIN has limitations: it may generate incorrect or fabricated responses and cannot identify extremely fine-grained visual content. The paper compares LaVIN with existing multimodal LLMs in multi-turn conversations and shows that LaVIN achieves the highest GPT-4 score among all compared models, suggesting the significance of MMA in adapting LLMs to multi-modal tasks. The paper also lists various studies that have improved vision-language understanding and generation, including prefix-tuning, bootstrapping language-image pre-training, and grounding language models to images, and it mentions tools such as FairScale and Vicuna. Efficient Vision-Language Instruction Tuning for LLMs is a document that focuses on language understanding with advanced large language models. It references various sources related to machine learning, including papers on transformer models, language models, visual models, and mathematical reasoning tasks. The papers discussed in the document explore the use of prompts for fine-tuning models, unsupervised multitask learning, and efficient visual adaptation via structural re-parameterization. The document also mentions a technical report on GPT-4 and a suite of fundamental mathematical reasoning tasks called NumGLUE. Key topics discussed in the document include enhancing vision-language models, conditional prompt learning, chain-of-thought reasoning in language models, multimodal action, and efficient fine-tuning of language models with zero-init attention. Additionally, the document discusses benchmarking and analysis platforms for natural language understanding, open and efficient foundation language models, and solving AI tasks with ChatGPT.
1835 word summary
Efficient Vision-Language Instruction Tuning for LLMs is a document that includes several arXiv preprints and proceedings that focus on language understanding with advanced large language models. Some of the key topics discussed in the document include enhancing vision-language models, conditional prompt learning, chain-of-thought reasoning in language models, multimodal action, and efficient fine-tuning of language models with zero-init attention. Additionally, the document discusses benchmarking and analysis platforms for natural language understanding, open and efficient foundation language models, and solving AI tasks with ChatGPT. Efficient Vision-Language Instruction Tuning for LLMs references various sources related to machine learning. The sources include papers on transformer models, language models, visual models, and mathematical reasoning tasks. Some of the papers discuss the use of prompts for fine-tuning models, while others explore unsupervised multitask learning. One paper presents a new dataset for open book question answering, and another proposes a method for efficient visual adaptation via structural re-parameterization. The document also mentions a technical report on GPT-4 and a suite of fundamental mathematical reasoning tasks called NumGLUE. Finally, there is a paper on decoupled weight decay regularization and another on visual instruction tuning. Efficient Vision-Language Instruction Tuning for LLMs is a research paper that lists various studies that have improved vision-language understanding and generation. Some of these studies include prefix-tuning, bootstrapping language-image pre-training, and grounding language models to images. Other studies mentioned in the paper are parameter-efficient transfer learning for NLP, scaling instruction-finetuned language models, and adapting vision transformers for scalable visual recognition. Additionally, the paper mentions tools such as FairScale, a general-purpose modular PyTorch library for high-performance and large-scale training, and Vicuna, an open-source chatbot impressing GPT-4 with 90% ChatGPT quality. The paper proposes a solution for vision-language instruction tuning called Mixture-of-Modality Adaptation (MMA), which uses a learned routing mechanism to optimize the multimodal LLM via a small number of parameters, reducing training costs. MMA is capable of shifting reasoning paths for single- and multi-modal instructions, giving LaVIN an affordable end-to-end optimization scheme. LaVIN, a vision-language model for instruction-following tasks, demonstrates superior reasoning ability compared to existing multimodal LLMs. However, LaVIN has limitations: it may generate incorrect or fabricated responses and cannot identify extremely fine-grained visual content. The paper compares LaVIN with existing multimodal LLMs in multi-turn conversations and shows that LaVIN achieves the highest GPT-4 score among all compared models, suggesting the significance of MMA in adapting LLMs to multi-modal tasks. LaVIN demonstrates superior visual reasoning ability in executing single- and multi-modal instructions, including complex scenes and math problems. It also presents clear and concise coding behavior with accurate output. Compared to existing LLMs, LaVIN achieves overall better responses across multiple tasks and reduces training time and GPU memory usage by up to 80%. In multi-turn conversations, LaVIN outperforms other LLMs in providing high-quality responses.
One qualitative example summarized from the paper describes an image of a man holding a baby and petting a brown horse, with no visible signs of rain or stormy conditions. The baby is wearing a white onesie and reaching out to touch the horse's nose, creating a moment of connection and curiosity for the child. The man and baby are both standing under an overhang, which provides incomplete protection from rain; if it starts raining, the man will get wet because the overhang does not offer complete cover. The weather in the image appears to be sunny, suggesting that the location might be a farm or rural area where sunny conditions are more common.
In the Efficient Vision-Language Instruction Tuning for LLMs document, the authors compare the training expenditures of LaVIN, LLaVA, and BLIP2, and evaluate the effectiveness of the MM-Adapter in adapting the multimodal LLM. They find that text-only and text-image instruction inputs have different requirements for their adaptations, and the MM-Adapter effectively decouples the inference of the different modalities into two sets of adapters. The routing weights of LaVIN for text-only and text-image instruction inputs are very sharp, suggesting that the model is very confident in its decision. The performance of LaVIN is further improved by +1.09, validating the significance of MMA in adapting the multimodal LLM. The document presents LaVIN, a language and vision transformer model that dynamically executes different inference paths based on the input modality. It handles instructions for different tasks such as translation, math, coding, image captioning, and visual question answering, and its mixture-of-modality adapter improves accuracy for both image-based and text-based questions. The performance of LaVIN is significantly improved by using a better image encoder. With the help of multimodal tuning, LaVIN surpasses other parameter-efficient methods in terms of accuracy. Experiments are run on 8 A100 GPUs. Efficient Vision-Language Instruction Tuning for LLMs is a study on improving the performance of multimodal large language models (LLMs) on ScienceQA through joint optimization and modality adaptation. The study compares the proposed LaVIN model with existing LLMs, including LLaVA and LLaMA-Adapter, and conducts ablation studies to confirm the effectiveness of the proposed designs. LaVIN outperforms existing approaches with significantly lower training costs and achieves performance comparable to LLaVA. The joint optimization of the image encoder and the LLM, as well as the mixture-of-modality training, greatly contribute to the final performance. When scaled up to 13B, LaVIN obtains more significant performance gains. The study also highlights the importance of considering the modality gap in input instructions and the learning of visual content in LLMs. LaVIN is an end-to-end multimodal LLM that achieves strong results on ScienceQA. LaVIN is compared to state-of-the-art methods and outperforms existing few-shot LLMs. The model uses a larger LLM, the MM-Adapter, a stronger image encoder, joint optimization, and the vision modality to achieve these results. The implementation details include using ViT-L/14 as the image encoder and a cosine decay learning rate schedule. The model is evaluated on large-scale datasets such as Alpaca-52k and LLaVA-158k. LaVIN-13B achieves the best results among the tested models, while LLaMA-Adapter achieves the best results among models with fewer parameters. The article presents a new large vision-language instructed model called LaVIN, which is parameter-efficient and lightweight. It uses a multimodal input to predict the next token step by step. The architecture of LaVIN is much simpler than previous works and includes a visual adapter to transform visual features. The training scheme uses Mixture-of-Modality Adaptation (MMA), which optimizes the entire model end-to-end without requiring additional VL pre-training. The number of optimized parameters is kept at a small scale, reducing training time and storage cost.
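The implementation details mentioned above (a frozen ViT-L/14 image encoder, a small set of optimized parameters, a cosine-decay learning-rate schedule, 8 A100 GPUs) suggest a training setup along the following lines. This is a hedged sketch only: the hyperparameter values below are placeholders, not numbers reported in the paper.

```python
# Sketch of the parameter-efficient setup described in the summary (assumed values):
# freeze both backbones and optimize only the adapter parameters with AdamW
# (decoupled weight decay) under a cosine-decay learning-rate schedule.
import torch

def build_optimizer(image_encoder, llm, adapter_params, epochs: int = 20):
    # Freeze the large image encoder and the LLM; only the inserted adapters train.
    for p in image_encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False

    optimizer = torch.optim.AdamW(adapter_params, lr=9e-3, weight_decay=0.02)  # placeholder values
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```

Because only `adapter_params` receive gradients, the optimized parameter count, and therefore the per-task checkpoint that has to be stored, stays small, which is the storage saving the summary attributes to MMA.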
The overall training objective is defined over mini-batches randomly sampled from text-only and text-image instructions, with the ground-truth response and the loss function denoted by R and L, respectively. The Efficient Vision-Language Instruction Tuning for LLMs document proposes a learning regime for the vision-language adaptation of LLMs. The proposed method uses Mixture-of-Modality Adapters and Mixture-of-Modality Training to optimize multimodal LLMs in an end-to-end manner. The Mixture-of-Modality Adapter can also be used as a unimodal adapter to improve adaptation ability, while Mixture-of-Modality Training freezes the large image encoder and the LLM and fine-tunes only the inserted adapters. The process selects the best adaptation path according to the modalities of the input instructions, so MMA can dynamically adapt to the input features. During instruction tuning, LaVIN is optimized by Mixture-of-Modality Training. The proposed method is efficient and cheap in training time and storage.

The article discusses efficient tuning methods for multimodal large language models (LLMs). Most LLMs incur expensive training costs, but recent modular training models provide more efficient alternatives. The paper proposes MMA, which dynamically adjusts the adaptations of different modalities to improve adaptation capability and inference speed. This method is compared to other PETL methods and AdaMix, which also aim to reduce the training and storage overhead of LLMs. The article also discusses the effectiveness of different LLMs, including Flamingo and InstructGPT, and the importance of instruction tuning for improving their performance on downstream tasks. The paper proposes a new solution for efficient vision-language instruction tuning called LaVIN, which uses Mixture-of-Modality Adaptation (MMA) to adapt to different modalities and preserve the NLP capabilities of Large Language Models (LLMs). The paper releases the source code and pre-trained checkpoints associated with LaVIN. Experimental results show that LaVIN is superior in efficiency and performance compared to existing multimodal chatbots. LaVIN can accurately execute various types of human instructions, such as coding, math, and image captioning. The paper validates LaVIN through quantitative experiments on ScienceQA and by applying MMA to a recently proposed LLM called LLaMA. MMA reduces training time and storage costs while achieving on-par performance with advanced multimodal LLMs. LaVIN achieves cheap and quick adaptation on VL tasks without requiring another round of large-scale pre-training. The paper proposes MMA as an end-to-end optimization regime that connects the image encoder and the LLM with lightweight adapters, which dynamically choose the suitable adaptation path for inputs of different modalities. Existing multimodal LLMs do not handle text-only instructions well: drastic changes in their parameter spaces undermine the NLP capabilities of the LLM, and their tuning increases training time and intermediate storage overhead. Efficient Vision-Language Instruction Tuning for LLMs is a preprint that discusses different multimodal adaptation schemes for LLMs. The required VL pre-training remains too expensive for quick adaptation of cross-modal alignment, and the joint optimization of LLMs and vision models still exhibits significant redundancy in computation and parameters, leading to excessive memory footprints.
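The training objective mentioned above only names the ground-truth response R and the loss L without writing them out. A plausible reconstruction, assuming a standard autoregressive instruction-tuning loss over mini-batches that mix text-only and text-image examples (the notation is reconstructed, not quoted from the paper), is:

```latex
% X: a text-only or text-image instruction; R = (r_1, ..., r_T): the ground-truth
% response; theta: the small set of trainable adapter parameters.
\mathcal{L}(\theta)
  = - \mathbb{E}_{(X, R) \sim \mathcal{B}}
      \sum_{t=1}^{T} \log p_{\theta}\bigl(r_t \mid X,\, r_{<t}\bigr),
\qquad
\mathcal{B} = \mathcal{B}_{\text{text}} \cup \mathcal{B}_{\text{text-image}}
```

Here \mathcal{B} stands for a mini-batch drawn at random from the pooled text-only and text-image instruction data, which is how the mixed-modality batches described above would enter the objective.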
Existing multimodal solutions for LLMs can be roughly divided into two main categories: expert systems and modular training. In expert systems, the LLM usually serves as a manager that interprets different natural language instructions and then calls the corresponding vision models to handle the input image. The proposed Mixture-of-Modality Adaptation (MMA) is an end-to-end optimization scheme that is cheap to train and superior at automatically shifting between text-only and image-text instructions. Modular training of LLMs is preferred because an ensemble of an LLM and vision models is expensive in terms of computation and storage overhead. Large language models (LLMs) have been continuously improving in their natural language processing (NLP) abilities, and the introduction of instruction tuning has enabled LLMs to engage in human-like conversations and handle various NLP tasks. However, existing solutions for vision-language (VL) learning are prohibitively expensive and require large-scale pre-training before VL instruction tuning. To address this, the authors propose a novel and affordable solution called Mixture-of-Modality Adaptation (MMA) that enables LLMs to shift automatically between single- and multi-modal instructions without compromising their natural language understanding abilities. MMA adopts lightweight modules called adapters to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of image and language parameters. The authors validate MMA and their solution, LaVIN, through extensive experiments under two setups: multimodal science question answering and multimodal dialogue. LaVIN demonstrates competitive performance and superior training efficiency compared to existing multimodal LLMs, confirming its great potential as a general-purpose chatbot. The actual training expenditure of LaVIN is extremely low, making it an effective solution for efficient vision-language instruction tuning for LLMs.