Humans interact with the world through two key pillars: language and vision. Much of the recent excitement in combining the two stems from the remarkable capabilities of Large Language Models (LLMs), which have taken the world by storm with their rapidly improving performance. LLMs such as GPT-3, T5, and PaLM have begun imitating humans by learning to read, summarize, and generate textual data.
Researchers in the field of Artificial Intelligence have been working toward a general-purpose assistant that can effectively follow multimodal vision-and-language instructions aligned with human intent to complete real-world tasks. To this end, language-augmented foundation vision models are being developed for open-world visual understanding, covering tasks such as classification, detection, segmentation, captioning, visual generation, and editing. With the release of GPT-4 by OpenAI, the transformer model behind the famous chatbot ChatGPT, its multimodal capabilities have proved to be a valuable addition to the list of LLMs.
In a recent research paper, the authors have presented the first attempt to use GPT-4 to generate multimodal language-image instruction-following data. The team has introduced LLaVA, a Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder with Vicuna for general-purpose visual and language understanding. Vicuna is an open-source chatbot with 13B parameters, trained by fine-tuning LLaMA on user-shared conversations.
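The core architectural idea of connecting a vision encoder to a language model can be sketched in a few lines. This is a minimal illustration with simulated features and hypothetical hidden sizes, not the real model: a frozen vision encoder produces patch features, a learned linear projection maps them into the language model's embedding space, and the projected "visual tokens" are prepended to the text token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden sizes for the vision encoder and the LLM.
D_VISION, D_LLM = 1024, 4096
NUM_PATCHES, NUM_TEXT_TOKENS = 256, 8

# Simulated outputs: patch features from the vision encoder,
# and embedded text tokens from the language model's tokenizer.
image_features = rng.standard_normal((NUM_PATCHES, D_VISION))
text_embeddings = rng.standard_normal((NUM_TEXT_TOKENS, D_LLM))

# The trainable piece: a projection from vision space to LLM embedding space.
W = rng.standard_normal((D_VISION, D_LLM)) * 0.01

# Project image features into "visual tokens" and prepend them to the text.
visual_tokens = image_features @ W
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(llm_input.shape)  # (264, 4096): visual tokens followed by text tokens
```

The language model then attends over this combined sequence, which is what lets a text-only decoder condition its answers on the image.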
LLaVA is an attempt to extend instruction tuning to the multimodal space. The main objective is to enable users to complete real-time tasks with the help of a visual assistant that can effectively follow multimodal vision-and-language instructions aligned with human intent. The significant contributions made by the team are as follows:
- Multimodal instruction-following data – The team has presented a data reformation perspective and pipeline to convert image-text pairs into the instruction-following format with the help of the GPT-4 model.
- Large multimodal models – The team has developed a large multimodal model by connecting the open-set visual encoder of CLIP with the language decoder LLaMA and fine-tuning them end-to-end on the generated instructional vision-language data.
- The empirical study validates the effectiveness of the generated data for LMM instruction tuning and suggests practical tips for building a general-purpose instruction-following visual agent.
- SOTA performance has been achieved on the ScienceQA multimodal reasoning dataset in synergy with GPT-4.
- Open-Source nature – The project is open source, and the generated multimodal instruction data, the codebase for data generation and model training, the model checkpoint, and a visual chat demo are open to the public for access and can be accessed at https://github.com/haotian-liu/LLaVA.
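The data-reformation idea behind the first contribution can be illustrated with a short sketch. Because the GPT-4 used for data generation was text-only, the paper's pipeline describes an image to it symbolically, via captions and object bounding boxes, and asks it to produce instruction-following conversations. The prompt wording and helper below are illustrative assumptions, not the authors' exact templates:

```python
def build_gpt4_prompt(captions, boxes):
    """Assemble a text-only context for GPT-4 from image annotations.

    captions: list of caption strings for the image.
    boxes: list of (object_name, (x1, y1, x2, y2)) tuples with
           normalized coordinates.
    """
    context = "\n".join(f"Caption: {c}" for c in captions)
    context += "\n" + "\n".join(
        f"Object: {name} at {coords}" for name, coords in boxes
    )
    return (
        "You are given descriptions of an image.\n"
        f"{context}\n"
        "Design a question about the image and answer it as if you "
        "could see the image directly."
    )

# Example image-text pair converted into a generation prompt.
prompt = build_gpt4_prompt(
    captions=["A dog catching a frisbee in a park."],
    boxes=[("dog", (0.2, 0.4, 0.6, 0.9)),
           ("frisbee", (0.5, 0.1, 0.6, 0.2))],
)
print(prompt)
```

The prompt would then be sent to GPT-4, and the returned question-answer pairs become the instruction-following training data.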
LLaVA has demonstrated impressive multimodal chat abilities, achieving an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on ScienceQA, the synergy of LLaVA and GPT-4 achieved a new SOTA accuracy of 92.53%. The results make LLaVA a promising approach and a great contribution to the released language models.
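To make the 85.1% figure concrete, here is a minimal sketch of how a relative score of this kind can be computed (the exact judging protocol is described in the paper; the ratings below are hypothetical): a GPT-4 judge rates both the candidate model's answers and the reference answers on the same questions, and the relative score is the ratio of the totals.

```python
def relative_score(candidate_ratings, reference_ratings):
    """Return the candidate's total rating as a percentage of the
    reference's total rating."""
    return 100.0 * sum(candidate_ratings) / sum(reference_ratings)

# Hypothetical 1-10 judge ratings on five questions.
llava_ratings = [8, 7, 9, 6, 8]
reference_ratings = [9, 8, 9, 9, 10]

print(round(relative_score(llava_ratings, reference_ratings), 1))  # 84.4
```

A relative score near 100% would mean the candidate's answers were judged roughly as good as the reference's.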
Check out the Research Paper, Code, and Project.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.