What is all the buzz around ‘Visual ChatGPT’, and what can it do for you?

Microsoft has recently announced a new system called Visual ChatGPT, which combines visual foundation models with ChatGPT so that users can interact with it beyond plain text. While ChatGPT has excellent conversational competence and reasoning abilities, it cannot process or produce images. Visual foundation models, on the other hand, can comprehend and generate images but are limited to specialized tasks with fixed inputs and outputs. Visual ChatGPT brings the two together to provide a language interface for images.

Visual ChatGPT lets users send and receive both text and images. It can handle complex visual questions or editing instructions that require several AI models to work together over multiple steps, and users can give feedback and ask for corrections along the way. To inject visual model information into ChatGPT, Microsoft researchers designed a series of prompts that tell ChatGPT what each visual foundation model can do and what inputs and outputs it expects, so that ChatGPT can call on those models for visual tasks.
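To make the prompt idea concrete, here is a minimal sketch of how descriptions of visual models might be rendered into text that a language model can read. This is not the researchers' actual code: the `VisualTool` class, the `build_system_prompt` function, and the wording of the prompt are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualTool:
    """Hypothetical description of one visual foundation model."""
    name: str
    usage: str    # when the language model should pick this tool
    inputs: str   # what the tool expects
    outputs: str  # what the tool returns


def build_system_prompt(tools):
    """Render the tool descriptions as plain text a language model can read."""
    lines = ["You can use the following visual tools:"]
    for tool in tools:
        lines.append(
            f"- {tool.name}: {tool.usage} "
            f"Input: {tool.inputs}. Output: {tool.outputs}."
        )
    # Images are passed around by file name, since the language model only sees text.
    lines.append("Refer to images by file name only.")
    return "\n".join(lines)


# Two illustrative tools; the descriptions are assumptions, not quoted prompts.
TOOLS = [
    VisualTool(
        "Image Captioning",
        "Use it to find out what an image shows.",
        "an image file name",
        "a text description",
    ),
    VisualTool(
        "Depth Estimation",
        "Use it to predict the depth map of an image.",
        "an image file name",
        "a depth-map file name",
    ),
]

prompt = build_system_prompt(TOOLS)
print(prompt)
```

The point of the sketch is that the language model never sees pixels. It only sees text that names each tool, says when to use it, and describes what goes in and what comes out.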

ChatGPT is trained to retain conversational context, reply appropriately to follow-up questions, and produce accurate responses. However, it is limited to a single modality, language, so it cannot process visual data on its own. In contrast, visual foundation models (VFMs) have tremendous potential in computer vision due to their ability to interpret and create complex images. Although VFMs are less versatile than conversational language models, they can be combined with ChatGPT to create a system that perceives and generates visual information.

Training a multimodal conversational model from scratch would require large amounts of data and computing power, so Visual ChatGPT instead connects ChatGPT to existing visual foundation models. For instance, if a user uploads a picture of a black elephant along with a complex instruction, Visual ChatGPT chains several visual foundation models together to carry out the request step by step. In their work, the researchers note that failures of the visual foundation models and instability of the prompts are areas of concern, as both lead to less-than-satisfactory generation results. A self-correction module is therefore needed to check that execution outputs are consistent with human intentions and to make the necessary corrections.
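The chaining and checking described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the visual foundation models are replaced by stand-in functions that only rename files, and `run_plan`, `looks_consistent`, and the retry policy are hypothetical rather than taken from the researchers' system.

```python
# Stand-ins for visual foundation models: each maps a file name to a new file name.
def estimate_depth(image):
    return f"depth_of_{image}"


def depth_to_image(depth_map):
    return f"generated_from_{depth_map}"


TOOLS = {"depth": estimate_depth, "generate": depth_to_image}


def looks_consistent(step, result):
    """Toy consistency check: the step must produce a new, non-empty file name."""
    return bool(result) and "_" in result


def run_plan(plan, image, check, max_retries=1):
    """Run each step in order, feeding every output into the next step."""
    current = image
    for step in plan:
        tool = TOOLS[step]
        for _ in range(max_retries + 1):
            result = tool(current)
            if check(step, result):
                break  # the output passed the check, move on to the next step
        else:
            # No attempt passed the check, so stop instead of passing bad output on.
            raise RuntimeError(f"step {step!r} failed the consistency check")
        current = result
    return current


final = run_plan(["depth", "generate"], "elephant.png", looks_consistent)
print(final)
```

In a real system the check would compare the output against what the user asked for, which is far harder than this toy test. The sketch only shows where such a check would sit in the loop.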

In conclusion, Visual ChatGPT is an exciting test case for using generative AI to create and customize images with a simple natural-language instruction. It offers a more comprehensive and versatile conversational interface for images and extends the capabilities of language models beyond language. The researchers plan to keep investigating how to improve the system’s accuracy and reduce its inference time in future studies.
