Since the beginning of 2021, deep-learning-powered text-to-image models such as Midjourney, Stable Diffusion, and DALL-E 2 have completely changed the landscape of AI research. Google's Muse, a text-to-image Transformer model that aims for state-of-the-art image generation performance, is the latest name to add to the list.
Muse is trained on a masked modelling task in discrete token space, conditioned on the text embedding from a pre-trained large language model (LLM): it learns to predict randomly masked image tokens. Because it operates on discrete tokens and requires fewer sampling iterations, Muse claims to be more efficient than pixel-space diffusion models such as Imagen and DALL-E 2. Given a text prompt, the model can iteratively resample image tokens, which yields zero-shot, mask-free editing for free.
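As a rough sketch of that training objective, the snippet below uses hypothetical `vq_tokenizer`, `text_encoder`, and `base_transformer` modules in PyTorch style; these names are placeholders, not the actual Muse code. Images are quantised into discrete token ids, a variable fraction of them is replaced with a [MASK] id, and a cross-entropy loss is computed only at the masked positions.

```python
# Minimal sketch of Muse-style masked image-token modelling (PyTorch).
# vq_tokenizer, text_encoder and base_transformer are hypothetical stand-ins.
import math
import torch
import torch.nn.functional as F

def masked_modelling_step(images, text, vq_tokenizer, text_encoder,
                          base_transformer, mask_id, optimizer):
    # 1. Quantise images into a grid of discrete token ids; the tokenizer
    #    and the LLM text encoder are both frozen.
    with torch.no_grad():
        tokens = vq_tokenizer.encode(images)           # (B, H*W) int64 ids
        text_emb = text_encoder(text)                  # frozen pre-trained LLM

    # 2. Mask each sample at a variable rate (the paper uses a cosine-based
    #    schedule; this is one plausible instantiation).
    u = torch.rand(tokens.size(0), 1, device=tokens.device)
    rate = torch.cos(0.5 * math.pi * u)                # per-sample mask rate
    mask = torch.rand(tokens.shape, device=tokens.device) < rate
    inputs = tokens.masked_fill(mask, mask_id)         # replace with [MASK] id

    # 3. Predict the original ids, scoring only the masked positions.
    logits = base_transformer(inputs, text_emb)        # (B, H*W, vocab)
    loss = F.cross_entropy(logits[mask], tokens[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```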
Muse employs parallel decoding, in contrast to Parti and other autoregressive models. Leveraging a pre-trained LLM enables fine-grained language understanding, which translates into high-fidelity image generation and an understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality. Muse also allows mask-free editing, inpainting, and outpainting without fine-tuning or inverting the model. The 900M-parameter model achieves a new SOTA on CC3M with an FID score of 6.06, while the 3B-parameter Muse model achieves an FID of 7.88 and a CLIP score of 0.32 on zero-shot COCO evaluation.
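Parallel decoding here means filling in many masked tokens per step rather than generating one token at a time. The hedged sketch below follows the MaskGIT-style loop that Muse builds on, with illustrative names: at each step the model predicts every token at once, keeps the most confident predictions, and re-masks the rest on a cosine schedule. Muse's actual sampler adds temperature and noise details that are omitted here; plain argmax is used for clarity.

```python
# MaskGIT-style parallel decoding loop, as an illustrative sketch.
import math
import torch

@torch.no_grad()
def parallel_decode(base_transformer, text_emb, seq_len, mask_id, steps=12):
    device = text_emb.device
    # Start from a fully masked token grid.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(steps):
        logits = base_transformer(tokens, text_emb)     # (1, L, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                  # per-token confidence

        still_masked = tokens == mask_id
        # Already-decoded slots are never re-masked, so give them +inf.
        conf = conf.masked_fill(~still_masked, math.inf)

        # Cosine schedule: fraction of slots left masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))

        # Commit predictions everywhere, then re-mask the least confident.
        tokens = torch.where(still_masked, pred, tokens)
        if keep_masked > 0:
            worst = conf.topk(keep_masked, largest=False).indices
            tokens.scatter_(1, worst, mask_id)
    return tokens
```

Because the number of re-masked slots shrinks to zero over a fixed number of steps, the whole 16x16 grid is produced in roughly a dozen forward passes instead of 256 sequential ones, which is where the claimed efficiency over autoregressive decoding comes from.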
For both the base and super-res Transformer layers, the text encoder provides a text embedding that is utilised for cross-attention with image tokens. The base model relies on a VQ tokenizer which, pre-trained on lower-resolution (256x256) images, produces a 16x16 latent grid of tokens. Tokens are masked at a variable rate per sample, and the model learns to predict them through a cross-entropy loss. Once the base model is trained, the reconstructed lower-resolution tokens and the text tokens are passed to the super-res model, which then learns to predict masked tokens at a higher resolution.
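Putting the pieces together, an end-to-end generation pass under this two-stage design might look like the sketch below, reusing the `parallel_decode` helper from above. Every component name is a hypothetical placeholder, and the super-res stage is collapsed into a single call rather than its own masked decoding loop.

```python
# Illustrative two-stage generation pipeline: base tokens, then super-res.
import torch

@torch.no_grad()
def generate(prompt, text_encoder, base_transformer, superres_transformer,
             vq_low, vq_high, mask_id):
    text_emb = text_encoder(prompt)                   # frozen LLM embedding

    # Stage 1: the base model fills a 16x16 grid of discrete tokens
    # (256 slots), cross-attending to the text embedding.
    low_tokens = parallel_decode(base_transformer, text_emb,
                                 seq_len=16 * 16, mask_id=mask_id)

    # Stage 2: the super-res model conditions on the low-res tokens *and*
    # the text embedding to predict a denser, higher-resolution token grid.
    high_tokens = superres_transformer(low_tokens, text_emb)

    # Decode token grids back to pixels with the matching VQ decoders.
    low_image = vq_low.decode(low_tokens)             # e.g. 256x256 output
    high_image = vq_high.decode(high_tokens)          # higher-resolution output
    return low_image, high_image
```

Conditioning the super-res stage on both the low-resolution tokens and the text, rather than on the tokens alone, is what lets it correct fine detail while staying faithful to the prompt.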