Researchers from Microsoft’s Natural Language Computing (NLC) group announced the latest version of Bidirectional Encoder representation from Image Transformers: BEiT-3, a 1.9B-parameter vision-language AI model. BEiT-3 models images as another language and achieves state-of-the-art performance on a wide range of vision and vision-language downstream tasks.
The model and experiments were described in a paper published on arXiv. The key idea in BEiT-3 is to model images as another language (which the authors call “Imglish”); this allows the model to be pretrained using only the masked language modeling (MLM) objective, making the training process easier to scale up.
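To make the unified-objective idea concrete, here is a minimal sketch (not the authors' implementation) of how a single masked-token objective can treat text tokens and discretized image-patch tokens identically; the token ids, `MASK_ID`, and the `mask_tokens` helper are all hypothetical illustrations:

```python
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with MASK_ID.

    Returns the corrupted sequence and a dict mapping each masked
    position to its original token (the model's prediction targets).
    """
    rng = rng or random.Random(42)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK_ID
            targets[i] = tok
    return corrupted, targets

# Text tokens and image tokens ("Imglish") go through the same routine:
text_tokens = [101, 7592, 2088, 102]     # hypothetical word-piece ids
image_tokens = [8193, 3021, 7745, 1650]  # hypothetical visual-token ids

for seq in (text_tokens, image_tokens):
    corrupted, targets = mask_tokens(seq, mask_prob=0.5)
    # a single model is then trained to recover `targets` from `corrupted`
```

Because both modalities reduce to sequences of discrete tokens, one masking-and-reconstruction loss suffices, which is what lets the pretraining recipe scale without juggling modality-specific objectives.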