Google Research announced the development of A Large-scale ImaGe and Noisy-Text Embedding (ALIGN), an 800M-parameter deep-learning model pre-trained on a noisy dataset of 1.8B image-text pairs. The model can be applied to several downstream tasks and achieves state-of-the-art accuracy on multiple image-text retrieval benchmarks.
Researchers Chao Jia and Yinfei Yang gave an overview of the work in a recent blog post. The team scraped HTML pages from the web and used the alt-text tags associated with the images to produce a dataset of image-text pairs. The ALIGN model, which combines a BERT-style natural language processing (NLP) encoder with an EfficientNet-style computer vision (CV) encoder, was pre-trained on this dataset.
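The dual-encoder design described above maps images and text into a shared embedding space and is trained with a contrastive objective, so that matched image-text pairs score higher than mismatched ones. A minimal NumPy sketch of that idea follows; the dimensions, random linear "encoders", and temperature value are all hypothetical stand-ins for the real EfficientNet and BERT encoders and ALIGN's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; ALIGN's real encoders are far larger.
IMG_DIM, TXT_DIM, EMB_DIM, BATCH = 32, 16, 8, 4

# Stand-in "encoders": random linear projections into a shared embedding space.
W_img = rng.normal(size=(IMG_DIM, EMB_DIM))
W_txt = rng.normal(size=(TXT_DIM, EMB_DIM))

def l2_normalize(x, axis=-1):
    # Unit-normalize so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_feats, txt_feats, temperature=0.1):
    """Symmetric contrastive (InfoNCE-style) loss over matched image-text pairs."""
    img_emb = l2_normalize(img_feats @ W_img)
    txt_emb = l2_normalize(txt_feats @ W_txt)
    logits = img_emb @ txt_emb.T / temperature  # pairwise cosine similarities
    labels = np.arange(len(logits))             # i-th image matches i-th text

    def xent(l):
        # Numerically stable log-softmax cross-entropy on the diagonal pairs.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image retrieval directions.
    return (xent(logits) + xent(logits.T)) / 2

imgs = rng.normal(size=(BATCH, IMG_DIM))
txts = rng.normal(size=(BATCH, TXT_DIM))
loss = contrastive_loss(imgs, txts)
print(loss)
```

During training the loss pulls each image embedding toward its own alt-text embedding and pushes it away from the other captions in the batch; at inference time, retrieval is just a nearest-neighbor search in the shared embedding space.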