Publications

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, et al.

arXiv:2209.06794 (2022) paper link Google AI blog

PaLI (Pathways Language and Image model) achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.

PreSTU: Pre-Training for Scene-Text Understanding

Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

arXiv:2209.05534 (2022) paper link

We propose PreSTU, a simple pre-training recipe specifically designed for scene-text understanding.

Towards Multi-Lingual Visual Question Answering

Soravit Changpinyo, Linting Xue, Idan Szpektor, Ashish V. Thapliyal, Julien Amelot, Xi Chen, Radu Soricut

arXiv:2209.05401 (2022) paper link

We propose scalable solutions to multi-lingual visual question answering (mVQA), on both the data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers.

Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset

Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, Radu Soricut

arXiv:2205.12522 (2022) paper link

We present the Crossmodal-3600 dataset (XM3600 for short), a geographically diverse set of 3600 images annotated with human-generated reference captions in 36 languages.

All You May Need for VQA are Image Captions

Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut

arXiv:2205.01883 (2022) paper link Google AI blog

We propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation.

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks

Nan Ding, Xi Chen, Tomer Levinboim, Beer Changpinyo, Radu Soricut

arXiv:2203.05126 (2022) paper link

We present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement.

Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

Nan Ding, Xi Chen, Tomer Levinboim, Sebastian Goodman, Radu Soricut

NeurIPS 2021 paper link

We develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theory.