Pengchuan Zhang

work page


Meta AI for VR

Menlo Park, CA 94025

United States

I’m an AI research scientist at Meta AI for VR and an affiliate assistant professor in the Department of Electrical Engineering at the University of Washington. Previously, I was a principal researcher at Microsoft Research, Redmond. Before joining Microsoft, I obtained my PhD in Applied and Computational Mathematics from Caltech in 2017. My research interests are mainly in deep learning, computer vision, multimodal intelligence, and the theoretical foundations of deep learning.


Jun 8, 2022 Two updates on our recent vision-and-language efforts: (i) our CVPR 2022 tutorial will take place on 06/19/2022; (ii) our workshop on Computer Vision in the Wild will be held at ECCV 2022 in October 2022. Two challenges accompany the workshop: Image Classification in the Wild and Object Detection in the Wild. The challenge setup and baselines can be found in our ELEVATER benchmark paper. Stay tuned for more details.
May 3, 2022 I’m starting a new position as Research Scientist at Meta AI for VR, where I will continue my long-term pursuit of computer vision and multimodal intelligence. I look forward to working with my colleagues and the broader community to build intelligent and trustworthy technologies for the metaverse, and to push the research frontier of deep learning, computer vision, and multimodal intelligence.
Mar 11, 2022 Gave a talk at the Applied and Computational Mathematics Seminar at the University of Wisconsin–Madison, covering my ICML 2021 work “Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference”. The slides can be found here.
Mar 4, 2022 Four papers were accepted to CVPR 2022: “Grounded Language-Image Pre-training (GLIP)”, “RegionCLIP: Region-based Language-Image Pretraining”, “An Empirical Study of Training End-to-End Vision-and-Language Transformers”, and “Unified Contrastive Learning in Image-Text-Label Space”. Congratulations to all collaborators! The source code has been open-sourced or is in the process of being released.
Jan 29, 2022 Slides of my vision-language tutorial are available. Vision-language understanding is an active research area at the intersection of computer vision (CV) and natural language processing (NLP). The main topics are vision-language representation learning and a range of downstream tasks, including CV tasks (image classification, object detection, etc.) and VL tasks (image/text retrieval, VQA, image captioning, etc.). The tutorial has three lectures: Lecture 1 presents a few important and representative early works in vision-language research; Lecture 2 focuses on vision-language pre-training; Lecture 3 introduces recent vision-language pre-training efforts at even larger scales and summarizes three trends in this area.

selected publications

  1. Using statistics to automate stochastic optimization
    Lang, Hunter, Xiao, Lin, and Zhang, Pengchuan
    Advances in Neural Information Processing Systems 2019
  2. VinVL: Revisiting visual representations in vision-language models
    Zhang, Pengchuan, Li, Xiujun, Hu, Xiaowei, Yang, Jianwei, Zhang, Lei, Wang, Lijuan, Choi, Yejin, and Gao, Jianfeng
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021
  3. A convex relaxation barrier to tight robustness verification of neural networks
    Salman, Hadi, Yang, Greg, Zhang, Huan, Hsieh, Cho-Jui, and Zhang, Pengchuan
    arXiv preprint arXiv:1902.08722 2019
  4. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks
    Xu, Tao, Zhang, Pengchuan, Huang, Qiuyuan, Zhang, Han, Gan, Zhe, Huang, Xiaolei, and He, Xiaodong
    In Proceedings of the IEEE conference on computer vision and pattern recognition 2018
  5. Provably robust deep learning via adversarially trained smoothed classifiers
    Salman, Hadi, Yang, Greg, Li, Jerry, Zhang, Pengchuan, Zhang, Huan, Razenshteyn, Ilya, and Bubeck, Sebastien
    arXiv preprint arXiv:1906.04584 2019
  6. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding
    Zhang, Pengchuan, Dai, Xiyang, Yang, Jianwei, Xiao, Bin, Yuan, Lu, Zhang, Lei, and Gao, Jianfeng
    arXiv preprint arXiv:2103.15358 2021
  7. Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference
    Zhang, Shumao, Zhang, Pengchuan, and Hou, Thomas Y
    arXiv preprint arXiv:2105.05489 2021
  8. Florence: A New Foundation Model for Computer Vision
    Yuan, Lu, Chen, Dongdong, Chen, Yi-Ling, Codella, Noel, Dai, Xiyang, Gao, Jianfeng, Hu, Houdong, Huang, Xuedong, Li, Boxin, Li, Chunyuan, and others
    arXiv preprint arXiv:2111.11432 2021
  9. Grounded Language-Image Pre-training
    Li, Liunian Harold, Zhang, Pengchuan, Zhang, Haotian, Yang, Jianwei, Li, Chunyuan, Zhong, Yiwu, Wang, Lijuan, Yuan, Lu, Zhang, Lei, Hwang, Jenq-Neng, and others
    arXiv preprint arXiv:2112.03857 2021