Pengchuan Zhang

Meta Superintelligence Labs, Meta

Menlo Park, CA 94025

United States

I’m an AI research scientist on the Segment Anything team at Meta Superintelligence Labs (previously the FAIR computer vision team) and an affiliate assistant professor in the Department of Electrical Engineering at the University of Washington. Previously, I was a principal researcher at Microsoft Research, Redmond; before joining Microsoft, I obtained my PhD in Applied and Computational Mathematics from Caltech in 2017. My research interests are mainly in deep learning, computer vision, multimodal intelligence, and the theoretical foundations of deep learning.

news

Nov 19, 2025 I’m excited to share SAM 3, a unified model for detection, segmentation, and tracking of objects across images and videos. SAM 3 introduces some of our most requested features, such as using text and exemplar prompts to segment all objects of a target category. We released both the model weights and a new open-vocabulary detection, segmentation, and tracking benchmark under a permissive license. Try it out! Thanks to all collaborators and contributors on this project.
Apr 5, 2025 Llama 4 was released and open-sourced. I’m proud to have continued leading the visual grounding effort from Llama 3 to Llama 4. We implemented state-of-the-art input-side and output-side visual grounding capabilities, and Llama 4 achieved state-of-the-art performance on the Visual Commonsense Reasoning benchmark and the RefCOCO benchmark. More importantly, it works in real-world scenarios and has powered several product features; expert image grounding is highlighted as a key differentiator for Llama 4. Thanks to all collaborators and contributors on this project.
Jul 31, 2024 Llama 3 was released and open-sourced. I’m proud to be a core contributor to Llama 3 and to have led the visual grounding effort. We implemented state-of-the-art input-side visual grounding capabilities in Llama 3. Under the anonymous name “OV-Grounding”, Llama 3 was the first model to reach human performance on the Visual Commonsense Reasoning leaderboard. Thanks to all collaborators and contributors on this project.
Jun 17, 2024 Two papers from our group received awards at CVPR 2024 — congratulations to the authors and collaborators!
  1. EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone received an EgoVis (Egocentric Vision) 2022/2023 Distinguished Paper Award!
  2. GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation won the Best Short Paper Award at SynData@CVPR2024!
Oct 22, 2022 Our ECCV 2022 workshop “Computer Vision in the Wild” will take place on October 22, 2022. See the workshop site: Computer Vision in the Wild — ECCV 2022. Schedule (local times):
  • Israel: 09:00–18:00
  • Pacific Time: 23:00 (Oct 21)–08:00 (Oct 22)
  • Beijing: 14:00–23:00
I will be chairing the morning session. Please join us!

selected publications

  1. Using statistics to automate stochastic optimization
    Lang, Hunter, Xiao, Lin, and Zhang, Pengchuan
    Advances in Neural Information Processing Systems 2019
  2. VinVL: Revisiting visual representations in vision-language models
    Zhang, Pengchuan, Li, Xiujun, Hu, Xiaowei, Yang, Jianwei, Zhang, Lei, Wang, Lijuan, Choi, Yejin, and Gao, Jianfeng
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021
  3. A convex relaxation barrier to tight robustness verification of neural networks
    Salman, Hadi, Yang, Greg, Zhang, Huan, Hsieh, Cho-Jui, and Zhang, Pengchuan
    arXiv preprint arXiv:1902.08722 2019
  4. AttnGAN: Fine-grained text-to-image generation with attentional generative adversarial networks
    Xu, Tao, Zhang, Pengchuan, Huang, Qiuyuan, Zhang, Han, Gan, Zhe, Huang, Xiaolei, and He, Xiaodong
    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018
  5. Provably robust deep learning via adversarially trained smoothed classifiers
    Salman, Hadi, Yang, Greg, Li, Jerry, Zhang, Pengchuan, Zhang, Huan, Razenshteyn, Ilya, and Bubeck, Sebastien
    arXiv preprint arXiv:1906.04584 2019
  6. Multi-scale Vision Longformer: A new vision transformer for high-resolution image encoding
    Zhang, Pengchuan, Dai, Xiyang, Yang, Jianwei, Xiao, Bin, Yuan, Lu, Zhang, Lei, and Gao, Jianfeng
    arXiv preprint arXiv:2103.15358 2021
  7. Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference
    Zhang, Shumao, Zhang, Pengchuan, and Hou, Thomas Y
    arXiv preprint arXiv:2105.05489 2021
  8. Florence: A New Foundation Model for Computer Vision
    Yuan, Lu, Chen, Dongdong, Chen, Yi-Ling, Codella, Noel, Dai, Xiyang, Gao, Jianfeng, Hu, Houdong, Huang, Xuedong, Li, Boxin, Li, Chunyuan, and others
    arXiv preprint arXiv:2111.11432 2021
  9. Grounded Language-Image Pre-training
    Li, Liunian Harold, Zhang, Pengchuan, Zhang, Haotian, Yang, Jianwei, Li, Chunyuan, Zhong, Yiwu, Wang, Lijuan, Yuan, Lu, Zhang, Lei, Hwang, Jenq-Neng, and others
    arXiv preprint arXiv:2112.03857 2021
  10. EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone
    Pramanick, Shraman, Song, Yale, Nag, Sayan, Lin, Kevin Qinghong, Shah, Hardik, Shou, Mike Zheng, Chellappa, Rama, and Zhang, Pengchuan
    In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023
  11. UniVTG: Towards unified video-language temporal grounding
    Lin, Kevin Qinghong, Zhang, Pengchuan, Chen, Joya, Pramanick, Shraman, Gao, Difei, Wang, Alex Jinpeng, Yan, Rui, and Shou, Mike Zheng
    In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023
  12. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning
    Chen, Jun, Zhu, Deyao, Shen, Xiaoqian, Li, Xiang, Liu, Zechun, Zhang, Pengchuan, Krishnamoorthi, Raghuraman, Chandra, Vikas, Xiong, Yunyang, and Elhoseiny, Mohamed
    arXiv preprint arXiv:2310.09478 2023
  13. The Llama 3 herd of models
    Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Yang, Amy, Fan, Angela, and others
    arXiv e-prints 2024
  14. Evaluating text-to-visual generation with image-to-text generation
    Lin, Zhiqiu, Pathak, Deepak, Li, Baiqi, Li, Jiayao, Xia, Xide, Neubig, Graham, Zhang, Pengchuan, and Ramanan, Deva
    In European Conference on Computer Vision 2024