Pengchuan Zhang
Meta Superintelligence Labs
Menlo Park, CA 94025
United States
I’m an AI research scientist on the Segment Anything team at Meta Superintelligence Labs (previously the FAIR computer vision team) and an affiliate assistant professor in the Department of Electrical Engineering at the University of Washington. Previously, I was a principal researcher at Microsoft Research, Redmond. Before joining Microsoft, I obtained my PhD in Applied and Computational Mathematics from Caltech in 2017. My research interests are mainly in deep learning, computer vision, multimodal intelligence, and the theoretical foundations of deep learning.
news
| Nov 19, 2025 | I’m excited to share SAM 3, a unified model for detecting, segmenting, and tracking objects across images and videos. SAM 3 introduces some of our most requested features, such as using text and exemplar prompts to segment all objects of a target category. We released both the model weights and a new open-vocabulary detection, segmentation, and tracking benchmark under a permissive license. |
| Apr 5, 2025 | Llama 4 was released and open-sourced. I’m proud to continue leading the visual grounding effort from Llama 3 to Llama 4. We implemented state-of-the-art input-side and output-side visual grounding capabilities in Llama 4, which achieved state-of-the-art performance on the Visual Commonsense Reasoning and RefCOCO benchmarks. More importantly, it works in real-world scenarios and has powered several product features. Expert image grounding is highlighted as a key differentiator for Llama 4. |
| Jul 31, 2024 | Llama 3 was released and open-sourced. I’m proud to be a core contributor to Llama 3 and to have led the visual grounding effort. We implemented state-of-the-art input-side visual grounding capabilities in Llama 3. Under the anonymous name “OV-Grounding”, Llama 3 is the first model to reach human performance on the Visual Commonsense Reasoning leaderboard. |
| Jun 17, 2024 | Two papers from our group received awards at CVPR 2024 — congratulations to the authors and collaborators! |
| Oct 22, 2022 | Our ECCV 2022 workshop “Computer Vision in the Wild” will take place on October 22, 2022. See the workshop site: Computer Vision in the Wild — ECCV 2022. |
selected publications
- Using statistics to automate stochastic optimization. In Advances in Neural Information Processing Systems, 2019.
- VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- A convex relaxation barrier to tight robustness verification of neural networks. arXiv preprint arXiv:1902.08722, 2019.
- AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584, 2019.
- Multi-scale vision Longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358, 2021.
- Multiscale invertible generative networks for high-dimensional Bayesian inference. arXiv preprint arXiv:2105.05489, 2021.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Grounded language-image pre-training. arXiv preprint arXiv:2112.03857, 2021.
- EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- UniVTG: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
- The Llama 3 herd of models. arXiv e-prints, 2024.
- Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, 2024.