Shilong Liu Homepage
Hi! This is Shilong Liu (刘世隆). I am a Postdoctoral Research Fellow (AI^2 Fellow) at the Princeton AI Lab, Princeton University, under the supervision of Prof. Mengdi Wang. I obtained my Ph.D. from the Department of Computer Science and Technology, Tsinghua University, under the supervision of Prof. Lei Zhang (at IDEA Research), Prof. Hang Su, and Prof. Jun Zhu. I received my B.Eng. degree from the Department of Industrial Engineering, Tsinghua University, in 2020.
Before joining Princeton, I was a Research Scientist at Bytedance Seed. During my Ph.D. and research career, I have had the privilege to intern and collaborate at leading research labs, including Bytedance, NVIDIA Research, Microsoft Research Redmond, IDEA Research, and Shengshu Tech, working with amazing mentors such as Dr. Guilin Liu, Dr. Zhiding Yu, Dr. Chunyuan Li, Dr. Hao Cheng, Dr. Jianwei Yang, and Dr. Guang Shi.
🎯 Research Interests
My research goal is to build autonomous AI systems for real-world applications, including Embodied AI and Web Agents. To achieve this, I explore new techniques including Multimodal Foundation Models, Physical Intelligence, and Tool Use/Tool Creation. I am also interested in interdisciplinary research that bridges AI systems and other scientific domains.
🚀 Work With Me
I am looking for collaborators and self-motivated interns excited about multimodal AI/agents research and its real-world applications. Contact me at slongliu86@gmail.com or shilong.liu@princeton.edu.
🔬 Representative Works
- Visual Perception & DETR Evolution
We introduced a series of Transformer-based detection models, including DAB-DETR, DN-DETR, DINO, MaskDINO, and Stable-DINO. DINO was the first DETR-like model to achieve state-of-the-art performance on the COCO object detection leaderboard.
- Open-world Visual Understanding & Multimodal Models
We developed Grounding DINO and Grounded-SAM, empowering models to detect and segment anything. Grounding DINO is now the most downloaded zero-shot object detection model on Hugging Face, with over 2 million downloads per month. The subsequent series, including Grounding-DINO-1.5, 1.6, and DINO-X, continues to push open-world perception forward.
- Multimodal Agents & Real-world Applications
We introduced LLaVA-Plus, enhancing multimodal large language models with vision-expert tool usage to build general multimodal agents. We also extended this agent pipeline to diverse domains:
- We proposed Alita, a generalist deep-research agent that ranked 1st on the GAIA benchmark, surpassing OpenAI Deep Research.
- We proposed Avenir-Web, the SOTA open-source web agent on the Online-Mind2Web benchmark, with code available.
- We proposed CubeBench and CubeAgent, exploring multimodal AI agents for Rubik’s Cube solving and challenging their world modeling and spatial reasoning capabilities.
- We also proposed MMedAgents and AMS-IO-Agent to extend agent systems to medical reasoning and chip design.
- World Models & Generative Models
We proposed Web World Models, which decouple world states from content through web code, enabling the creation of unlimited, personalized, and controllable web pages. I also contributed to the first version of Vidu, a general video-generation model whose team has since grown into a leading video-generation company.
🏆 Awards & Recognitions
- WAIC Yunfan Award – Rising Star, 2024 (15 people/year)
- KAUST AI Rising Star, 2024 (Top 15%)
- CCF-CV Academic Emerging Scholar Award, 2023 (3 people/year)
- First Prize of Innovation 84 Scholarship, Tsinghua University, 2024
- Outstanding Graduate / Outstanding Thesis, Tsinghua University, 2025
📈 Impacts
- 16,000+ Google Scholar citations
- 30,000+ GitHub stars
- 4 papers selected as Top 15 Most Influential Papers by Paper Digest
If you are interested in multimodal AI/agents research and their real-world applications, feel free to reach out at:
📧 slongliu86 [AT] gmail.com or shilong.liu [AT] princeton.edu
(Note: the Tsinghua email is deprecated; please use Gmail or Princeton email instead.)
Feel free to add me on WeChat: SLONG_88 (please include a short self-introduction).
Google Scholar | GitHub | LinkedIn | X/Twitter | Zhihu 知乎 | RedNote 小红书 | CV (01-2026)
News
| Feb 22, 2026 | Web World Models is featured on Princeton AI Lab News. [Link] |
|---|---|
| Feb 12, 2026 | We will organize the 2nd Workshop on Test-Time Scaling for Computer Vision (ViSCALE) at CVPR 2026. We warmly welcome paper submissions and hope you’ll join us. [link] |
| Jan 22, 2026 | Two papers, CubeBench and TAPTRv3, have been accepted to ICLR 2026. Congratulations! |
| Nov 22, 2025 | Congratulations to Visincept on raising nearly CNY 100 million (about USD 14 million). It’s great to see Grounding DINO and the DINO series making an even bigger impact in industry. [News] |
| Oct 15, 2025 | Happy to share that I’ve joined the AI Lab at Princeton University as a postdoctoral researcher. See you on the East Coast! |
| Jun 1, 2025 | Happy to graduate from Tsinghua University with a PhD. Huge thanks to my advisors, supervisors, and all my collaborators! |
| Nov 11, 2024 | Invited talk at EECS 542, University of Michigan. [Slides] |
| Jul 22, 2024 | Started my internship at NVIDIA, collaborating with Guilin Liu and Zhiding Yu. See you in the Bay Area, USA. |