Shilong Liu Homepage

Hi! This is Shilong Liu (刘世隆). I am a Postdoctoral Research Fellow (AI^2 Fellow) at the Princeton AI Lab, Princeton University, under the supervision of Prof. Mengdi Wang. I obtained my Ph.D. from the Department of Computer Science and Technology, Tsinghua University, under the supervision of Prof. Lei Zhang (at IDEA Research), Prof. Hang Su, and Prof. Jun Zhu. I received my B.Eng. degree from the Department of Industrial Engineering, Tsinghua University, in 2020.

Before joining Princeton, I was a Research Scientist at ByteDance Seed. During my Ph.D. and research career, I have had the privilege of interning and collaborating at leading research labs, including ByteDance, NVIDIA Research, Microsoft Research Redmond, IDEA Research, and Shengshu Tech, working with amazing mentors such as Dr. Guilin Liu, Dr. Zhiding Yu, Dr. Chunyuan Li, Dr. Hao Cheng, Dr. Jianwei Yang, and Dr. Guang Shi.

🎯 Research Interests

My research goal is to build autonomous AI systems for real-world applications, including Embodied AI and Web Agents. To achieve this, I explore new techniques including Multimodal Foundation Models, Physical Intelligence, and Tool Use/Tool Creation. I am also interested in interdisciplinary research that uses AI systems to bridge the gap between AI and other scientific domains.

🚀 Work With Me

I am looking for collaborators and self-motivated interns excited about multimodal AI/agents research and its real-world applications. Contact me by email at slongliu86@gmail.com or shilong.liu@princeton.edu.

🔬 Representative Works

Visual Perception & DETR Evolution
We introduced a series of Transformer-based detection models including DAB-DETR, DN-DETR, DINO, MaskDINO, and Stable-DINO. DINO was the first DETR-like model to achieve state-of-the-art performance on the COCO object detection leaderboard.
Open-world Visual Understanding & Multimodal Models
We developed Grounding DINO and Grounded-SAM, empowering models to detect and segment anything. Grounding DINO is now the most downloaded zero-shot object detection model on Hugging Face and receives over 2 million downloads per month. The subsequent series, including Grounding-DINO-1.5, 1.6, and DINO-X, continues to push open-world perception forward.
Multimodal Agents & Real-world Applications
We introduced LLaVA-Plus, enhancing multimodal large language models with vision-expert tool usage to build general multimodal agents. We also extended this agent pipeline to diverse domains:
- We proposed Alita, a generalist deep-research agent that ranked 1st on the GAIA benchmark, surpassing OpenAI Deep Research.
- We proposed Avenir-Web, the SOTA open-source web agent on the Online-Mind2Web benchmark, with code available.
- We proposed CubeBench and CubeAgent, exploring multimodal AI agents for Rubik’s Cube solving and challenging their world modeling and spatial reasoning capabilities.
- We also proposed MMedAgents and AMS-IO-Agent to extend agent systems to medical reasoning and chip design.
World Models & Generative Models
We proposed Web World Models, which decouple world states from content through web code, enabling the creation of unlimited, personalized, and controllable web pages. I also contributed to the first version of Vidu, a general video generation model that has since grown into a leading video-generation company.

🏆 Awards & Recognitions

WAIC Yunfan Award – Rising Star, 2024 (15 people/year)
KAUST AI Rising Star, 2024 (Top 15%)
CCF-CV Academic Emerging Scholar Award, 2023 (3 people/year)
First Prize, Innovation 84 Scholarship, Tsinghua University, 2024
Outstanding Graduate/Outstanding Thesis, Tsinghua University, 2025

📈 Impacts

16,000+ Google Scholar citations
30,000+ GitHub stars
4 papers selected as Top 15 Most Influential Papers by Paper Digest

If you are interested in multimodal AI/agents research and their real-world applications, feel free to reach out at:
📧 slongliu86 [AT] gmail.com or shilong.liu [AT] princeton.edu
(Note: the Tsinghua email is deprecated; please use Gmail or Princeton email instead.)

Feel free to add me on WeChat: SLONG_88 (please include a short self-introduction).

News

Feb 22, 2026	Web World Models is featured on Princeton AI Lab News. [Link]
Feb 12, 2026	We will organize the 2nd Workshop on Test-Time Scaling for Computer Vision (ViSCALE) at CVPR 2026. We warmly welcome paper submissions and hope you’ll join us. [link]
Jan 22, 2026	Two papers, CubeBench and TAPTRv3, have been accepted to ICLR 2026. Congratulations!
Nov 22, 2025	Congratulations to Visincept on raising nearly CNY 100 million (about USD 14 million). It’s great to see the Grounding DINO and DINO series making an even bigger impact in industry. [News]
Oct 15, 2025	Happy to share that I’ve joined the AI Lab at Princeton University as a postdoctoral researcher. See you on the East Coast!
Jun 1, 2025	Happy to graduate from Tsinghua University with a PhD. Huge thanks to my advisors, supervisors, and all my collaborators!
Nov 11, 2024	Invited talk at EECS 542, University of Michigan. [Slides]
Jul 22, 2024	Started my internship at NVIDIA, collaborating with Guilin Liu and Zhiding Yu. See you in the Bay Area, USA.