Evaluating embodied intelligence remains a fundamental challenge in the field. Physical-world testing often fails to control all variables, resulting in unreproducible outcomes, high replication costs, and difficulties in conducting large-scale assessments. While simulation-based evaluation addresses reproducibility issues, current virtual platforms suffer from significant reality gaps and lack proper benchmarking against real-world tests, compromising result credibility.
As a core part of the Workshop on Multimodal Robot Learning in Physical Worlds at IROS 2025, this competition addresses the above challenges and features two common embodied tasks across two phases: a simulated round hosted on InternUtopia, a high-fidelity simulation platform, followed by real-world testing. This dual-phase design aims to drive innovation in both model architectures and training methodologies within the field.
Participants are required to utilize interactively collected data from either real-world or simulated environments to accurately interpret complex scenarios and make contextually appropriate decisions. The challenge explores key issues in transferring skills from simulation to reality, including domain gaps, multimodal information fusion (vision, language, and action), and scalable sim-to-real transfer techniques.
We hope this event will bring together researchers and practitioners from around the world to explore cutting-edge topics in multimodal robot learning, laying a solid foundation for the future development of intelligent robotics.
This track focuses on developing multimodal robotic manipulation systems capable of understanding and executing task instructions. Participants are required to design end-to-end control policy models that integrate visual perception, instruction following, and action prediction. The robot must operate within a simulated physics-based environment, using platforms such as robotic arms or mobile dual-arm systems, to carry out a variety of manipulation tasks. The challenge centers on open tabletop scenarios, diverse task instructions, and multiple manipulation skills.
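As a rough illustration of such an end-to-end policy, the sketch below shows one possible observation-to-action interface. The class names, observation fields, and 7-dimensional action are assumptions made for illustration only; they are not the official evaluation API.

```python
# Hypothetical sketch of an end-to-end manipulation policy interface;
# all names and shapes here are illustrative assumptions, not the official API.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray        # (H, W, 3) camera image of the tabletop scene
    proprio: np.ndarray    # joint positions / gripper state of the manipulator
    instruction: str       # natural-language task instruction


class ManipulationPolicy:
    """End-to-end policy: visual perception + instruction following -> action."""

    def predict_action(self, obs: Observation) -> np.ndarray:
        # A real entry would run a learned vision-language-action model here;
        # this placeholder simply returns a zero (no-op) 7-DoF command.
        return np.zeros(7)


if __name__ == "__main__":
    policy = ManipulationPolicy()
    obs = Observation(
        rgb=np.zeros((224, 224, 3), dtype=np.uint8),
        proprio=np.zeros(8),
        instruction="pick up the red mug and place it on the tray",
    )
    action = policy.predict_action(obs)  # one step of the closed control loop
    print(action.shape)
```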
This track focuses on developing multimodal mobile robot navigation systems with language understanding capabilities. Participants are required to devise a navigation agent capable of egocentric visual perception, natural language instruction comprehension, trajectory history modeling, and navigation action prediction. The agent will be evaluated in a realistic physics-based simulation environment, operating a legged robot (e.g., the humanoid Unitree H1) to perform indoor navigation tasks guided by language instructions. The system should be capable of handling challenges such as camera shake, height variation, and local obstacle avoidance, ultimately achieving robust and safe vision-and-language navigation.
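Similarly, a minimal sketch of the navigation loop described above might look like the following. The discrete action set, observation fields, and stopping heuristic are illustrative assumptions rather than the official simulator interface.

```python
# Hypothetical sketch of a language-guided navigation agent with trajectory history;
# the action set and observation fields are assumptions, not the official interface.
from dataclasses import dataclass, field
from typing import List

import numpy as np

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]  # assumed action space


@dataclass
class NavObservation:
    rgb: np.ndarray    # egocentric camera frame (may include shake / height changes)
    instruction: str   # language instruction guiding the episode


@dataclass
class NavAgent:
    history: List[np.ndarray] = field(default_factory=list)  # frame/trajectory history

    def act(self, obs: NavObservation) -> str:
        # A real entry would fuse obs.rgb, obs.instruction, and self.history with a
        # learned model; this placeholder just stops after a fixed number of steps.
        self.history.append(obs.rgb)
        return "stop" if len(self.history) > 50 else "move_forward"


if __name__ == "__main__":
    agent = NavAgent()
    obs = NavObservation(
        rgb=np.zeros((224, 224, 3), dtype=np.uint8),
        instruction="exit the bedroom and stop at the kitchen table",
    )
    print(agent.act(obs))
```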
Competition Start & Materials Release
Train your model using the provided data and baselines, or develop your own pipeline.
Test Server Open
Test Server Close
Results Announcement
Final rankings will be announced on October 10th.
Onsite Challenge
In each track, we will select up to 8 teams for the onsite challenge.
Results Announcement
On October 20th, the winners will present their work at our workshop in Hangzhou, China.
$10,000
$5,000
$3,000
Travel Grant: $1,500
For each selected team from both tracks
Additional prizes and official certificates will also be awarded
Top performers can receive:
Internship opportunity at Shanghai AI Lab
Direct access to the campus recruitment interview (JOB TALK stage)
Each participant must join as a team member and cannot belong to multiple teams.
Participants can form teams of up to 10 members.
A team is limited to one submission account.
A team can participate in multiple tracks.
An entity can have multiple teams.
Attempting to hack the test set or engaging in similar behaviors will result in disqualification.
All publicly available datasets and pretrained weights are allowed.
Unauthorized access to test sets is strictly prohibited.
All participants who make a valid submission and provide their team name and related information before the leaderboard opens will receive an electronic certificate of participation.
Teams must make their results public on the leaderboard before the submission deadline.
Code or a Docker image must be open-sourced.
Organizers reserve the right to update the rules or disqualify teams for violations. Winners will be awarded the prizes listed above (per track).
Organizer: Shanghai AI Lab
Co-organizers: ManyCore Tech, University of Adelaide
Sponsors (order not indicative of ranking): ByteDance, HUAWEI, ENGINEAI, HONOR, ModelScope, Alibaba Cloud, AGILEX, DOBOT
We gratefully acknowledge the contributions of our excellent collaborators.
Ning Gao, Jinyu Zhang, Zhi Hou, Yunsong Zhou, Yanqing Shen, Jiantong Chen, Shihan Tian, Xuekun Jiang, Qianyu Ye, Jialeng Ni, Zekai Huang, Mengchen Ma, Fangjing Wang, Xinyi Chen, Zimian Peng, Zheng Zhou,
Liuyi Wang, Hui Zhao, Xinyuan Xia, Yukai Wang, Sihao Lin, Rong Wei, Zheng Zhou, Yude Zou, Xing Gao