Challenge Background

Evaluating embodied intelligence remains a fundamental challenge in the field. Physical-world testing often fails to control all variables, resulting in unreproducible outcomes, high replication costs, and difficulties in conducting large-scale assessments. While simulation-based evaluation addresses reproducibility issues, current virtual platforms suffer from significant reality gaps and lack proper benchmarking against real-world tests, compromising result credibility.

As a core part of the Workshop on Multimodal Robot Learning in Physical Worlds at IROS 2025, this competition addresses the above challenges and features two common embodied tasks across dual phases: a simulated round hosted on InternUtopia, a high-fidelity simulation platform, followed by real-world testing. This dual-phase design aims to drive innovation in both model architectures and training methodologies within the field.

Participants are required to utilize interactively collected data from either real-world or simulated environments to accurately interpret complex scenarios and make contextually appropriate decisions. The challenge explores key issues in transferring skills from simulation to reality, including domain gaps, multimodal information fusion (vision, language, and action), and scalable sim-to-real transfer techniques.

We hope this event will bring together researchers and practitioners from around the world to explore cutting-edge topics in multimodal robot learning, laying a solid foundation for the future development of intelligent robotics.

Track Introduction

Vision-Language Manipulation in Open Tabletop Environments

This track focuses on developing multimodal robotic manipulation systems capable of understanding and executing task instructions. Participants are required to design end-to-end control policy models that integrate visual perception, instruction following, and action prediction. The robot must operate within a simulated physics-based environment, using platforms such as robotic arms or mobile dual-arm systems, to carry out a variety of manipulation tasks. The challenges are rooted in open tabletop scenarios, diverse task instructions, and multiple manipulation skills.

Key Challenges Include:

  • Effectively fusing visual and linguistic information to drive a unified perception-decision-control pipeline;
  • Robustly interpreting natural language instructions and executing multi-skill manipulation behaviors on physically simulated robotic arms or mobile manipulators;
  • Achieving generalization at both the task and object levels to support diverse and long-horizon manipulation tasks in open tabletop environments.

Vision-and-Language Navigation in Physical Environments

This track focuses on developing multimodal mobile robot navigation systems with language understanding capabilities. Participants are required to devise a navigation agent capable of egocentric visual perception, natural language instruction comprehension, trajectory history modeling, and navigation action prediction. The agent will be evaluated in a realistic physics-based simulation environment, operating a legged robot (e.g., the humanoid Unitree H1) to perform indoor navigation tasks guided by language instructions. The system should be capable of handling challenges such as camera shake, height variation, and local obstacle avoidance, ultimately achieving robust and safe vision-and-language navigation.

Key Challenges Include:

  • Integrating visual and language inputs to drive a unified perception-decision-control pipeline;
  • Ensuring robust performance on a humanoid robot platform within a physics engine, especially under camera shake, dynamic height changes, and local obstacle interactions during walking;
  • Producing human-like navigation behavior to complete instruction-following tasks in complex indoor environments.

Process & Timeline

07/25

Competition Start & Materials Release

01

Registration

Go register >>

02

Download Materials

A baseline model, training datasets, and tutorials are provided to support model development.

Download the baseline model >>
TRACK1 \ TRACK2
Download the datasets >>
TRACK1 \ TRACK2
Download the Tutorial >>
TRACK1 \ TRACK2

03

Model Training

Train your model using the provided data and baselines, or develop your own pipeline.

07/30

Test Server Open

09/30

Test Server Close

10/10

Results Announcement

Final rankings will be announced on October 10th.

04

Model Submission

Submit your solution to the online test server.

Submit the model >>
TRACK1 \ TRACK2

10/18

Onsite Challenge

05

Onsite Challenge

In each track, we will select up to 8 teams for the onsite challenge.

10/20

Results Announcement

06

Workshop Day Champions Announcement

On October 20th, the winners will present their work at our workshop in Hangzhou, China.

Awards

1st

$10,000

2nd

$5,000

3rd

$3,000

Travel Grant: $1,500

For each team selected for the onsite challenge, in both tracks

Additional prizes and official certificates will also be awarded

Top performers can receive:

  • Internship opportunity at Shanghai AI Lab

  • Direct access to the campus recruitment interview, entering the JOB TALK stage directly

General Rules

Eligibility

Each participant must join a team and may not be a member of multiple teams.

Participants can form teams of up to 10 members.

A team is limited to one submission account.

A team can participate in multiple tracks.

An entity can have multiple teams.

Attempting to hack the test set or engaging in similar behaviors will result in disqualification.

Technical

All publicly available datasets and pretrained weights are allowed.

Unauthorized access to test sets is strictly prohibited.

Award & Voucher

All participants who make a valid submission and submit their team name and related information before the leaderboard opens will receive an electronic certificate of participation.

Teams must make their results public on the leaderboard before the submission deadline.

The code or a Docker image must be open-sourced.

Organizers reserve the right to update the rules or disqualify teams for violations. Winners will be awarded the following prizes (per track):

  • 1st Place: $10,000 cash prize, $1,500 travel subsidy, additional prizes, and a certificate
  • 2nd Place: $5,000 cash prize, $1,500 travel subsidy, additional prizes, and a certificate
  • 3rd Place: $3,000 cash prize, $1,500 travel subsidy, additional prizes, and a certificate
  • 4th–10th Place: Prizes and a certificate

Organizer: Shanghai AI Lab
Co-organizer: ManyCore Tech, University of Adelaide
Sponsors (order not indicative of ranking): ByteDance, HUAWEI, ENGINEAI, HONOR, ModelScope, Alibaba Cloud, AGILEX, DOBOT