Hard Krypton Exclusive Interview: WANG Zhongyuan, Dean of Beijing Academy of Artificial Intelligence

https://eu.36kr.com/en/p/3853016586359817

Publish Date: 2026-06-14 23:00:00

Source Domain: eu.36kr.com

Author | Qiu Xiaofen

Editor | Yuan Silai

In the past few months, the “world model” has rapidly grown from academic jargon into a key term in the AI and robotics industries.

Behind the industry’s focus lies real anxiety.

On the one hand, after two years of wild growth, embodied intelligence has exposed the current shortcomings of AI in the physical world. Robots can recognize objects but don’t understand that “pushing a cup will make it fall”; they can understand instructions but can’t predict “how much force is needed to unscrew a bottle cap.” The world model aims to make up for this shortcoming, enabling robots to learn the laws and causality of the physical world.

In other words, the relationship between the world model and embodied intelligence is essentially the relationship between the “brain” and the “body.”

On the other hand, having worked through large language models, vision models, and multimodal models, large models now need to move out of the virtual world and into their next stage: the real world.

However, even as capital, technical talent, and industrial resources pour into this area, there is still no clear answer as to how the world model will actually be applied.

In the view of Wang Zhongyuan, the director of the Beijing Academy of Artificial Intelligence (BAAI), global exploration of the world model is currently splitting into four distinct paths:

The first type is the language-centered world model, including vision-language models (VLMs) and vision-language-action models (VLAs). These models predict the next word in text space and learn the world as described by language, but they cannot understand the underlying physical consequences.

The second type is the pixel-centered world model, such as video-generation models like Sora and Seedance. They learn from videos or images in visual space and learn the world as described by pixels.

The third type is the 3D-structure-centered world model, including 3D reconstruction and the World Labs Marble model from Fei-Fei Li’s team. However, reconstructing a 3D space does not…
