Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone

https://the-decoder.com/oppo-open-sources-android-ai-agent-x-omniclaw-that-uses-your-camera-screen-and-voice-without-leaving-the-phone/

Publish Date: 2026-05-17 03:44:00

Source Domain: the-decoder.com

Oppo’s Multi-X team released X-OmniClaw, an open-source agent that taps into the camera, screen, and voice to get things done in real Android apps, all without routing through a cloud copy of your phone.

In the technical report, Oppo’s AI Center draws a clear line between its approach and cloud phone platforms like RedFinger, Alibaba’s Wuying, and Tencent Cloud Phone. Those services run agents inside virtualized Android instances in a data center. That means they can’t touch local sensors, cameras, or private data.

X-OmniClaw takes the opposite route. It runs directly on the physical Android device. The core logic for perception, control, and app interaction all lives on the phone itself. A cloud language model only gets called in as “fuel” for higher-level reasoning when needed, the report says. The report doesn’t name the specific local models involved, but it does list components like an on-device grounding model and OCR for detecting tappable UI elements.

X-OmniClaw’s full architecture runs on-device. Cloud models only provide “fuel” for complex reasoning, according to Oppo. | Image: Oppo
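
To make the split between on-device logic and cloud “fuel” concrete, here is a minimal sketch in Python. It is not Oppo’s code: the `Task` class, the `needs_reasoning` flag, and the `dispatch` function are hypothetical names chosen for illustration, and the report does not describe how the agent actually decides when to call the cloud model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    text: str
    # Hypothetical flag: True when the on-device components are not enough
    # and higher-level reasoning is required.
    needs_reasoning: bool


def dispatch(task: Task,
             on_device: Callable[[str], str],
             cloud_model: Callable[[str], str]) -> str:
    """Keep the task on the phone unless it is flagged for cloud reasoning."""
    if task.needs_reasoning:
        return cloud_model(task.text)
    return on_device(task.text)


if __name__ == "__main__":
    # Stand-in handlers (assumptions, not real model calls).
    local = lambda text: "device:" + text
    cloud = lambda text: "cloud:" + text
    print(dispatch(Task("tap the search box", False), local, cloud))
    print(dispatch(Task("plan a three-step purchase", True), local, cloud))
```

In this sketch the cloud model is just one interchangeable callable, which mirrors the report’s framing of it as an optional resource rather than the place where the agent runs.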

Camera, screen, and voice feed into a single pipeline

The agent bundles three perception channels into one pipeline. A vision-language model first interprets the scene along with the user’s request before triggering any action.

The perception stack combines text, voice, camera, and screen signals, aligns them in time, and passes a structured intent to the language model. | Image: Oppo

In the researchers’ example, a user asks “How much does this cost on Taobao?” while pointing the camera at a product. The system rephrases that internally to “price of Evian spray on Taobao” and only then hands the structured intent off for execution.
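
The idea of resolving “this” against what the camera sees can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `resolve_reference`, the `max_gap` time window, and the simple word substitution are hypothetical, whereas the article says X-OmniClaw uses a vision-language model for this step.

```python
import re
from typing import List, Tuple


def resolve_reference(utterance: str,
                      utterance_time: float,
                      camera_labels: List[Tuple[float, str]],
                      max_gap: float = 2.0) -> str:
    """Replace the first "this"/"that" with the camera label closest in time.

    camera_labels holds (timestamp_in_seconds, label) pairs. If no label falls
    within max_gap seconds of the utterance, the text is returned unchanged.
    """
    if not camera_labels:
        return utterance
    stamp, label = min(camera_labels,
                       key=lambda item: abs(item[0] - utterance_time))
    if abs(stamp - utterance_time) > max_gap:
        return utterance
    # A function replacement avoids backslash escapes in the label being
    # interpreted by re.sub.
    return re.sub(r"\b(this|that)\b", lambda _m: label, utterance,
                  count=1, flags=re.IGNORECASE)


if __name__ == "__main__":
    labels = [(3.0, "coffee mug"), (9.6, "Evian spray")]
    print(resolve_reference("How much does this cost on Taobao?", 10.0, labels))
```

The time window stands in for the alignment step: a label seen several seconds before the question is ignored, so an object the user has already moved away from does not end up in the query.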

Photo gallery becomes searchable memory

For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle time, gallery photos get processed into compact descriptions of objects, scenes, and events, then stored in a Markdown file.
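
A rough sketch of what such a Markdown memory could look like, again in Python. The `PhotoMemory` fields, the heading layout, and the keyword search are assumptions made for illustration, and the sample entries are invented. The article does not specify the actual file format or how the agent queries it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhotoMemory:
    filename: str
    objects: List[str]
    scene: str
    event: str


def to_markdown(entries: List[PhotoMemory]) -> str:
    """Render one compact Markdown section per photo."""
    lines = ["# Gallery memory", ""]
    for entry in entries:
        lines.append(f"## {entry.filename}")
        lines.append(f"- Objects: {', '.join(entry.objects)}")
        lines.append(f"- Scene: {entry.scene}")
        lines.append(f"- Event: {entry.event}")
        lines.append("")
    return "\n".join(lines)


def search(entries: List[PhotoMemory], keyword: str) -> List[str]:
    """Return filenames whose description mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        entry.filename
        for entry in entries
        if needle in " ".join(entry.objects + [entry.scene, entry.event]).lower()
    ]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    memories = [
        PhotoMemory("IMG_0001.jpg", ["umbrella", "towel"], "beach", "family holiday"),
        PhotoMemory("IMG_0002.jpg", ["cake", "candles"], "living room", "birthday party"),
    ]
    print(to_markdown(memories))
    print(search(memories, "birthday"))
```

Storing short text descriptions instead of raw images keeps the memory small and lets a language model read it directly, which is presumably the appeal of a plain Markdown file.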

The memory module crunches gallery photos during idle time into a Markdown file called Source