Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone

Source Domain: the-decoder.com

Oppo’s Multi-X team released X-OmniClaw, an open-source agent that taps into the camera, screen, and voice to get things done in real Android apps, all without routing through a cloud copy of your phone.

In the technical report, Oppo’s AI Center draws a clear line between its approach and cloud phone platforms like RedFinger, Alibaba’s Wuying, and Tencent Cloud Phone. Those services run agents inside virtualized Android instances in a data center. That means they can’t touch the local sensors, cameras, or private data on the user’s own phone.

X-OmniClaw takes the opposite route. It runs directly on the physical Android device. The core logic for perception, control, and app interaction lives on the phone itself. A cloud language model is only called in as “fuel” for higher-level reasoning when needed, the report says. The report doesn’t name the specific local models involved, but it does list components such as an on-device grounding model and OCR for detecting tappable UI elements.

X-OmniClaw’s full architecture runs on-device. Cloud models only provide “fuel” for complex reasoning, according to Oppo. | Image: Oppo

Camera, screen, and voice feed into a single pipeline

The agent bundles three perception channels into one pipeline. A vision-language model first interprets the scene along with the user’s request before triggering any action.

The perception stack combines text, voice, camera, and screen signals, aligns them in time, and passes a structured intent to the language model. | Image: Oppo

In the researchers’ example, a user asks “How much does this cost on Taobao?” while pointing the camera at a product. The system rephrases that internally to “price of Evian spray on Taobao” and only then hands the structured intent off for execution.
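To make the alignment step more concrete, here is a minimal Python sketch of how signals from several channels might be matched in time and folded into one request. It is not Oppo’s code: the `Signal` class, the two-second window, and both function names are assumptions made for illustration, and the simple word substitution stands in for the vision-language model that does the actual rephrasing in X-OmniClaw.

```python
import re
from dataclasses import dataclass


@dataclass
class Signal:
    channel: str      # "voice", "camera", or "screen" (hypothetical labels)
    timestamp: float  # seconds since some shared clock started
    content: str      # transcript, detected object label, or screen summary


def align_signals(signals, window=2.0):
    """Pair the latest voice request with the closest signal per other channel.

    Signals further than `window` seconds from the request are ignored.
    The window size is an assumed value, not one taken from the report.
    """
    voice = [s for s in signals if s.channel == "voice"]
    if not voice:
        return None
    anchor = max(voice, key=lambda s: s.timestamp)

    nearest = {}  # channel -> (gap in seconds, content)
    for s in signals:
        if s.channel == "voice":
            continue
        gap = abs(s.timestamp - anchor.timestamp)
        if gap > window:
            continue
        if s.channel not in nearest or gap < nearest[s.channel][0]:
            nearest[s.channel] = (gap, s.content)

    return {
        "utterance": anchor.content,
        "context": {ch: content for ch, (_, content) in nearest.items()},
    }


def resolve_reference(bundle):
    """Toy stand-in for the model: swap the first 'this' for the camera label."""
    label = bundle["context"].get("camera")
    if label is None:
        return bundle["utterance"]
    return re.sub(r"\bthis\b", lambda m: label, bundle["utterance"], count=1)


signals = [
    Signal("screen", 3.0, "home screen"),
    Signal("camera", 10.2, "Evian spray"),
    Signal("voice", 10.5, "How much does this cost on Taobao?"),
]
bundle = align_signals(signals)
print(resolve_reference(bundle))
```

In this sketch the stale screen reading falls outside the window and is dropped, so only the camera label ends up attached to the spoken request.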

Photo gallery becomes searchable memory

For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle time, gallery photos get processed into compact descriptions of objects, scenes, and events, then stored in a Markdown file.
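A rough Python sketch of what such a Markdown-backed memory could look like follows. The entry layout, the file and photo names, and the keyword search are all assumptions for illustration. The article does not describe the file’s format or how X-OmniClaw queries it, and in the real system the descriptions would come from a model rather than being passed in by hand.

```python
from pathlib import Path


def append_memory(path, photo_name, description):
    """Append one photo description as a Markdown bullet (assumed layout)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"- **{photo_name}**: {description}\n")


def search_memory(path, query):
    """Return the entries that contain every word of the query.

    Plain keyword matching is a placeholder here, not the retrieval
    method used by X-OmniClaw.
    """
    terms = query.lower().split()
    hits = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.startswith("- ") and all(t in line.lower() for t in terms):
            hits.append(line)
    return hits
```

Calling `append_memory("memory.md", "IMG_0002.jpg", "Dog running on a beach")` and then `search_memory("memory.md", "beach dog")` would return that one entry. Keeping the store as plain text means it stays readable and editable by the user.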
