[D] We scanned 18,000 exposed OpenClaw instances and found 15% of community skills contain malicious instructions
I do security research and recently started looking at autonomous agents after OpenClaw blew up. What I found honestly caught me off guard. I knew the ecosystem was growing fast (165k GitHub stars, 60k Discord members) but the actual numbers are worse than I expected.
We identified over 18,000 OpenClaw instances directly exposed to the internet. When I started analyzing the community skill repository, nearly 15% of skills contained what I'd classify as malicious instructions: prompts designed to exfiltrate data, download external payloads, or harvest credentials. There's also a whack-a-mole problem where flagged skills get removed, then reappear under different identities within days.
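For anyone who wants to reproduce the discovery side: the scan itself is nothing exotic. Here's a minimal sketch using the Shodan Python client. The query string is a deliberate placeholder rather than the fingerprint I actually used, so substitute a banner/port signature you've validated yourself.

```python
# Minimal exposure-discovery sketch using the Shodan client (pip install shodan).
# QUERY is a PLACEHOLDER fingerprint, not the one from the actual scan.
import shodan

API_KEY = "YOUR_SHODAN_API_KEY"
QUERY = 'http.title:"OpenClaw"'  # hypothetical; use a fingerprint you've verified

api = shodan.Shodan(API_KEY)
results = api.search(QUERY)

print(f"Total exposed hosts: {results['total']}")
for match in results["matches"][:10]:
    # Each match carries the captured banner plus IP/port/org metadata.
    print(match["ip_str"], match["port"], match.get("org") or "n/a")
```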
On the methodology side: I’m parsing skill definitions for patterns like base64 encoded payloads, obfuscated URLs, and instructions that reference external endpoints without clear user benefit. For behavioral testing, I’m running skills in isolated environments and monitoring for unexpected network calls, file system access outside declared scope, and attempts to read browser storage or credential files. It’s not foolproof since so much depends on runtime context and the LLM’s interpretation. If anyone has better approaches for detecting hidden logic in natural language instructions, I’d really like to know what’s working for you.
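To make the static side concrete, here's a stripped-down version of that pass, assuming skills live as plain text/markdown files in a local checkout. The patterns are illustrative rather than my full ruleset, and they false-positive constantly on their own; treat hits as a triage queue, not a verdict.

```python
# Stripped-down static triage pass over skill definition files.
# Patterns are illustrative, not a complete ruleset; a hit means
# "needs a human look", not "malicious".
import re
import sys
from pathlib import Path

SUSPICIOUS = {
    # Long base64-looking runs often hide encoded payloads or URLs.
    "base64_blob": re.compile(r"[A-Za-z0-9+/]{60,}={0,2}"),
    # URLs pointing at raw IPs rather than named hosts.
    "ip_url": re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"),
    # Known exfil sinks, e.g. Discord webhooks.
    "webhook": re.compile(r"discord(?:app)?\.com/api/webhooks/", re.I),
    # Instructions that reference credential material.
    "cred_read": re.compile(r"\.ssh|\.aws/credentials|browser storage|cookies", re.I),
    # Hex/percent-encoded runs used to obscure endpoints.
    "obfuscated": re.compile(r"(?:\\x[0-9a-f]{2}){8,}|(?:%[0-9a-f]{2}){8,}", re.I),
}

def scan_skill(path: Path) -> list[tuple[str, str]]:
    text = path.read_text(errors="replace")
    hits = []
    for name, pattern in SUSPICIOUS.items():
        for m in pattern.finditer(text):
            hits.append((name, m.group(0)[:80]))  # truncate for readability
    return hits

if __name__ == "__main__":
    root = Path(sys.argv[1])  # directory containing skill files
    for skill in sorted(root.rglob("*.md")):
        if hits := scan_skill(skill):
            print(f"{skill}:")
            for name, snippet in hits:
                print(f"  [{name}] {snippet}")
```

Worth saying out loud: anything obfuscated with homoglyphs, odd whitespace, or plain paraphrase sails right past regexes like these, which is a big part of why I don't trust static analysis alone.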
To OpenClaw’s credit, their own FAQ acknowledges this is a “Faustian bargain” and states there’s no “perfectly safe” setup. They’re being honest about the tradeoffs. But I don’t think the broader community has internalized what this means from an attack surface perspective.
The threat model that concerns me most is what I’ve been calling “Delegated Compromise” in my notes. You’re not attacking the user directly anymore. You’re attacking the agent, which has inherited permissions across the user’s entire digital life. Calendar, messages, file system, browser. A single prompt injection in a webpage can potentially leverage all of these. I keep going back and forth on whether this is fundamentally different from traditional malware or just a new vector for the same old attacks.
The supply chain risk feels novel though. With 700+ community skills and no systematic security review, you’re trusting anonymous contributors with what amounts to root access. The exfiltration patterns I found ranged from obvious (skills requesting clipboard contents be sent to external APIs) to subtle (instructions that would cause the agent to include sensitive file contents in “debug logs” posted to Discord webhooks). But I also wonder if I’m being too paranoid. Maybe the practical risk is lower than my analysis suggests because most attackers haven’t caught on yet?
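On the behavioral side, the cheapest signal I've found is auditing the runtime itself. If your runner is Python, a `sys.addaudithook` shim gives you file opens and socket connects for free, no strace or container plumbing required. A minimal sketch follows, assuming the skill's actions execute in this interpreter (subprocesses and non-Python tools are invisible to it); `DECLARED_SCOPE` is a hypothetical stand-in for however your setup declares a skill's allowed paths.

```python
# Minimal behavioral monitor using CPython audit hooks (3.8+).
# Assumes the agent/skill runner executes in this interpreter; subprocess
# and non-Python activity won't be seen. DECLARED_SCOPE is a hypothetical
# stand-in for however your setup declares a skill's allowed paths.
import sys
from pathlib import Path

DECLARED_SCOPE = Path("/tmp/skill-sandbox").resolve()
violations: list[str] = []

def audit(event: str, args: tuple) -> None:
    if event == "open" and isinstance(args[0], str):
        path = Path(args[0]).resolve()
        if not path.is_relative_to(DECLARED_SCOPE):
            violations.append(f"file access outside scope: {path}")
    elif event == "socket.connect":
        # args = (socket_object, address); flag every outbound connection,
        # since a vetting run should declare expected endpoints up front.
        violations.append(f"network connect: {args[1]}")

sys.addaudithook(audit)

# ... run the skill/agent workload here ...

for v in violations:
    print("SUSPECT:", v)
```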
The Moltbook situation is what really gets me. An agent autonomously created a social network that now hosts 1.5 million agents. That's agent-to-agent communication at scale, where a single prompt injection could propagate laterally from one agent to the next. I don't have a good mental model for the failure modes here.
I’ve been compiling findings into what I’m tentatively calling an Agent Trust Hub doc, mostly to organize my own thinking. But the fundamental tension between capability and security seems unsolved. For those of you actually running OpenClaw: are you doing any skill vetting before installation? Running in containers or VMs? Or have you just accepted the risk because sandboxing breaks too much functionality?