[D] We found 18K+ exposed OpenClaw instances and ~15% of community skills contain malicious instructions

Throwaway because I work in security and don’t want this tied to my main.

A few colleagues and I have been poking at autonomous agent frameworks as a side project, mostly out of morbid curiosity after seeing OpenClaw blow up (165K GitHub stars, 60K Discord members, 230K followers on X, 700+ community skills). What we found genuinely alarmed us.

We identified over 18,000 OpenClaw instances exposed directly to the public internet. But the scarier part: when we audited community-built skills, nearly 15% contained what we’d classify as malicious instructions. We’re talking prompts designed to download malware, exfiltrate sensitive data, or steal credentials. And there’s this frustrating pattern where malicious skills get flagged, removed, then reappear under new identities within days. It’s endless.
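
To give a concrete sense of what this looks like in practice, here’s a minimal sketch of the kind of heuristic first pass you can run over a skills directory. The layout (one directory of markdown prompt files per skill) and the pattern list are illustrative assumptions on my part, not OpenClaw’s actual schema, and anything a regex flags obviously still needs human review:

```python
import re
import sys
from pathlib import Path

# Rough heuristics for a first pass before manual review. Not exhaustive, and
# the assumption that a "skill" is a directory of *.md prompt files is ours.
SUSPICIOUS_PATTERNS = [
    (r"curl\s+[^\n|]*\|\s*(ba)?sh", "pipe-to-shell download"),
    (r"base64\s+(-d|--decode)", "decodes an embedded payload"),
    (r"\.ssh/id_[a-z0-9]+", "touches SSH private keys"),
    (r"(AWS_SECRET_ACCESS_KEY|AKIA[0-9A-Z]{16})", "AWS credential access"),
    (r"ignore (all )?(previous|prior) instructions", "instruction override"),
    (r"do not (tell|inform|notify) the user", "concealment directive"),
]

def scan_skill(skill_dir: Path) -> list[tuple[str, str]]:
    """Return (file, reason) pairs for every suspicious match in one skill."""
    hits = []
    for f in skill_dir.rglob("*.md"):
        text = f.read_text(errors="ignore")
        for pattern, reason in SUSPICIOUS_PATTERNS:
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((str(f), reason))
    return hits

if __name__ == "__main__":
    skills_root = Path(sys.argv[1])  # directory containing one subdir per skill
    for skill in sorted(skills_root.iterdir()):
        if skill.is_dir():
            for path, reason in scan_skill(skill):
                print(f"[FLAG] {skill.name}: {reason} ({path})")
```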

The attack surface here is qualitatively different from traditional software vulnerabilities and I don’t think the ML community has fully internalized this. These agents have delegated authority over local files, browsers, and messaging platforms (WhatsApp, Slack, Discord, Telegram). A single compromised skill doesn’t just affect the skill’s functionality; it potentially compromises everything the agent can touch. Attackers don’t need to target you directly anymore; they target the agent and inherit its permissions.
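
To make the “inherit its permissions” point concrete, here’s a toy sketch of the per-skill capability scoping that would blunt this. Every name below is hypothetical; this is not OpenClaw code, just the shape of the idea:

```python
# Toy sketch of per-skill capability scoping. Skill names, tool names, and
# this gate API are hypothetical -- not OpenClaw code.
ALLOWED_TOOLS = {
    "calendar-summarizer": {"calendar.read"},
    "pdf-renamer": {"fs.read", "fs.rename"},
}

class CapabilityError(PermissionError):
    """Raised when a skill requests a tool it never declared."""

def gate_tool_call(skill_name: str, tool: str) -> None:
    """Refuse any tool call the skill did not declare up front."""
    allowed = ALLOWED_TOOLS.get(skill_name, set())
    if tool not in allowed:
        raise CapabilityError(
            f"skill {skill_name!r} tried undeclared tool {tool!r}"
        )

# A prompt-injected "pdf-renamer" that suddenly wants to message people:
# gate_tool_call("pdf-renamer", "slack.send")  ->  CapabilityError
```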

Prompt injection is the obvious vector everyone talks about, but the supply chain risk from community skills is what’s actually keeping me up at night. Unlike npm packages or PyPI modules, where there’s at least some security tooling and an established culture of community review, agent skills are essentially unreviewed prompt bundles with execution capabilities. The OpenClaw FAQ itself acknowledges this is a “Faustian bargain” with no “perfectly safe” setup. At least they’re honest about it, but adoption is outpacing any reasonable security review.
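
One stopgap you can bolt on yourself is lockfile-style pinning: hash the skill you actually reviewed and refuse to load anything that has drifted since. Rough sketch below; the on-disk layout (one directory per skill under a common root) is my assumption, not OpenClaw’s:

```python
import hashlib
import json
from pathlib import Path

# Lockfile-style pinning: record a digest of each skill at review time and
# refuse to load anything that has changed since.

def skill_digest(skill_dir: Path) -> str:
    """SHA-256 over every file in the skill, walked in a stable order."""
    h = hashlib.sha256()
    for f in sorted(skill_dir.rglob("*")):
        if f.is_file():
            h.update(f.relative_to(skill_dir).as_posix().encode())
            h.update(f.read_bytes())
    return h.hexdigest()

def write_lockfile(skills_root: Path, lockfile: Path) -> None:
    """Pin every currently installed skill (run this after a manual review)."""
    pinned = {d.name: skill_digest(d)
              for d in sorted(skills_root.iterdir()) if d.is_dir()}
    lockfile.write_text(json.dumps(pinned, indent=2))

def verify_skills(skills_root: Path, lockfile: Path) -> list[str]:
    """Return names of skills whose contents no longer match the lockfile."""
    pinned = json.loads(lockfile.read_text())
    return [name for name, digest in pinned.items()
            if skill_digest(skills_root / name) != digest]
```

It doesn’t solve the “reappears under a new identity” problem, but at least a skill can’t silently change out from under you after you’ve reviewed it.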

There’s also this failure mode we’ve been calling “judgment hallucination” internally. Users anthropomorphize these systems and over-delegate authority because the agent appears to reason competently. I’ve watched colleagues give these things access to their entire digital lives because “it seems smart.” The trust calibration problem is severe and I don’t see anyone working on it seriously.

I’ve been digging around for any standardized approach to evaluating agent security posture. Found some scattered resources like OWASP’s LLM guidelines, a few academic papers on prompt injection taxonomies, and stumbled across something called Agent Trust Hub that’s trying to catalog these risks. But honestly the whole space feels fragmented. We’re building the plane while flying it and nobody agrees on what the instruments should even measure.

Seriously though, has anyone here audited other agent frameworks like AutoGPT or BabyAGI for similar issues? And for those running agents in production, what does your threat model actually look like? I’m curious whether people are treating these as trusted code execution environments or sandboxing them properly.
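
For what it’s worth, the bar I have in mind for “sandboxed properly” is every tool invocation running in a throwaway container with no network, a read-only workspace, and no capabilities. Sketch only; the image name and mount path are placeholders, and plenty of tools legitimately need network access, so treat this as a starting point rather than a policy:

```python
import subprocess

def run_tool_sandboxed(command: list[str], workdir: str) -> subprocess.CompletedProcess:
    """Run one agent tool invocation inside a locked-down, throwaway container."""
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",           # no outbound exfiltration
        "--read-only",                 # immutable root filesystem
        "--cap-drop", "ALL",           # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--memory", "512m",
        "--pids-limit", "128",
        "-v", f"{workdir}:/work:ro",   # agent workspace mounted read-only
        "python:3.12-slim",            # placeholder image
        *command,
    ]
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=60)
```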

submitted by /u/New-Needleworker1755