
AI Agents Built My Security Framework

I’d been hearing a lot about OpenClaw — a self-hosted gateway that lets you message AI coding agents from your phone — and I honestly think it’s reckless. Not the technology itself, but how people use it. Users are handing over usernames, passwords, email credentials, credit card numbers — trusting an AI agent with the keys to their entire digital life. That concerns me.

But watching Jensen talk about it at GTC made me reconsider whether the tool itself was the problem, or just the way people deploy it. I have a family, I’m busy outside my day job at NVIDIA, and the idea of writing code by sending text messages to a bot was genuinely appealing. So I tried it — on my terms, with the system isolated from anything that matters.

The project I built is SafeAgentFramework — a pluggable, security-first Python framework that gates AI agent tool calls through an AWS IAM-style policy engine. Deny-by-default, explicit-deny-wins, every decision audited in application code, not prompts. The irony of using an autonomous AI agent to build a framework for constraining autonomous AI agents was not lost on me.

The part I didn’t expect: most of the implementation was done by GLM-5, an open model running on Together AI for about $30 in API credits. After review, the output is indistinguishable from what the proprietary models produced. And frankly, these models just write code better than I do. They see things I don’t see and are incredibly useful for evaluating and implementing new ideas.

I’ve been thinking about building agents for a while, but everything in the space feels too loose. The common approach to agent safety is prompt-based guardrails — “don’t delete files,” “don’t run dangerous commands.” That bakes security into the model itself, and it’s easily overridden. One clever prompt injection and your guardrails are gone.

SafeAgentFramework takes a different approach: treat the LLM the way we treat humans. We don’t trust humans blindly. We don’t hand over the keys to the kingdom to a new hire on day one. We work all the time with partners, contractors, and open-source contributors we don’t fully trust — and we have mature frameworks for managing that: IAM policies, role-based access control, audit logs, least-privilege principles. These models are like having a PhD in your office who can be socially engineered more easily than many 12-year-olds. Incredibly capable, incredibly dangerous depending on how much power you give them. Why not apply the same guardrails we already use for humans?

All authorization is enforced in application code through IAM-style JSON policy documents. The LLM never sees the policies, can’t probe them, and gets the same opaque error whether a tool doesn’t exist or it’s not authorized to use it.
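
To make that concrete, here is roughly the shape of such a policy. This is an illustrative sketch of the IAM style, not SafeAgentFramework’s exact schema:

```python
# Illustrative policy in the IAM style described above; the field names here
# are an assumption, not SafeAgentFramework's published schema.
policy = {
    "Version": "1",
    "Statement": [
        # Explicit allow: read-only filesystem access inside the workspace.
        {"Effect": "Allow", "Action": "filesystem:read", "Resource": "workspace/*"},
        # Explicit deny always wins, even if another statement allows the action.
        {"Effect": "Deny", "Action": "shell:execute", "Resource": "*"},
    ],
}
# Anything not matched by any statement is denied by default.
```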

```mermaid
%%{init: {'theme': 'default'}}%%
flowchart LR
    A["LLM tool call"] --> B["ToolDispatcher"]
    B --> C["PolicyEvaluator"]
    C -->|Denied| D["Generic error"]
    C -->|Allowed| E["Module.execute()"]
    B --> F["AuditLogger"]
```

Three concepts: Modules provide tools (filesystem, shell, etc.). Policies are JSON documents that allow or deny actions on resources. The ToolDispatcher is the single enforcement point — every tool call passes through policy evaluation before execution. No bypass, no skip-auth flags, every decision audited.
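
A minimal sketch of that enforcement point, with class names taken from the diagram but every signature assumed:

```python
# Minimal sketch of the single enforcement point described above. The class
# names mirror the diagram; the signatures are assumptions, not the real API.
class PolicyEvaluator:
    def __init__(self, statements: list[dict]):
        self.statements = statements

    def evaluate(self, action: str) -> str:
        decision = "deny"                      # deny-by-default
        for stmt in self.statements:
            if stmt["Action"] == action:
                if stmt["Effect"] == "Deny":
                    return "deny"              # explicit deny wins immediately
                decision = "allow"
        return decision

class ToolDispatcher:
    def __init__(self, modules: dict, evaluator: PolicyEvaluator, audit):
        self.modules, self.evaluator, self.audit = modules, evaluator, audit

    def dispatch(self, action: str, **kwargs):
        decision = self.evaluator.evaluate(action)
        self.audit.log(action=action, decision=decision)  # every decision audited
        if decision != "allow" or action not in self.modules:
            # One opaque error for "denied" and "does not exist" alike, so the
            # LLM can't learn about tools or policies by probing errors.
            raise RuntimeError("tool call failed")
        return self.modules[action].execute(**kwargs)
```

Deny-by-default falls out of the structure: unless some statement explicitly allows the action, the dispatcher never reaches the module.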

I had a Beelink S13 mini PC sitting around — 16GB RAM, 500GB disk, bought out of curiosity and never used for anything. I also had a backup T-Mobile LTE router I use when my internet goes down. I wired the Beelink to the T-Mobile router to keep it isolated from my home network, and set up Cloudflare tunnels for remote management.

I installed Ubuntu 24.04 Server on the Beelink, then deployed OpenClaw via Docker Compose to keep everything containerized. Added my OpenAI, Anthropic, and Together AI API keys and enabled the models I wanted in the OpenClaw config.

For chat interfaces, I tried the built-in web chat, Slack, and Discord. Slack was frustrating — you can’t pass OpenClaw commands like /new or /compact. Discord was the best experience by far, much easier to use. I avoided some of the other channel options because they felt insecure.

The development process used three distinct AI agents across two tools, orchestrated by me as engineering manager.

```mermaid
%%{init: {'theme': 'default'}}%%
flowchart LR
    A["Architect\nOpus 4.6 via Cursor"] --> B["Developer\nGLM-5 via OpenClaw"]
    B --> C["Senior Reviewer\nOpus 4.6 via OpenClaw"]
    C -->|Should-fix issues| B
    C -->|Clean| D["Engineering Manager\nHuman"]
    D --> E["Merge"]
```

Phase 1 — Architecture and issue creation. A Cursor agent (Opus 4.6) evaluated the codebase, identified problems, and created 21 detailed GitHub issues with prioritized recommendations and code examples. This is the planning layer — investing in a top-tier model for problem identification and solution design.

Phase 2 — Implementation. The OpenClaw agent handled the actual coding. It had its own dedicated GitHub account with SSH key — fork, branch, implement, PR, all autonomous.

My prompt for each issue was roughly:

“Please work on this github issue. You need to make sure that you rebase your fork. Then please write the code to address the issue in a feature branch off of your fork. Then open a PR against the upstream repo. Once done with the PR, work with senior engineer for PR review and have him post his feedback in the PR. Please try and work with him in a continuous loop until he no longer has should fix issues.”

The agent would occasionally forget to rebase the fork, requiring a re-prompt. I also ran out of API credits a few times and had to top up. Otherwise the workflow was largely autonomous.

Phase 3 — PR review. A “senior engineer” sub-agent within OpenClaw — part coding expert, part security expert, part general IT expert — whose sole job was PR review. This agent always used Opus 4.6, regardless of which model did the implementation. The key prompt design: iterate in a continuous loop until the reviewer has no remaining “should fix” issues, so the loop terminates only when the reviewer is satisfied.
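
In pseudocode, the loop that prompt asks for looks roughly like this (the `developer` and `reviewer` objects stand in for the OpenClaw agents and are not real APIs):

```python
# Rough shape of the convergence loop described above; `developer` and
# `reviewer` are placeholders for the OpenClaw agents, not real APIs.
def review_until_clean(developer, reviewer, pr) -> None:
    while True:
        comments = reviewer.review(pr)                        # always Opus 4.6
        should_fix = [c for c in comments if c["severity"] == "should-fix"]
        if not should_fix:
            return                                            # converged: hand off to human
        developer.address(pr, should_fix)                     # implementation model iterates
```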

Phase 4 — Human review and merge. I manually reviewed each PR and merged. My role was steering and ship/no-ship decisions, not line-by-line coding.

The initial build used GPT-5.4/5.4-pro and Sonnet 4.6 for implementation, with Opus 4.6 for PR review.

Result: ~1,900 lines of production code across 18 source files, ~176 tests, and a full CI pipeline with Ruff, Bandit, mypy (strict), and pytest with coverage.

To test the quality, I had a separate Cursor agent (Opus 4.6) perform an independent code review — completely unrelated to the agents that wrote the code.

| Category | Score |
| --- | --- |
| Architecture & Design | 5/5 |
| Code Quality | 4/5 |
| Security Posture | 4/5 |
| Type Safety | 5/5 |
| Test Quality | 4/5 |
| Documentation | 4/5 |
| CI/CD | 3/5 |
| Developer Experience | 3/5 |

The reviewer called the architecture document “production-grade” and the ToolDispatcher “the crown jewel” — a single enforcement point with no bypass, fail-closed on audit failure, and opaque error messages that prevent the LLM from learning about tool existence or policy structure through error probing.

The test suite impressed: path traversal prevention tests, information leakage verification, atomicity guarantees on registry operations, thread-safe audit logging, and fail-closed behavior when the audit logger fails. These are the kinds of security-focused tests often missing even in production codebases.

Code consistency was called “remarkable” — all 18 files follow identical patterns for imports, docstrings, error handling, and type annotations. The reviewer noted this is “unusual for multi-model agentic output” and credited the PR review process.

The review also found real problems:

  • ModuleRegistry.dispatch() bypasses the policy engine. A convenience method that calls module.execute() directly, skipping authorization entirely. A security foot-gun.
  • Shell module leaks the full host environment to subprocesses. os.environ.copy() as the base environment means every subprocess inherits API keys, tokens, database URLs — anything in the host’s environment variables.
  • No session TTL. Abandoned sessions accumulate in memory forever. In a long-running server, this is a slow memory leak.
  • The README still says “implementation has not started.” The codebase has 18 source files and 176 tests. This is a classic multi-agent coordination failure — the README was written early, and as implementation progressed, no agent was tasked with updating it.
  • chromadb and tiktoken are core dependencies but unused. The architecture doc mentions RAG capabilities that haven’t been implemented yet. The stub issue included them, and the review agent didn’t flag unused dependencies.

21 issues filed. All addressed. This is where it gets interesting.

The initial build burned through about $30 in Anthropic credits and $30 in OpenAI credits in roughly a day. That’s not sustainable. I looked into Together AI, which offers open models at much lower prices, and switched to them for the remediation work. I tried several models:

  • Kimi 2.5 — Reasonable for early issues.
  • Qwen 3.5 — Similar to Kimi, reasonable.
  • DeepSeek — Abandoned. High token consumption without proportional output quality.
  • GLM-5 — The standout. Used for the majority of the remediation work.

The review agent stayed constant: Opus 4.6 for every PR, regardless of which model wrote the code. This effectively created a controlled experiment — same quality bar, different implementation models.

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Production code | ~1,900 lines | ~3,500+ lines | +84% |
| Test count | ~176 | 278 | +58% |
| Test files | 12 | 14 | +2 |
| CI coverage threshold | 0% | 85% | |
| Issues closed | 0/21 | 21/21 | 100% |

The quality of the fixes is what matters. Three examples:

Session manager rewrite. Not a patch — a complete rewrite. OrderedDict for LRU ordering, configurable TTL (timedelta, float, or None), max sessions with LRU eviction, max messages with FIFO trimming, eviction callbacks with exception suppression, RLock for thread safety (not Lock, since add_message calls get() internally), lazy cleanup. 35 dedicated tests. This is production-grade code, written by GLM-5.
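
Here is a compressed sketch of those mechanics. The real class is far larger, and these signatures are my assumptions, not the framework’s API:

```python
# Compressed sketch of the session-manager mechanics described above: LRU
# ordering via OrderedDict, configurable TTL, LRU eviction, RLock, lazy cleanup.
import threading
import time
from collections import OrderedDict

class SessionManager:
    def __init__(self, ttl: float | None = 3600.0, max_sessions: int = 1000):
        self._sessions: OrderedDict[str, dict] = OrderedDict()
        self._ttl, self._max = ttl, max_sessions
        self._lock = threading.RLock()   # RLock: add_message re-enters via get()

    def get(self, sid: str) -> dict:
        with self._lock:
            self._evict_expired()                 # lazy cleanup on access
            session = self._sessions[sid]         # KeyError if expired/evicted
            session["touched"] = time.monotonic()
            self._sessions.move_to_end(sid)       # refresh LRU position
            return session

    def create(self, sid: str) -> dict:
        with self._lock:
            if len(self._sessions) >= self._max:
                self._sessions.popitem(last=False)   # evict least recently used
            self._sessions[sid] = {"messages": [], "touched": time.monotonic()}
            return self._sessions[sid]

    def _evict_expired(self) -> None:
        if self._ttl is None:
            return
        now = time.monotonic()
        for sid in [s for s, v in self._sessions.items()
                    if now - v["touched"] > self._ttl]:
            del self._sessions[sid]
```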

Shell environment security. The fix didn’t just block PATH — it blocked 22+ injection vectors across 7 languages and platforms: LD_PRELOAD, DYLD_*, BASH_ENV, PYTHONPATH, NODE_OPTIONS, PERL5OPT, RUBYOPT, and more. A minimal safe PATH replaced os.environ.copy(). An explicit allowlist pattern for environment variable passthrough. 10 dedicated security tests. This goes well beyond the minimum recommendation.
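
The core idea is building the environment up from empty rather than copying the host’s. A hedged sketch, with illustrative rather than actual variable lists:

```python
# Sketch of the allowlist pattern described above; SAFE_PATH and ALLOWED are
# illustrative values, not the framework's actual lists.
import os
import subprocess

SAFE_PATH = "/usr/local/bin:/usr/bin:/bin"   # minimal PATH, not the host's
ALLOWED = {"LANG", "LC_ALL", "TZ"}           # explicit passthrough allowlist

def safe_env(extra: dict[str, str] | None = None) -> dict[str, str]:
    env = {"PATH": SAFE_PATH}
    env.update({k: v for k, v in os.environ.items() if k in ALLOWED})
    # Nothing else crosses over: no LD_PRELOAD, PYTHONPATH, NODE_OPTIONS,
    # API keys, or tokens, because we build up from empty rather than copy.
    if extra:
        env.update(extra)
    return env

subprocess.run(["env"], env=safe_env(), check=True)
```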

Tool name sanitization. OpenAI’s API rejects colons in tool names, but the framework uses module:tool format internally. The solution: sanitize_tool_name() and restore_tool_name() translate at the EventLoop boundary, keeping internals unchanged. But the implementation went further — adding a _validate_name constraint in ToolDescriptor that rejects "__" in tool names, making the sanitize/restore round-trip provably bijective. This prevents an entire category of bugs by design.
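
The round trip is easy to illustrate; the separator logic here is my reading of the design described above:

```python
# Illustration of the sanitize/restore round trip. Because _validate_name
# rejects "__" in registered names, the mapping is bijective by construction.
def sanitize_tool_name(name: str) -> str:
    return name.replace(":", "__")           # "fs:read_file" -> "fs__read_file"

def restore_tool_name(name: str) -> str:
    return name.replace("__", ":", 1)        # the first "__" is the separator

def _validate_name(name: str) -> None:
    if "__" in name:
        raise ValueError("'__' is reserved for the sanitize/restore round trip")

assert restore_tool_name(sanitize_tool_name("fs:read_file")) == "fs:read_file"
```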

The independent Cursor agent reviewed the codebase again after all 21 issues were closed.

| Category | Before | After |
| --- | --- | --- |
| Architecture & Design | 5/5 | 5/5 |
| Code Quality | 4/5 | 5/5 |
| Security Posture | 4/5 | 5/5 |
| Type Safety | 5/5 | 5/5 |
| Test Quality | 4/5 | 5/5 |
| Documentation | 4/5 | 5/5 |
| CI/CD | 3/5 | 4/5 |
| Developer Experience | 3/5 | 4/5 |
| Overall | 4/5 | 5/5 |

But it also found new issues — and this is the honest, interesting part.

Session lock leak. The new TTL eviction in SessionManager creates a gap: when a session is evicted, the corresponding asyncio.Lock in EventLoop is orphaned. Each lock is small (~200 bytes), so this is a slow leak, but it accumulates. This was introduced by the TTL fix — a classic cross-cutting regression. Each PR is correct in isolation, but the interaction between them creates a gap.

README Quick Start example is wrong. The example instantiates LLMClient directly, but LLMClient is a typing.Protocol, not a concrete class. Anyone copying the example gets a TypeError. Same class of documentation-lag error as the original “implementation has not started” README — the agent writing docs didn’t verify the example actually runs.
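
The failure is easy to reproduce. Assuming the protocol looks something like this (the method is hypothetical), instantiating it fails at runtime:

```python
# Why the README example fails: a typing.Protocol class defines an interface,
# not a constructor. Running this reproduces the TypeError a reader would hit.
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

client = LLMClient()   # TypeError: Protocols cannot be instantiated
```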

PolicyStore.load() still not atomic. If the third of five policy JSON files has an invalid version, the first two are already appended. The store is left partially loaded. The staging-dict pattern that ModuleRegistry uses correctly wasn’t applied here.
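
The staging pattern it should use is small. A sketch, assuming a simple version check:

```python
# Sketch of the staging pattern the review recommends (the one ModuleRegistry
# already uses): validate everything into a local list, publish only on success.
import json
from pathlib import Path

def load_policies(paths: list[Path]) -> list[dict]:
    staged: list[dict] = []
    for path in paths:
        doc = json.loads(path.read_text())
        if doc.get("Version") != "1":        # fail before anything is published
            raise ValueError(f"unsupported policy version: {path}")
        staged.append(doc)
    return staged

# store.policies = load_policies(files)      # single swap: all-or-nothing
```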

Agentic workflows introduce the same class of cross-cutting regressions that human teams do. Issue-by-issue development is inherently local. The fix is the same for both: cross-cutting reviews after batches of related changes.

This is what surprised me most. My ranking of models for agentic code generation:

  1. Anthropic Opus 4.6 — The best reviewer I used. Kept constant as the PR review agent throughout. Worth every token for that role.
  2. Anthropic Sonnet 4.6 — Best implementation model overall, but expensive.
  3. GLM-5 — Very close to Sonnet 4.6. Much cheaper via Together AI.
  4. OpenAI GPT-5.4 Pro — Good, not as good as Sonnet 4.6, but capable enough. Burned through credits fast.
  5. Kimi 2.5 / Qwen 3.5 — Reasonable for simpler issues.
  6. OpenAI GPT-5.4 — Poor. Underperformed across the board.
  7. DeepSeek — Poor. High token burn, abandoned.

GPT-5.4 was the biggest disappointment. GPT-5.4 Pro was noticeably better but still not worth the cost compared to Sonnet 4.6. This suggests that model quality for agentic code generation doesn’t strictly follow general capability benchmarks. The ability to follow complex multi-step instructions — fork, rebase, branch, implement, test, open a PR, iterate on review feedback — may be a distinct skill from general reasoning.

DeepSeek’s failure is also informative: “open” doesn’t automatically mean “cheaper.” A model that consumes many tokens without producing usable output is more expensive than a capable model that gets it right in fewer iterations.

The total cost for the entire remediation cycle — 21 issues, ~19 PRs, 36 changed files — was about $30 in Together AI credits and $5-10 in Anthropic credits for the Opus 4.6 review agent. That’s roughly half what the initial build cost, for significantly more work.

This points to an optimal cost structure for agentic development:

  • Architecture/planning: Invest in a top-tier model (Opus 4.6 via Cursor) for design and issue specification.
  • Implementation: Use the best-performing open model via a cheap inference provider (GLM-5 on Together AI).
  • Review: Invest in a top-tier model (Opus 4.6) as the quality gate.
  • Human: Final review and merge only.

It’s more expensive than Cursor for the same amount of code, but the tradeoff is autonomy. When the agent just does its work — forks the repo, creates a branch, implements the fix, opens a PR, iterates on review feedback, all without you touching a keyboard — that’s a different kind of value.

The review loop is the quality gate, not the implementation model. GLM-5’s output, after iterating with the Opus 4.6 reviewer until all “should fix” issues were resolved, is indistinguishable from Sonnet 4.6’s output after the same process. A strong reviewer compensates for variance in implementation model quality. The analogy to human teams is exact: a senior tech lead’s review elevates code from junior developers.

Open models can produce professional-grade code. The majority of the remediation — 36 changed files, +5,144 lines — was done by GLM-5. After review, the code reads as if written by a single experienced developer. The assumption that frontier proprietary models are necessary for high-quality code generation needs revisiting.

Code consistency comes from the review, not the author. Despite being produced by 4+ different models across multiple PRs, the codebase has uniform imports, docstrings, error handling, and naming conventions. Ruff enforcement plus Opus 4.6 review is the normalizing force.

Agentic workflows have the same failure modes as human teams. Cross-cutting regressions, stale documentation, coordination gaps between independently-correct changes. The fixes are the same too: cross-cutting reviews after batches of related changes, documentation verification steps, better workflow design.

The human role shifts from writing code to engineering management. Architecture decisions, threat modeling, model selection, ship/no-ship judgment. The code itself is delegated. Whether that’s a good thing depends on what you think engineering is.

I still think OpenClaw is reckless — but the way I used it isn’t. The agent can install its own skills at will, and the platform encourages handing over credentials to email, messaging services, and more. I’m not comfortable with that. But my setup — an isolated mini PC on a separate network, running in Docker, with nothing on it except API keys and a GitHub SSH key for a dedicated bot account — contained the blast radius to almost nothing. The tool worked well for my limited use case. I just wouldn’t hand it the keys to my life.

Is SafeAgentFramework actually useful? Honestly, I’m not sure yet. It provides a level of audit and authorization that many agentic setups don’t have, but you could achieve similar results by running your agent as a restricted OS user, limiting API key permissions, and using container sandboxing. Whether a dedicated policy engine is necessary or just interesting is an open question. But it’s interesting enough that I’m going to keep working on it. Maybe I’ll use it for my own agents someday.

What’s missing: human in the loop. Allow and Deny are the bare minimum for constraining an agent, but real-world use cases aren’t always that binary. Sometimes you want the agent to do something semi-dangerous — delete a directory, run an unfamiliar command, push to a production branch — but only after a human approves it. The framework needs a third policy effect beyond Allow and Deny: something like HumanApproval that pauses execution, sends a notification via messaging or some other interface, and waits for explicit human approval before the agent can proceed. Routing those approvals through the same chat channels you’re already using would be the natural fit. That’s probably the next big feature.
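
A hypothetical sketch of what that third effect could look like; none of these names exist in the framework yet:

```python
# Hypothetical shape of a HumanApproval policy effect, as described above.
# Every name here is an assumption; nothing like this ships today.
from enum import Enum

class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"
    HUMAN_APPROVAL = "human_approval"

async def authorize(action: str, effect: Effect, notify, wait_for_approval) -> bool:
    if effect is Effect.ALLOW:
        return True
    if effect is Effect.HUMAN_APPROVAL:
        await notify(f"Approval needed: {action}")   # e.g. a Discord message
        return await wait_for_approval()             # block until a human replies
    return False                                     # Deny, or anything unknown
```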


SafeAgentFramework on GitHub · OpenClaw