
TL;DR: AI agents forget everything after context compaction. Every solution that asks the agent to “write notes” fails because the memory problem prevents its own solution. We built a 12-layer persistence stack using SQLite, automated cron extraction, scored behavioral rules, procedural flows, and a heartbeat pulse — all driven by one design principle: separate the note-taker from the worker. The system runs 9 agents autonomously, costs ~$5/day for memory infrastructure, and has survived 6 weeks of production use. Here’s every layer, every failure, and every line of SQL.
The Problem Nobody Talks About
There’s a dirty secret in AI agent infrastructure: your agents forget everything.
Not eventually. Not gradually. Abruptly. One moment your agent knows your entire project history, your preferences, your rejected ideas, your deployment procedures. The next moment — after context compaction or a session boundary — it asks “What are we working on?”
This isn’t a theoretical concern. It’s the #1 issue in the OpenClaw community: 45 hours of context lost to silent compaction. And it’s not unique to OpenClaw. Every framework that runs long-lived AI agents hits the same wall.
Context loss happens three ways:
- Compaction. The conversation gets too long, the system compresses it, and the agent loses detail. Your carefully negotiated decisions become “discussed options.”
- Session boundaries. Cron jobs, scheduled tasks, and new conversations start cold. The agent has zero context about what happened 30 minutes ago.
- The “fresh start” problem. The agent doesn’t know it lost context. It doesn’t ask to recover. It just… starts over.
The natural instinct is to fix this with note-taking. “Just have the agent write to a file after each conversation.” We tried this. Everyone tries this. It doesn’t work.
The agent that needs memory is the same agent that forgets to create it. Asking an amnesiac to maintain a journal works when they remember. They usually don’t.
The Core Insight: Separate the Note-Taker from the Worker
The breakthrough wasn’t a better vector database or a smarter summarization model. It was an organizational insight:
Don’t ask the surgeon to take their own notes during surgery. Have a scribe.
A lightweight cron job — the “court stenographer” — runs every 30 minutes, reads the recent conversation transcript, extracts structured state entries, and writes them to a SQLite database. The worker agent doesn’t need to remember to take notes. Someone else is always doing it.
After context loss, the agent queries the database instead of reloading full conversation logs. Recovery takes milliseconds and costs ~1K tokens instead of 70K+.
This single principle — automated extraction by an independent process — solved the problem that every other approach fails on. And it became the foundation for everything else we built.
The Architecture: 12 Layers of Persistence
Over six weeks of production use with 9 agents, we built a layered persistence stack. Each layer was added because the previous layers weren’t enough. Each one exists because something broke.
Here’s the final stack:
┌──────────────────────────────────────────────────────┐
│ THE PERSISTENCE STACK │
├──────────────────────────────────────────────────────┤
│ │
│ LAYER 12: THE CLOSED LOOP │
│ └─ Boot → pre-action → execute → post-action │
│ └─ Flow effectiveness tracking (pass/fail/%) │
│ └─ Nightly compaction: prune stale, decay unused │
│ │
│ LAYER 11: BACKTEST EXTRACTION │
│ └─ Replay sessions → extract flows + lessons │
│ └─ Seed production databases from historical data │
│ │
│ LAYER 10: UNIVERSAL BOOT (agent-boot.sh) │
│ └─ Rules + capabilities + lessons + flows + rejects │
│ └─ One script, all agents, identical context loading │
│ │
│ LAYER 9: COO AUDIT (morning optimization) │
│ LAYER 8: CROSS-AGENT LOOPS (data bridges) │
│ LAYER 7: NORTH STARS (self-directed missions) │
│ LAYER 6: DISPATCH (capability → agent → flow) │
│ LAYER 5: HEARTBEAT (periodic pulse) │
│ LAYER 4: AMBIENT MEMORY (auto-capture + recall) │
│ LAYER 3: PRE/POST-ACTION GATES │
│ LAYER 2: PROCEDURAL MEMORY (Flows DB) │
│ LAYER 1: BEHAVIORAL MEMORY (Rules Engine) │
│ LAYER 0: STRUCTURED MEMORY (State DB) │
│ LAYER -1: RAW TRUTH (Conversation Logs) │
│ │
├──────────────────────────────────────────────────────┤
│ PHILOSOPHY: Don't trust the agent to remember. │
│ Build systems that remember for it. │
└──────────────────────────────────────────────────────┘
I won’t explain all 12 — some are organizational rather than technical. But the first six layers form the core memory system, and they’re what you’d build first.
Layer 0: Structured Memory (The State DB)
The foundation. A SQLite database with structured entries for decisions, tasks, rejections, and agent state.
Schema
CREATE TABLE state (
id INTEGER PRIMARY KEY AUTOINCREMENT,
created_at TEXT DEFAULT (datetime('now', 'localtime')),
category TEXT NOT NULL,
domain TEXT DEFAULT 'GENERAL',
summary TEXT NOT NULL,
detail TEXT,
status TEXT DEFAULT 'active',
session_date TEXT DEFAULT (date('now', 'localtime')),
superseded_by INTEGER REFERENCES state(id)
);
CREATE TABLE watermarks (
key TEXT PRIMARY KEY,
value TEXT NOT NULL,
updated_at TEXT DEFAULT (datetime('now', 'localtime'))
);
CREATE INDEX idx_state_active ON state(status, id DESC);
CREATE VIEW v_active_state AS
SELECT id, created_at, category, domain, summary, detail
FROM state WHERE status = 'active' ORDER BY id DESC LIMIT 50;
CREATE VIEW v_rejected AS
SELECT id, created_at, summary, detail
FROM state WHERE category = 'rejected' AND status = 'active'
ORDER BY id DESC;
The Extraction Cron
This is the court stenographer. Every 30 minutes during active hours, a Sonnet-class model reads recent conversation and writes structured entries:
openclaw cron add \
--name "State Extractor" \
--cron "*/30 8-23 * * *" \
--agent main \
--session isolated \
--model sonnet \
--timeout-seconds 60 \
--no-deliver \
--message 'STATE EXTRACTION — Read recent messages.
For each meaningful state change, INSERT into state.db.
Categories: decision, task, rejected, agent_state,
context, discovery, learning.
Check watermark first. Update watermark after.
If nothing meaningful happened, just update watermark.'
Why Sonnet, not Haiku: The extractor needs to distinguish a “rejection” from a “preference.” “I don’t want that” is a rejection (score 9, never re-suggest). “Let’s try the other approach” is a decision (score 7, context only). Haiku misses the nuance.
Why every 30 minutes, not every 5: Balances cost vs freshness. Worst case: a 29-minute gap after compaction. The pre-compaction flush (below) mitigates this.
Why watermark-based: If three extraction runs fail in a row, the fourth picks up from the last successful watermark. Gaps auto-backfill. No data loss.
Categories
| Category | When to use | Why it matters |
|---|---|---|
decision | Something was approved or decided | Prevents re-litigating closed discussions |
task | Work started, assigned, or completed | Tracks what’s in flight |
rejected | Someone said NO to something | The most important category. Re-suggesting rejected ideas destroys trust faster than anything else. |
discovery | Found something important | Surfaces insights that would otherwise be lost to compaction |
learning | Lesson from a mistake | Seeds the behavioral rules engine (Layer 1) |
context | Mood, energy, priorities | Soft state that shapes how agents should operate |
Tiered Recovery
After compaction, agents don’t reload everything. They escalate:
Tier 0 (~1K tokens): SELECT * FROM v_active_state;
SELECT * FROM v_rejected;
↳ Enough for 80% of recovery cases.
Tier 1 (~10K tokens): Tier 0 + last 30 messages from session history
Tier 2 (~15K tokens): Tier 1 + daily recap document
Tier 3 (~70K tokens): Full conversation log reload (morning boot only)
Anti-loop protection: Loading too much context triggers another compaction, which triggers another recovery, which loads too much context. Tier 0 is the default because it’s cheap enough to never cause this.
Layer 1: Behavioral Memory (The Rules Engine)
The State DB captures what happened. But some learnings need to persist forever — behavioral rules that survive every session, compaction, and restart.
The problem: a flat rules file grows unbounded. After six weeks, we had 449 rules in one agent’s database. Loading all of them on boot is impossible.
The solution: scored, decaying rules where only the top 20 load on boot.
Schema
CREATE TABLE rules (
id INTEGER PRIMARY KEY AUTOINCREMENT,
rule TEXT NOT NULL,
domain TEXT DEFAULT 'ALL',
source TEXT, -- rejection, learning, discovery
score REAL DEFAULT 5.0, -- 1-10 importance
violations INTEGER DEFAULT 0,
last_reinforced TEXT,
created_at TEXT DEFAULT (datetime('now', 'localtime')),
status TEXT DEFAULT 'active'
CHECK(status IN ('active', 'dormant', 'retired', 'critical'))
);
CREATE VIEW v_boot_rules AS
SELECT id, rule, domain, score, status
FROM rules WHERE status IN ('active', 'critical')
ORDER BY score DESC LIMIT 20;
Scoring
| Score | Status | Behavior |
|---|---|---|
| 9-10 | critical | Always loaded. Never auto-decay. Only a human can retire. |
| 7-8 | active | Loaded on boot. Slow decay if unreinforced 30+ days. |
| 5-6 | active | Available on-demand. Moderate decay. |
| 3-4 | dormant | Not loaded unless searched. Fast decay. |
| 1-2 | retired | Effectively forgotten. Can be revived if re-triggered. |
Natural Selection
Every 24 hours, a compaction cron runs:
- Decay: Rules unreinforced for 7+ days lose 0.5 points per cycle. Critical rules (score 10) are exempt.
- Retire: Rules below score 1 are deleted. They had their chance.
- Merge: Near-duplicate rules (same first 40 characters) are consolidated. The higher-scored version survives with a +0.5 boost.
- Archive: Completed state entries older than 3 days move to archived status.
The result: the steady state is 60-100 rules regardless of how long the system runs. Important things stay. Unimportant things fade. Just like human memory.
The Promotion Pipeline
New rules don’t start at the top. They earn their place:
Conversation
→ State Extractor (30 min)
→ state.db (rejections, learnings)
→ Nightly promotion
→ rules table (score 5)
→ Reinforced by post-action gate
→ Score rises
→ Survives compaction
→ Loaded on boot
User rejections skip the line — they enter at score 9-10, status critical, immediately loaded on every boot.
Layer 2: Procedural Memory (The Flows DB)
State DB tells agents what happened. Rules Engine tells them what not to do. Neither tells them how to do something correctly.
We watched an agent deploy a blog post. It FTP’d directly to production (skipping staging), overwrote the blog index (destroying existing entries), and didn’t git commit. The next time it deployed — after context loss — it made different mistakes.
Every time an agent figures out a procedure from scratch, it’s a gamble. The Flows DB eliminates the gamble.
Schema
CREATE TABLE flows (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL,
domain TEXT DEFAULT 'ALL',
agent TEXT DEFAULT 'any',
trigger_phrases TEXT, -- comma-separated keywords
needs_approval BOOLEAN DEFAULT 0,
steps TEXT NOT NULL, -- JSON array of ordered steps
times_used INTEGER DEFAULT 0,
times_succeeded INTEGER DEFAULT 0,
times_failed INTEGER DEFAULT 0,
effectiveness REAL,
status TEXT DEFAULT 'active',
created_at TEXT DEFAULT (datetime('now','localtime'))
);
How It Works
Before any action, the agent runs a flow check:
$ bash tools/flow-check.sh "deploy to production"
=== FLOW: deploy-to-production ===
Step 1: BACKUP the target file with timestamp
Step 2: READ the existing file first. Understand what's there.
Step 3: EDIT surgically. Never rebuild from scratch.
Step 4: BUILD: run the build command
Step 5: GIT: add, commit, push
Step 6: HAND TO QA: Spawn QA agent with URLs to verify
Step 7: QA MUST PASS before reporting complete
Effectiveness: 94% (32/34 successful)
The flow is the procedure. The agent doesn’t need to remember how to deploy — the flow remembers for it.
Flow Effectiveness
Every flow tracks pass/fail outcomes. After a month:
- A flow at 95% effectiveness is a proven procedure. Trust it.
- A flow at 40% effectiveness is broken. Rewrite it.
- A flow that’s never been used is either poorly triggered or unnecessary. Retire it.
This is the difference between a recipe book and a recipe book with ratings. The system knows which procedures actually work — measured against real outcomes, not theoretical correctness.
Layer 3: The Pre-Action Gate (The Actual Breakthrough)
All three systems above — State DB, Rules Engine, Flows DB — share one fatal flaw: they only work if the agent checks them.
An agent with perfect memory that it never consults is an agent with no memory at all.
In practice, agents rush to execute. A task comes in and the agent’s next action is a tool call, not a context load. The flows are there. The rejection history is there. But the agent bypasses all of it because executing feels like progress.
This is the same reason humans skip checklists. It’s not ignorance — it’s urgency bias.
pre-action.sh
A single script that loads ALL relevant context before any action:
$ bash tools/pre-action.sh trading "place a live trade"
╔══════════════════════════════════════════════════╗
║ PRE-ACTION CONTEXT GATE ║
║ Agent: trading | Domain: TRADING ║
║ Action: place a live trade ║
╚══════════════════════════════════════════════════╝
━━━ FLOW CHECK ━━━
=== FLOW: trade-execution (effectiveness: 91%)
Step 1: Check market data
Step 2: Analyze chart
Step 3: Check options flow
Step 4: Size the position (risk rules)
Step 5: Place the order
Step 6: Set stops
Step 7: Log the trade
━━━ REJECTIONS (do NOT repeat these) ━━━
🚫 Never hold overnight without a stop loss
🚫 Do not trade during FOMC minutes
━━━ ACTIVE STATE ━━━
decision First live trade directive active
context Market volatility elevated
━━━ CRITICAL RULES (score ≥ 8) ━━━
rule score
-------------------------------------- -----
Verify the FILE not the REPORT 10.0
BEFORE deleting anything: read first 10.0
━━━ CAPABILITY CHECK ━━━
✅ IBKR Gateway: running
✅ Market data: connected
⚠️ Options scanner: last health check failed
━━━ GATE COMPLETE — proceed with context loaded ━━━
In one call (~3K tokens), the agent receives: the exact procedure, what NOT to do, what’s currently happening, hard-won rules, and which tools are working. The gate fires before every action. The agent doesn’t choose to check — the system forces it.
The Post-Action Gate
The other half of the loop. After every task:
$ bash tools/post-action.sh trading pass \
"Morning trade plan delivered" "trade-execution" "none"
━━━ POST-ACTION GATE ━━━
✅ PASS — Flow 'trade-execution' used (15 total, 91% effective)
♻️ State logged (consolidated with existing entry)
Gate 1: Grade. Pass or fail. Binary. Did the task succeed?
Gate 2: Update flow. Increment usage counters. Track effectiveness over time.
Gate 3: Log state. One consolidated entry, not five. If a similar entry exists, update it instead of duplicating.
Gate 4: Infrastructure check. Did you learn something? Did you change a script, flow, or schema? Or did you “just add a rule”? If you just added a rule, the lesson is not permanent. (More on why in the Meta-Lesson section.)
Layer 4: Ambient Memory (Push-Based Recall)
Layers 0-3 are all pull-based. The agent has to actively query them. After compaction, the agent doesn’t know it needs context. It’s waiting for input that never comes.
The ambient memory layer flips this. Using hybrid retrieval (vector + BM25 keyword search), the system automatically:
- Captures decisions, facts, and preferences from every turn — both user and assistant messages
- Before the agent sees each new message, retrieves the top 5 most relevant memories and injects them as context
- On session boundaries, distills the session’s learnings into durable memories with logistic decay
The agent doesn’t choose to search its memory. Memory comes to it. After compaction, the very first message triggers auto-recall, which surfaces the most relevant recent context.
Configuration
{
"memory": {
"captureAssistant": true,
"autoCapture": true,
"autoRecall": true,
"autoRecallTopK": 5,
"sessionStrategy": "memoryReflection"
}
}
Critical detail: We ran for 7+ hours with captureAssistant: false — only capturing user messages. But all the real context (decisions, plans, analysis, work output) lives in the assistant’s responses. The database had exactly 1 memory after 7 hours. Flipping captureAssistant to true was the fix.
Layer 5: The Heartbeat (Push-Based Recovery)
Every system above is still reactive. They respond to queries or fire on triggers. But what happens when the agent compacts mid-task and nothing triggers the next turn? The conversation goes silent. The agent doesn’t stall because it’s confused — it stalls because nothing wakes it up.
We spent an entire day building cron-based continuation that spawned isolated sessions — fresh agents with no memory trying to continue work they’d never seen. Dozens of ghost sessions burning tokens doing nothing useful.
The fix was the heartbeat: a periodic turn that fires in the main session, not an isolated spawn. Every few minutes, the heartbeat fires into the existing conversation with full context preserved. Combined with a context loader, each heartbeat:
- Loads recent conversation history and active state
- Checks for active task checkpoints
- If mid-task: continues working immediately
- If idle: replies HEARTBEAT_OK (suppressed, minimal cost)
The heartbeat is the clock signal. It fires regardless of what the agent knows or doesn’t know. It’s the external trigger that says “wake up, check your state, continue if needed.”
How It Compares to Everything Else
Persistent memory is the #1 problem in the agent infrastructure space. Here’s an honest comparison:
| Approach | How it works | Strengths | Fatal flaw |
|---|---|---|---|
| MEMORY.md (native) | Agent writes notes to markdown | Simple, always loaded | Agent forgets to write. The memory problem prevents its own solution. |
| Vector-based plugins | LanceDB/Chroma + hybrid retrieval + summarization | Semantic search, cross-session | Summarization is lossy. “Rejected the model downgrade” becomes “discussed options.” |
| ClawVault | CLI: explicit store/search/checkpoint | Semantic search, local-first | Manual. Requires agent discipline — same problem. |
| MCP Memory Server | Vector store via MCP, SSE transport | Fast reads, cloud sync | Cold start: model skips the “search memory” instruction. |
| Mem0 | Dynamic extraction + consolidation + vector retrieval | 26% accuracy gain, 90%+ token savings | Cloud service. Designed for chat apps, not agent orchestration. |
| Zep | Temporal knowledge graphs + business data | Tracks fact changes over time | Commercial platform. External dependency. |
| Anthropic Compaction | Provider-level automatic summarization | Zero-config | Lossy. No structured state. No “rejected” category. Agent doesn’t know what it lost. |
| Our State DB | SQLite + automated cron extraction + structured categories + tiered recovery | See below | New. Needs production hardening. |
What makes our approach different
Every other solution falls into one of two traps:
Trap 1: Relies on the agent to write memories. MEMORY.md, ClawVault, Obsidian — they all require the agent to take notes. But the agent that needs memory is the same agent that forgets to create it.
Trap 2: Relies on summarization (lossy compression). Vector plugins, compaction APIs, Mem0 — they all use LLMs to summarize conversations. Summaries lose detail. The rejection — often the most critical piece — is gone after compression.
We avoid both traps:
- Automated extraction, not agent discipline. An independent cron extracts state every 30 minutes. The worker doesn’t write its own notes.
- Structured data, not summaries. We extract rows with categories, domains, and statuses. You can query “show me all rejected suggestions” — try that with a summary.
- Rejections are first-class citizens. No other system tracks what someone said NO to. Re-suggesting rejected ideas is the fastest way to destroy trust.
- Tiered recovery prevents the infinite loop. Start with ~1K tokens and escalate only if needed.
- Zero infrastructure. SQLite. No vector DB, no cloud service, no API keys, no embedding models. A single file that works offline and is human-readable.
Research Support
- 65% of enterprise AI failures in 2025 were attributed to context drift/memory loss, not raw context limits (Zylos AI, 2026)
- Factory’s anchored iterative summarization (structured state updated incrementally) scored 4.04 vs 3.74 for full-reconstruction — our State DB is this pattern with structured data instead of prose
- JetBrains research found simple observation masking often outperforms LLM summarization while being cheaper — our structured extraction is masking (preserve the signal, drop the noise), not summarization
What Failed (And the Meta-Lesson That Changed Everything)
Anti-Patterns
-
“Just write to a file after each conversation.” Agents forget. The memory problem prevents its own solution. We tried this first. Everyone does.
-
“Read the full conversation log after compaction.” 70K+ tokens causes another compaction, which triggers another recovery, which loads 70K tokens… infinite loop.
-
“Summarize the day at the end.” By “end of day,” context is already lost. The summary agent is summarizing from a compacted, lossy context.
-
“Use a TODO list.” Same as #1. Nobody maintains it. The TODO list is empty within 48 hours.
-
“Spawn a fresh agent to continue the work.” We did this for an entire day. Dozens of isolated cron sessions, each with zero context, each burning tokens rediscovering what the main session already knew.
The Meta-Lesson: Why Lessons Don’t Stick
This is the most important section.
We logged this lesson on day one: “ALWAYS check capabilities before installing new tools.”
We reinforced it a week later: “Always search memory before asking or researching.”
Nineteen days after the first lesson, we discovered that an agent was reporting API keys as missing (they were in the keychain), service accounts as needed (they’d been configured for weeks), and capabilities as “not built” (they existed in other agent workspaces).
The lessons were recorded. They were never made infrastructure. A rule that says “check first” is behavioral compliance. The agent has to choose to read the rule, choose to follow it, and execute correctly every time. Over 19 days, this failed 100% of the time.
The Fix
The fix was not another rule. The fix was adding an Existing Tools Check directly into pre-action.sh that automatically:
- Searches every wrapper script by keyword
- Searches every agent’s tools directory
- Checks the macOS Keychain for matching credentials
- Displays everything found before the agent starts working
The agent doesn’t choose to check. The gate checks for it. The lesson became infrastructure.
The Pattern
This is the same pattern at every layer:
- Rules decay. A score-10 rule was logged on day one. Nineteen days later, the behavior hadn’t changed.
- Reflections feel like action. Writing “I’ll check next time” feels productive but changes nothing.
- Infrastructure persists. Adding the check to a script means every agent, every session, every task sees existing tools automatically.
The post-action gate now has a fourth check that asks: “Did you change a script, flow, or schema? Or did you just add a rule?” If the answer is “just a rule,” the lesson is flagged as not permanent. The gate challenges the agent to convert the lesson into infrastructure.
The lesson about checking capabilities was itself an example of a lesson that didn’t stick because it wasn’t infrastructure. The recursion is intentional.
Agent Autonomy: North Stars, Feedback Loops, and Self-Direction
Once agents can remember, the next question is: can they work without being told what to do?
North Star Missions
Each agent gets a permanent directive baked into their configuration — a mission that survives every restart:
## NORTH STAR MISSION
This site must rank #1 on Google for its target keywords.
### Self-Directed Work
You do NOT need to wait for the coordinator. Every night:
1. Check analytics for new data
2. Identify gaps: keywords with impressions but no clicks
3. Audit existing content: are titles and metas optimized?
4. Create NEW content targeting unowned keywords
The North Star isn’t a task list — it’s a permission slip. It tells the agent: “You know your domain. If you see something that needs doing, do it.”
Our content agent immediately self-directed: found 12 dead posts, fixed 33 duplicate H1 tags, audited internal links (8 orphans, 14 weak, 48 healthy) — all without a single instruction from the coordinator.
Cross-Agent Feedback Loops
Autonomous agents are only useful if they can see each other’s work. Our intelligence agent pulls analytics data every night. Our content agent publishes posts every night. For weeks, neither knew the other existed.
The fix was embarrassingly simple: filesystem symlinks + staggered crons.
Intelligence agent writes: Content agent reads:
memory/analytics/ memory/seo/analytics/ → (symlink)
weekly-report.md ←──────────────────────────┘
| Time | Agent | Action |
|---|---|---|
| 10:00 PM | Intelligence | Pull analytics → write report |
| 11:15 PM | Content | Read intelligence data → optimize top posts → deploy |
| 12:30 AM | Content | Write new content based on gaps |
No API. No message passing. No coordinator involvement. The intelligence agent writes data, the content agent reads it. Staggered crons ensure the data flows in the right order.
The Standard
Every agent needs three things to self-direct:
- A North Star (what they’re trying to achieve)
- Loaded capabilities on boot (what they can do)
- Data access (what the current state is)
Remove any one and the agent reverts to waiting for instructions.
Results
Over six weeks of production use with 9 agents:
- Context recovery: ~1K tokens via State DB query, down from 70K+ full log reload
- Rule convergence: Systems stabilize at 60-100 active rules per agent regardless of total lessons generated
- Flow library: 48+ proven procedures extracted from production and backtesting, with effectiveness tracking
- Extraction coverage: Watermark-based gap recovery means zero data loss even when cron runs fail
- Self-directed work: Agents with North Star missions identified and fixed issues (dead posts, broken links, missing schemas) without human instruction
- Cross-agent coordination: Staggered cron + symlink pattern eliminated coordination overhead for recurring data flows
Cost
| Component | Frequency | Est. Daily Cost |
|---|---|---|
| State extraction cron | Every 30 min, active hours | ~$2-4 |
| Pre-compaction flush | On compaction events | ~$0.50 |
| Heartbeat (idle) | Every 5-10 min | ~$15-45 (model-dependent) |
| Auto-capture embeddings | Every turn | ~$0.10 |
| Nightly compaction | Once daily | ~$0.25 |
| Total memory infrastructure | ~$18-50/day |
The heartbeat is the expensive component. At 5-minute intervals with a frontier model, it runs ~$45/day. At 10-minute intervals with a cheaper model, ~$15/day. Either way, compare to the cost of an agent re-discovering things, re-asking questions, and wasting human time re-explaining decisions.
Honest Limitations
This system is not done. Here’s what we know doesn’t work perfectly:
-
30-minute extraction gap. If compaction happens 29 minutes after the last extraction, you lose up to 29 minutes of state. The pre-compaction flush mitigates this but depends on the agent cooperating — the same cooperation problem we’re trying to solve.
-
Extraction quality depends on the model. Sonnet-class models are good but not perfect. Subtle rejections or implied decisions get missed. “I’d rather not” might not register as a rejection.
-
No semantic search in the State DB. It’s structured queries only. For “what did we discuss about deployment last week?” you still need vector search. The State DB answers “what’s active right now?” — a complementary role, not a replacement.
-
Rule decay is approximate. The 7-day decay window works in practice but may need tuning. Some rules fade too fast; others linger too long. We haven’t found the perfect constants yet.
-
The heartbeat is expensive. $15-45/day for a periodic pulse is significant for hobbyist use. Production deployments can absorb it; side projects can’t.
-
We’ve been “close” before. Six weeks ago we added the State DB and thought the problem was solved. Four weeks ago we added the Rules Engine. Two weeks ago, the Flows DB. Last week, the heartbeat. Each layer was necessary because the previous layers weren’t enough. There may be a Layer 13 waiting to be discovered.
-
The meta-lesson is recursive. We know that lessons don’t stick unless they’re infrastructure. We’ve built infrastructure to enforce this. But the enforcement itself is a behavioral pattern (running post-action.sh) that the agent might skip. Turtles all the way down.
The Philosophy
Every system we built addresses the same root cause: AI agents are not reliable enough to remember to remember.
- The State DB automates state extraction (the agent doesn’t write its own notes)
- The Rules Engine promotes lessons automatically (the agent doesn’t self-improve)
- The Capabilities DB is maintained by a separate cron (the agent doesn’t catalog itself)
- The Flows DB encodes procedures once (the agent doesn’t re-derive steps)
- The Pre-Action Gate loads all of it automatically (the agent doesn’t choose to check)
- The Post-Action Gate grades and consolidates (the agent doesn’t self-assess)
- Nightly Compaction prunes and sharpens (the agent doesn’t curate its own knowledge)
At every layer, the design principle is the same: remove the agent from the responsibility of maintaining its own memory. External systems observe, extract, store, and serve. The agent just shows up and gets context.
This isn’t a criticism of LLMs. It’s a recognition of where they are today. Language models are extraordinary at reasoning, generation, and tool use. They are terrible at maintaining state across discontinuities. That’s not a model problem — it’s an infrastructure problem. And infrastructure problems have infrastructure solutions.
The “court stenographer” metaphor captures the whole philosophy: don’t ask the surgeon to take their own notes during surgery. Build a scribe. Build it in SQLite. Run it on a cron. And make sure it tracks rejections, because the fastest way to destroy trust in an AI system is to re-suggest something the human already said no to.
That’s it. That’s the whole thing.
Getting Started
If you want to try this yourself, start with Layer 0 (State DB) and the extraction cron. That alone solves 80% of the problem. The schemas are MIT-licensed and work with any SQLite client. The extraction cron runs on OpenClaw but the pattern — independent process reads conversation, writes structured state — works with any agent framework.
Add layers as you need them. You probably won’t need all 12.
Built at iEnable — the AI workforce platform. If you’re building agents that need to remember, we’re building the infrastructure to make that happen. Get in touch.