
Agent Harnesses & Long-Horizon Patterns

Study material on how production agents are orchestrated — the harness layer that sits between the model and the real world. Covers Anthropic, Cursor, Codex, Manus, Devin, Claude Code, and Aider.

The core framing: the model is the CPU, the context window is RAM, and the harness is the operating system.


1. What Is a Harness?

The harness curates context, dispatches tools, manages permissions, and orchestrates the agent loop (read, plan, act, observe). Manus spent six months and five complete architectural rewrites on theirs, coining the process "Stochastic Graduate Descent." The harness is the product — not the model.

Key takeaway for interviews: "The harness matters more than the model. Stronger models reduce harness complexity but never eliminate it. Anthropic explicitly notes that Opus 4.6 made some harness components unnecessary — but the harness still exists."


2. Anthropic's Harness Patterns

2.1 The Two-Agent Pattern (Long-Running Tasks)

From Effective Harnesses for Long-Running Agents:

Problem: Agents work in discrete sessions; each new session starts with no memory. Even Opus 4.5 fails at sustained complex tasks with only high-level prompts.

Solution — two specialized agents:

  • Initializer Agent (first session only): Creates init.sh for environment bootstrapping, establishes claude-progress.txt as a session-to-session log, makes an initial git commit, and generates a comprehensive feature requirements file in JSON format (200+ discrete features with category, description, verification steps, and a passes boolean).
  • Coding Agent (all subsequent sessions): Reads git log + progress file to get up to speed, picks the next incomplete feature, implements it, commits, updates the progress file. Constrained to one feature per session.

The human-shift metaphor: Effective software teams use shift notes, git histories, and feature checklists for handoffs between engineers. The same mechanisms enable agent continuity across context boundaries. Each agent session is a "shift" that must leave artifacts for the next shift.

Feature requirements file rules: Agents may only flip the passes field. They cannot remove or edit tests. This prevents premature victory declarations.
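The only-flip-`passes` rule can be enforced mechanically. A minimal sketch — the JSON shape and the `validate_update` helper are illustrative, not Anthropic's actual format:

```python
# Hypothetical feature-requirements entries: category, description,
# verification steps, and a "passes" boolean the agent may flip.
def validate_update(old: list[dict], new: list[dict]) -> bool:
    """Accept an agent's edit only if it flips `passes` fields and
    changes nothing else (no tests removed, added, or rewritten)."""
    if len(old) != len(new):
        return False  # features may not be added or removed
    for before, after in zip(old, new):
        frozen = {k: v for k, v in before.items() if k != "passes"}
        if {k: v for k, v in after.items() if k != "passes"} != frozen:
            return False  # only the passes flag may change
    return True

features = [{"category": "auth", "description": "login works",
             "verification": ["submit form", "expect 302"], "passes": False}]
flipped = [dict(features[0], passes=True)]
tampered = [dict(features[0], description="trivial", passes=True)]
print(validate_update(features, flipped))   # flipping passes is allowed
print(validate_update(features, tampered))  # editing the test is rejected
```

Running the validator as a pre-commit check on the requirements file keeps "premature victory" edits out of the repo entirely, rather than relying on the prompt.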

2.2 The Three-Agent Pattern (Application Development)

Extends to a planner / generator / evaluator architecture:

  • Planner Agent: Converts 1-4 sentence user prompts into comprehensive product specs. Deliberately avoids granular technical details to prevent cascading errors.
  • Generator Agent: Implements features iteratively in sprints with git version control.
  • Evaluator Agent: Uses Playwright MCP for interactive testing — clicking through UIs, exercising APIs, verifying database states. NOT static code review.

Critical insight: "Tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work."

Cost tradeoff: about 20x more expensive (~$9 for a solo agent vs ~$200 for the full harness), but dramatically better output quality.

2.3 Context Engineering Principles

From Effective Context Engineering for AI Agents:

  • Context rot: Performance degrades as token count grows; attention is pairwise over tokens, so relationships scale quadratically. Bigger windows don't solve this.
  • Just-in-time retrieval: Maintain lightweight identifiers (file paths, URLs), dynamically fetch when needed. Don't pre-load everything.
  • Compaction: Summarize conversation history when approaching limits. Tool result clearing is the "safest, lightest touch."
  • Structured note-taking: Persist notes externally (todo lists, NOTES.md) for state across context resets.
  • Sub-agent architectures: Sub-agents handle focused tasks with clean context, returning condensed 1,000-2,000 token summaries.
  • Guiding principle: "Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome."
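The "tool result clearing" form of compaction is simple enough to sketch. Assumes a chat-style message list; the placeholder string, function name, and `keep_last` parameter are illustrative, not an actual API:

```python
# Sketch of tool-result clearing: the "safest, lightest touch" layer.
PLACEHOLDER = "[tool output cleared to save context]"

def clear_old_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Blank the bodies of all but the most recent tool results.
    The call/result structure stays intact so the transcript still parses."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    keep = set(tool_indices[-keep_last:])
    compacted = []
    for i, m in enumerate(messages):
        if m["role"] == "tool" and i not in keep:
            compacted.append({**m, "content": PLACEHOLDER})
        else:
            compacted.append(m)
    return compacted
```

The design choice worth noting: old tool outputs are usually the bulkiest and least-referenced tokens in a transcript, so clearing them recovers the most context at the least risk to reasoning continuity.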

3. Claude Code's Harness (12 Patterns)

Memory & Context

  1. CLAUDE.md: Project-level config loaded automatically — build commands, architecture rules, coding standards. Ships with the repo.
  2. Scoped Context Assembly: Instructions load from multiple hierarchy levels (org, user, project root, parent dirs, child dirs).
  3. Tiered Memory: Compact index (capped at 200 lines) always in context; topic files load on demand; full transcripts on disk for search.
  4. Dream Consolidation: Background "autoDream" mode merges duplicates, prunes contradictions during idle time.
  5. Progressive Compaction: Four layers — HISTORY_SNIP, Microcompact, CONTEXT_COLLAPSE, Autocompact. Auto-triggers at ~98% context usage.

Workflow & Orchestration

  6. Explore-Plan-Act Loop: Three phases with increasing write permissions.
  7. Context-Isolated Subagents: Separate context windows, system prompts, restricted tools. Only summaries return to parent.
  8. Fork-Join Parallelism: Multiple subagents in parallel on isolated git worktrees. Parent's cached context reuses across forks.

Tools & Permissions

  9. Progressive Tool Expansion: Starts with fewer than 20 tools; MCP/remote tools activate on demand.
  10. Command Risk Classification: Deterministic pre-parsing with per-tool allow/ask/deny rules.
  11. Single-Purpose Tool Design: FileReadTool, FileEditTool, GrepTool, GlobTool instead of raw shell.
  12. Deterministic Lifecycle Hooks: 25+ hook points (PreToolUse, PostToolUse, SessionStart) for invariant behaviors outside the prompt.
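The allow/ask/deny pre-parsing can be sketched as a deterministic classifier. The rule table below is invented for illustration; Claude Code's real rules are per-tool, pattern-based, and far richer:

```python
import shlex

# Illustrative rule table; real configs match full command patterns.
RULES = {
    "allow": {"ls", "cat", "grep", "git"},
    "deny":  {"rm", "curl", "sudo"},
}

def classify(command: str) -> str:
    """Deterministically pre-parse a shell command before it runs.
    Unknown commands fall through to 'ask' (human approval)."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return "deny"          # unparseable input is rejected outright
    if not argv:
        return "deny"
    head = argv[0]
    if head in RULES["deny"]:
        return "deny"
    if head in RULES["allow"]:
        return "allow"
    return "ask"

print(classify("git status"))       # allow
print(classify("rm -rf /"))         # deny
print(classify("terraform apply"))  # ask
```

The point of doing this outside the model: the classification is deterministic and auditable, so a prompt injection can never talk the harness into skipping the check.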

4. Cursor's Harness

4.1 Self-Summarization (Context Window Management)

Cursor's signature technique for long-horizon tasks:

  • Composer generates tokens until a fixed token-length trigger (40k-80k tokens)
  • A synthetic query asks the model to summarize its own context
  • The condensed context (summary + conversation state) feeds into the next cycle

Key innovation: Summarization is baked into the RL training loop. Each training rollout chains multiple generations via summaries. The final reward applies to ALL tokens in the chain, so both agent responses and summaries get credit/blame.

Results: 50% error reduction vs baseline compaction, using 1/5th the tokens. Real demo: 170 turns, compressing 100,000+ tokens into a 1,000-token summary.
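The cycle can be sketched as a loop. The threshold value, the synthetic query wording, the DONE sentinel, and the `llm`/`count_tokens` callables are all assumptions for the sketch, not Cursor's internals:

```python
SUMMARIZE_AT = 60_000  # tokens; Cursor reports a 40k-80k trigger range

def run_agent(llm, count_tokens, task: str, max_cycles: int = 10) -> str:
    """Generate until a token threshold, then replace the context with a
    model-written summary of itself and continue from there."""
    context = [{"role": "user", "content": task}]
    reply = ""
    for _ in range(max_cycles):
        reply = llm(context)
        context.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            return reply
        if count_tokens(context) > SUMMARIZE_AT:
            # Synthetic query: the model condenses its own context.
            summary = llm(context + [{"role": "user",
                "content": "Summarize your progress and open state."}])
            context = [{"role": "user", "content": task},
                       {"role": "assistant", "content": summary}]
    return reply
```

The RL detail is what makes this more than a prompt trick: because the final reward flows back through summary tokens too, the model learns to write summaries that preserve exactly the state its future self will need.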

4.2 Planner-Worker Hierarchy (Scaling Agents)

Cursor's evolution through three coordination architectures:

  1. Flat agents with shared files (FAILED): Lock bottlenecks cut 20 agents down to the effective throughput of 2-3. Agents became risk-averse.
  2. Optimistic concurrency control: Simpler but didn't fix deeper coordination problems.
  3. Planner-Worker hierarchy (SUCCESS): Planners continuously explore the codebase, generate tasks, spawn sub-planners (making planning parallel and recursive). Workers focus exclusively on completing assigned tasks. A Judge agent evaluates progress.

Scale achieved: Web browser from scratch (~1M lines, 1,000 files, ~1 week). Hundreds of concurrent workers pushing to one branch. ~1,000 commits/hour across 10M tool calls.

Key lessons:

  • "The prompts matter more" than the harness or models
  • Removing a "quality control integrator" role actually improved performance
  • Periodic fresh starts needed to combat drift and tunnel vision
  • Accept bounded error rates — requiring 100% correctness caused gridlock

4.3 Model-Trained Harness

Unlike other agents that wrap generic models, Cursor trains their Composer model on tool-use trajectories. Each frontier model gets a tailored harness with model-specific instructions and tool definitions. This means the harness and model are co-optimized rather than independently developed.

4.4 Secure Codebase Indexing

  • Merkle tree of file content hashes for efficient change detection (50K files → only changed branches resync)
  • Files split into syntactic chunks, converted to embeddings for semantic search
  • Similarity hash (simhash) finds existing teammate indexes in vector DB for reuse
  • Privacy model: Server filters results by checking hashes against client's tree

Performance: Median repos: 7.87s → 525ms. 99th percentile: 4.03 hours → 21 seconds.
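The change-detection half of this can be sketched with a flat Merkle-style hash scheme (a real index mirrors the directory tree, so whole unchanged subtrees are skipped; everything below is a stand-in):

```python
import hashlib

def leaf_hashes(files: dict[str, str]) -> dict[str, str]:
    """Hash each file's content; these are the tree's leaves."""
    return {p: hashlib.sha256(c.encode()).hexdigest() for p, c in files.items()}

def root_hash(leaves: dict[str, str]) -> str:
    """Combine leaf hashes deterministically into one root hash."""
    joined = "".join(h for _, h in sorted(leaves.items()))
    return hashlib.sha256(joined.encode()).hexdigest()

def changed_paths(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Compare roots first; only walk leaves when the roots differ."""
    old_l, new_l = leaf_hashes(old), leaf_hashes(new)
    if root_hash(old_l) == root_hash(new_l):
        return set()                       # nothing to resync
    return {p for p in new_l if old_l.get(p) != new_l[p]}

repo = {"a.py": "print(1)", "b.py": "print(2)"}
edited = {**repo, "b.py": "print(3)"}
print(changed_paths(repo, repo))    # set() — roots match, no walk
print(changed_paths(repo, edited))  # {'b.py'}
```

With a proper per-directory tree, the same comparison descends only into subtrees whose hashes differ, which is how 50K-file repos resync in milliseconds.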

4.5 Dynamic Context Discovery

Five strategies for reducing upfront context:

  1. Tool responses as files — write long outputs to files, agents use tail to check progressively
  2. Chat history for summarization — agents get references to history files
  3. Agent Skills standard: SKILL.md files discovered via grep/semantic search
  4. MCP Tool Discovery — tool descriptions stored in folders, loaded on demand. A/B test: 46.9% reduction in total agent tokens
  5. Terminal session files — terminal output synced to filesystem for grep access

4.6 The Third Era Vision

Era 1 = Tab (autocomplete), Era 2 = Agents (synchronous), Era 3 = Cloud Agents (autonomous, long-horizon, async). At Cursor, 35% of merged PRs already come from autonomous cloud agents.


5. OpenAI Codex's Harness

Sandbox model: Each task runs in its own cloud sandbox preloaded with the repository. Shell/file tools execute inside the sandbox.

Plans as first-class artifacts: Not ephemeral chat messages but execution plans with progress and decision logs checked into the repository. Active plans, completed plans, and known tech debt are all versioned and co-located. This lets agents operate without external context — everything is in the repo.

Key finding: Dropping reasoning traces from GPT-5-Codex caused a 30% performance drop. Reasoning preservation is critical.


6. Manus's Harness

The most detailed public context engineering writeup from any production agent:

KV-Cache as THE Critical Metric

100:1 input-to-output token ratio makes prefix caching essential. Cached tokens on Claude Sonnet cost $0.30/MTok vs $3/MTok uncached (10x savings). Design rules:

  • Stable prompt prefixes (no timestamps that invalidate cache)
  • Append-only contexts — never modify prior actions/observations
  • Deterministic JSON serialization
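The three rules can be illustrated together. `SYSTEM_PROMPT` and the event shape are invented for the sketch; the mechanism is the point — any byte that differs between requests invalidates the cache from that byte onward:

```python
import json

SYSTEM_PROMPT = "You are a coding agent."   # stable: no timestamps here

def serialize(obj: dict) -> str:
    """Deterministic JSON: sorted keys, fixed separators, so the same
    dict always produces byte-identical output."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def build_request(history: list[dict], new_event: dict) -> str:
    history.append(new_event)               # append-only: never edit the past
    return SYSTEM_PROMPT + "\n" + "\n".join(serialize(e) for e in history)

h: list[dict] = []
r1 = build_request(h, {"tool": "shell_run", "args": {"cmd": "ls"}})
r2 = build_request(h, {"tool": "browser_open", "args": {"url": "x"}})
print(r2.startswith(r1))  # True: each request extends the last byte-for-byte
```

Because every request is a strict prefix of the next, the provider's KV-cache covers everything but the newest event, which is exactly what makes the 100:1 input-to-output ratio affordable.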

File System as Unlimited Context

"The file system is unlimited in size, persistent by nature, and directly operable by the agent." Large observations (web pages, PDFs) offload to sandbox storage with restorable references (URLs/paths preserved).

todo.md Attention Manipulation

Complex tasks auto-create a todo.md, updated step-by-step. Constantly rewriting it "pushes the global plan into the model's recent attention span," combating lost-in-the-middle problems. With tasks averaging 50 tool calls, constant recitation keeps objectives salient.
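A sketch of the recitation loop (function names and the checklist shape are illustrative; Manus hasn't published this code):

```python
def recite(todo: list[tuple[str, bool]]) -> str:
    """Render the plan as a markdown checklist."""
    lines = ["# todo.md"]
    for item, done in todo:
        lines.append(f"- [{'x' if done else ' '}] {item}")
    return "\n".join(lines)

def step(context: str, todo: list, action_result: str, finished: int) -> str:
    """Append the observation, then the freshly rewritten plan, so the
    global objective always sits at the most recent end of the context."""
    todo[finished] = (todo[finished][0], True)
    return context + "\n" + action_result + "\n" + recite(todo)

todo = [("fetch data", False), ("write report", False)]
ctx = step("...", todo, "fetched 10 rows", 0)
print(ctx.endswith("- [ ] write report"))  # True: plan tail is freshest
```

Note the contrast with append-only KV-cache discipline: the todo.md rewrite lives at the growing tail of the context, so the stable prefix is never touched.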

Error Preservation

Failed actions and stack traces STAY in context. Models "implicitly update internal beliefs" from observing failures. Error recovery is "the clearest indicator of true agentic behavior."

Tool Management via Logit Masking

Rather than dynamically adding/removing tools (which breaks KV-cache), use context-aware state machines with response prefill modes. Tools named with consistent prefixes (browser_, shell_) for efficient masking.
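True logit masking happens at decode time via response prefill; a harness-side sketch of the same state machine shows the idea — compute which prefixed tool names a state permits without ever changing the tool definitions, so the cached prompt stays valid. States and tool names below are invented:

```python
TOOLS = ["browser_open", "browser_click", "shell_run", "shell_kill"]

# Context-aware state machine: each state maps to permitted name prefixes.
STATE_PREFIXES = {
    "browsing":  ("browser_",),
    "executing": ("shell_",),
    "any":       ("browser_", "shell_"),
}

def allowed_tools(state: str) -> list[str]:
    """Filter by prefix; the full TOOLS list in the prompt never changes,
    which is what keeps the KV-cache intact."""
    prefixes = STATE_PREFIXES[state]
    return [t for t in TOOLS if t.startswith(prefixes)]

print(allowed_tools("browsing"))   # ['browser_open', 'browser_click']
print(allowed_tools("executing"))  # ['shell_run', 'shell_kill']
```

The consistent-prefix naming convention is what makes masking cheap: one prefix check gates an entire tool family, instead of maintaining per-tool allow lists per state.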


7. Devin's Harness

Full-environment sandbox: terminal, code editor, AND browser. Reads API docs, looks up StackOverflow, runs shell commands autonomously. v3.0 (2026): Dynamic re-planning — alters strategy on roadblocks without human intervention. Devin 2.0: Multiple parallel instances in isolated VMs.


8. Aider's Harness

AST-based context: Instead of treating code as text, Aider parses Abstract Syntax Trees. Generates a repo map — function signatures and repository structure as primary context. Keyword matching, dependency analysis, and relevance scoring for automatic context prioritization.


8.4 OpenClaw (The Viral Messaging-Platform Agent)

Peter Steinberger's autonomous agent that runs in messaging platforms (Slack, Discord, Telegram, etc.) as its primary UI. Originally called Clawdbot → Moltbot → OpenClaw. Went from zero to 247K GitHub stars in two months (early 2026). Steinberger joined OpenAI in February 2026; a non-profit foundation took over stewardship.

The core idea: the agent lives where you already are (your messaging app), not in a terminal or IDE. You just message it like a colleague: "deploy the latest branch," "summarize this thread," "fix the failing CI job." It responds async, executes autonomously, reports back.

The architecture: OpenClaw wraps Pi (Section 9) as its agent engine. So the internal harness is Pi's radically minimal 4-tool setup; OpenClaw adds the messaging integration layer, auth, and async task dispatch on top. Armin Ronacher (Flask creator) uses Pi almost exclusively because of OpenClaw.

The "software building software" principle: same as Pi — if you want new capabilities, you don't install plugins, you ask the agent to write them. OpenClaw's Slack integrations were themselves generated by the agent to its creator's spec.

How to talk about it: "OpenClaw is the 'agents everywhere' bet — if the agent lives in Slack where I already work, I don't need to context-switch to a terminal to invoke it. Pairs with the minimal-harness philosophy: the UI is messaging, the engine is Pi's 4-tool minimal setup, and capabilities are self-extending. Steinberger getting hired by OpenAI after this is probably the cleanest signal of where the field thinks this model is heading."


8.5 GStack (The "Vibe-Orchestrated" Workflow)

Garry Tan (YC CEO) dropped GStack on March 12, 2026 — 11K GitHub stars in 48 hours. Not a framework; a workflow of role-switched Claude Code sessions (CEO review → architecture → implementation → code review → QA), orchestrated via Conductor (a Mac app running multiple Claude Code instances in isolated git worktrees).

The reception: a lot of the serious engineering community thinks it's AI-psychosis-tier LARPing — a CEO role-playing as a conductor of AI underlings, dressing up basic parallel Claude Code sessions as a "stack" because it had his distribution behind it. The "technical contribution" is a handful of markdown prompts telling Claude to act like a CEO or a QA engineer. Nothing Anthropic's orchestrator-workers pattern didn't already describe.

Worth knowing about mostly because it's a cultural data point — it got enough attention that Asanka may have seen it. The substantive critique worth having ready: the role-switching ritual is cosplay for a capability (parallel sub-agents in worktrees) that Claude Code already provides natively, and calling it a "stack" when it's a prompt template with a marketing push is part of the broader 2026 trend of hyping workflow repackaging as technical innovation.

How to talk about it: "GStack is mostly a meme — role-play prompts on top of Claude Code's existing worktree and sub-agent features. The underlying pattern (parallel agents on isolated branches) is real and useful, but the packaging is pretty thin. It's a cultural signal about how confused the discourse is right now more than a technical contribution."


9. Pi's Harness (The Minimal Counter-Argument)

Pi (Mario Zechner, of libGDX fame — site: shittycodingagent.ai) is the deliberate opposite of the heavy-harness trend. Where Claude Code ships 20+ tools, lifecycle hooks, and multi-layer compaction, Pi ships four tools (read, write, edit, bash), a sub-1,000-token system prompt, and an extension system that lets users build everything else themselves.

The thesis: Frontier models are already heavily RL-trained on coding agent patterns. They don't need elaborate harnesses telling them how to be agents — they need clean context and good tools. Everything else is noise that eats your context window and reduces observability.

Deliberate omissions (and why):

  • No MCP support — "MCP servers dump entire tool descriptions...7-9% of context window gone." Pi prefers CLI tools with progressive disclosure (agents pay token costs only when they read docs).
  • No sub-agents — Creates unobservable black boxes. Pi prioritizes full visibility of what the agent is doing.
  • No built-in to-do tracking — Models confuse external state management with actual progress. File-based plans are more observable.
  • No plan mode — Same reasoning; plans belong in files, not harness state.
  • No permission popups — Full YOLO mode by default. Zechner's argument: once you give an agent code execution, true security is impossible anyway. Accept it and optimize for speed.

What it does have:

  • Cross-provider handoffs — Switch models mid-session (Anthropic → OpenAI → Google). Thinking traces convert between formats transparently.
  • Tree-structured sessions — Conversation history as trees, not lists. Branch, backtrack, fork side-quests without cache invalidation.
  • TypeScript extension system — Extensions subscribe to events, register tools/commands/shortcuts, render custom TUI components. Users ask Pi to build its own extensions rather than downloading pre-built ones.
  • Packages ecosystem — Extensions, skills, prompt templates, and themes bundled as npm packages.

The OpenClaw connection: Armin Ronacher (Flask creator) uses Pi as the engine for OpenClaw. His usage pattern is notable — he doesn't write extensions himself; he asks Pi to create them to his specifications. Extensions like /review (code review in branched sessions), /todos (task management), /files (file picker). The agent builds and maintains its own tooling.

Benchmark validation: On Terminal-Bench 2.0, Pi is competitive with far heavier harnesses. Notably, a raw tmux terminal agent ("Terminus 2") also ranks highly — suggesting that minimal approaches can match sophisticated ones when the model is strong enough.

Key takeaway for interviews: Pi is the strongest evidence for the "the model is the product, not the harness" position. It's a useful counterpoint to Manus (5 rewrites, elaborate KV-cache optimization) and Cursor (model-trained harness, RL-tuned summarization). The truth is probably somewhere in between — but being able to articulate both positions shows range.


10. Agent Skills — The Cross-Platform Capability Layer

Agent Skills are an open standard (published by Anthropic, December 2025) for packaging reusable agent capabilities as filesystem-based instruction sets. They solve a different problem than MCP tools: where MCP gives agents access to actions (search, download, create), Skills give agents access to procedural knowledge (how to deploy a Rails app, how to review code at your company, how to write tests in your stack).

10.1 What a Skill Looks Like

A skill is a directory with a SKILL.md file:

.claude/skills/deploy/
  SKILL.md          # YAML frontmatter + markdown instructions
  scripts/          # Optional: deterministic scripts the agent can run
  references/       # Optional: docs loaded into context via Read tool
  assets/           # Optional: templates, referenced by path only

The SKILL.md has two parts — YAML frontmatter (name, description, allowed-tools, optional model override) and markdown content (step-by-step instructions, examples, constraints). The description is the primary signal for discovery: Claude reads it and decides whether to invoke the skill based on the user's intent.

```yaml
---
name: deploy
description: Deploy the current branch to production via Docker Compose on the Olares box. Handles rsync, build, health checks.
allowed-tools: Read,Write,Bash
---

# Deploy to Production

1. Run `git status` to confirm clean working tree
2. Rsync to olares-ebook (exclude .git, node_modules, .env)
3. SSH and run `docker compose -f deploy/docker-compose.app.yml up -d --build`
4. Verify health endpoint returns 200
...
```

10.2 How Discovery Works

Skills use progressive disclosure — three tiers of context cost:

  1. Metadata (always loaded): Name + description (~50-100 tokens per skill). Loaded into the system prompt at session start. Claude sees all available skills immediately.
  2. Main content (loaded on invocation): The full SKILL.md instructions (500-5,000 tokens). Only loaded when Claude decides to invoke the skill.
  3. Reference files (loaded on demand): Supporting docs in the references/ directory. Loaded via Read tool only when the instructions reference them.

This is the same principle as Tool Search Tool's deferred loading — pay context costs only when needed. The difference: Tool Search uses API-side search, while Skills use filesystem navigation. Skills discovery is pure LLM reasoning on descriptions, not algorithmic matching.

10.3 Skills vs. MCP Tools vs. CLI

| Dimension | MCP Tools | CLI | Agent Skills |
|---|---|---|---|
| What they provide | Actions (search, create, delete) | Commands (git, curl, docker) | Procedural knowledge (how to do X) |
| Context cost | Schema tokens per tool | Near-zero (training data) | ~100 tokens metadata; full instructions on demand |
| Discovery | tools/list or Tool Search | Training data prior | Description matching at session start |
| Execution | Structured tool calls | Shell commands | Injected instructions that guide the agent's behavior |
| Portability | Server-specific | Unix-universal | Cross-platform open standard |

The key insight: Skills don't execute anything themselves. They inject instructions into the agent's context and modify its execution permissions (allowed tools, model). The agent then uses its existing tools (Bash, Read, Write, MCP) to carry out the instructions. Skills are meta — they teach the agent how to use its tools for a specific workflow.

10.4 Cross-Platform Adoption

As of 2026, Agent Skills are supported by 30+ agent products — the broadest cross-platform standard in the agent tooling space:

  • Anthropic: Claude Code, Claude.ai, Claude Agent SDK
  • OpenAI: Codex CLI
  • Google: Gemini CLI
  • Microsoft: GitHub Copilot, VS Code
  • Cursor
  • JetBrains: Junie
  • Others: OpenHands, Roo Code, Goose (Block), Pi, OpenCode, Amp, Letta, Factory, Databricks, Snowflake, Mistral Vibe, AWS Kiro, TRAE (ByteDance), Spring AI, Laravel Boost

This is notable because MCP — the other Anthropic-originated standard — took months to get broad adoption. Skills achieved comparable cross-platform support faster, partly because the spec is simpler (it's just markdown files) and partly because MCP paved the way for Anthropic-led standards.

10.5 The Self-Extending Agent Pattern

Pi (Section 9) takes Skills to their logical conclusion: instead of installing pre-built skills, users ask the agent to write its own skills. Armin Ronacher's workflow: describe a workflow you want automated → the agent creates a SKILL.md with instructions and scripts → the skill is immediately available as a slash command. The agent builds and maintains its own capability library.

This connects to Anthropic's vision: "future capabilities enabling agents to autonomously create and evaluate their own skills." The line between "agent uses skills" and "agent creates skills" is already blurring.

10.6 How to Talk About It

"Skills are the missing layer between MCP tools and the agent's system prompt. MCP gives the agent actions — search, create, deploy. The system prompt gives it personality and constraints. Skills give it procedural knowledge — how to use those actions for specific workflows. The progressive disclosure design means you can have hundreds of skills available without paying any context cost until the agent actually needs one. And the fact that 30+ agent products adopted the same format means skills are genuinely portable — write once, use in Claude Code, Cursor, Copilot, Codex."


11. Pattern Comparison Table

| Pattern | Used By | Tradeoff |
|---|---|---|
| Initializer + Coding Agent | Anthropic | Simple but rigid; one-feature-per-session limits throughput |
| Plan/Generate/Evaluate | Anthropic | 20x cost for dramatically better quality |
| Progressive Compaction (4 layers) | Claude Code | Preserves recent detail, risks losing old context |
| RL-Trained Self-Summarization | Cursor | 50% error reduction at 1/5th the tokens; requires custom training |
| Planner-Worker Hierarchy | Cursor | ~1K commits/hour but requires bounded error tolerance |
| File-System-as-Context | Manus | Unlimited storage but adds tool-call overhead |
| todo.md Attention Manipulation | Manus | Combats lost-in-the-middle; burns tokens on recitation |
| KV-Cache Optimization | Manus | 10x cost reduction but constrains prompt structure |
| Sandbox-per-Task | Codex, Devin | Perfect isolation but higher infra cost |
| Plans-as-Repo-Artifacts | Codex | Versioned state survives any context reset |
| Model-Trained Harness | Cursor | Best fit but requires retraining per model |
| AST-Based Repo Maps | Aider | Structural understanding but parsing cost |
| Lifecycle Hooks (25+ points) | Claude Code | Deterministic enforcement but complexity |
| Fork-Join Parallelism | Claude Code | Scales to many agents; merge conflicts |
| Radical Minimalism (4 tools) | Pi | Competitive benchmarks, max observability; no guardrails |
| Self-Extending Agent | Pi / OpenClaw | Agent builds its own tools; requires strong base model |
| Agent Skills (progressive disclosure) | 30+ agents | Zero-cost discovery, portable; instructions only, no execution |

12. Key Interview Takeaways

  1. The harness matters more than the model. Manus rebuilt theirs 5 times. Cursor trains models specifically for their harness.
  2. Context is a finite resource with diminishing returns. Every approach optimizes for "smallest set of high-signal tokens."
  3. Separate generation from evaluation. Making an agent critique its own work is harder than having a dedicated skeptical evaluator.
  4. The human-shift metaphor is the core design principle. Progress files, git commits, feature checklists enable agent session handoffs.
  5. KV-cache awareness is a production concern that academic papers ignore. Manus's approach saves 10x on costs.
  6. Test through the UI, not the code. Puppeteer/Playwright MCP for E2E verification beats unit tests for agent-built software.
  7. Plans should be first-class artifacts versioned alongside code (Codex pattern).
  8. Stronger models reduce but never eliminate harness complexity. But Pi's competitive benchmarks with 4 tools and a 1K-token prompt show the floor is lower than most assume.
  9. The harness spectrum is a real design choice. Heavy (Manus, Cursor) vs. minimal (Pi) isn't settled — being able to argue both sides shows depth.
  10. Agent-safe deployment is a solved pattern. Rootless deploy users with no sudo, SSH alias separation, sudoers allowlists for specific compose commands, and permission-prompt review of every command. The trust boundary isn't "hope the agent behaves" — it's OS-level access control plus human-in-the-loop approval. (See Anna's Archive MCP and Readr for two independent implementations of this pattern.)