How It Works
The problem with if/else
With a normal program, exceptions carry enough information to branch on:
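A minimal sketch of that kind of branching in conventional code. The `load_config` helper is hypothetical and exists only for illustration; the point is that the exception type alone is enough to choose a code path.

```python
import json


def load_config(path: str) -> dict:
    """Load a JSON config file, branching on the exception type (hypothetical helper)."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        # Missing file: the exception type alone says to fall back to defaults.
        return {}
    except json.JSONDecodeError as exc:
        # Malformed file: the exception carries the position of the error.
        raise ValueError(f"invalid config at line {exc.lineno}") from exc
```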
With an LLM agent, the same RuntimeError("something went wrong") can mean ten different things depending on what the agent was doing, what it called, and what it said before failing. The failure lives in the trajectory, not the exception:
- A loop isn't an exception — it's three identical steps
- A hallucination isn't an exception — it's the LLM asserting something the tools didn't return
- A schema mismatch might surface as a ValueError, a JSONDecodeError, or a model-specific error string
Branching with if/else on the exception string misses most of this. It also collapses distinct failures into one code path, and the logic has to be reimplemented in every agent.
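To make the first bullet concrete, here is a rough sketch of a check that reads the trajectory rather than the exception. The `Step` dataclass and its fields are hypothetical stand-ins, not the library's actual model, and `looks_like_loop` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for a recorded step; the real Step fields may differ."""
    action: str
    tool_input: str
    output: str


def looks_like_loop(trajectory: list[Step], window: int = 3) -> bool:
    """Return True when the last `window` recorded steps are identical."""
    if len(trajectory) < window:
        return False
    tail = trajectory[-window:]
    return all(step == tail[0] for step in tail)
```

No exception message contains this signal; it only appears when the recorded steps are compared with each other.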
The triage loop
┌─────────────────────────────────────────────────────┐
│ Agent.run(task) │
│ │
│ while True: │
│ try: │
│ result = await fn(task, record_step=..., ─────┼──▶ your agent fn
│ update_state=...) │ calls record_step()
│ await drain_checkpoints() │ and update_state()
│ return result ◀── success │
│ │
│ except Exception: │
│ failure_type = classifier.classify(trajectory) │
│ ctx = FailureContext(failure_type, trajectory, │
│ attempt_history, ...) │
│ action = await policy.dispatch(ctx) │
│ kwargs = execute_action(action) ───────────────┼──▶ RETRY / REPLAN /
│ attempt += 1 │ ROLLBACK / RESUME /
└─────────────────────────────────────────────────────┘ ESCALATE / ABORT
Step 1: Record steps
Your agent calls record_step(Step(...)) for each observable action. triage injects the callback — nothing to import. Optionally call update_state(dict) to persist data into checkpoints.
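A sketch of what an agent function might look like under these rules. It assumes both callbacks are plain synchronous callables and uses a hypothetical `Step` stand-in with made-up fields; the small harness at the bottom only plays the role of the runner so the example is self-contained.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    """Hypothetical stand-in; the real Step model and its fields may differ."""
    action: str
    output: str


async def my_agent(
    task: str,
    record_step: Callable[[Step], None],
    update_state: Callable[[dict[str, Any]], None],
) -> str:
    # Record each observable action as it happens.
    record_step(Step(action="search", output=f"results for {task!r}"))
    # Persist data that a later checkpoint should contain.
    update_state({"queries_done": 1})
    record_step(Step(action="summarize", output="summary text"))
    return "summary text"


# Stand-in harness: injects both callbacks the way a runner would.
trajectory: list[Step] = []
state: dict[str, Any] = {}
result = asyncio.run(my_agent("llm agents", trajectory.append, state.update))
```

The agent never imports the callbacks; it only accepts them as keyword arguments.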
Step 2: Classify the failure
When your agent raises, triage runs the classifier over the recorded trajectory (in a thread, so the event loop is never blocked) and returns one FailureType from 10 possible values.
The default RulesClassifier is pattern-based and makes zero API calls. For ambiguous failures, LLMClassifier or HybridClassifier can be used.
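The following sketch shows the general shape of a synchronous, pattern-based classifier called from async code. It is not the library's RulesClassifier: the `FailureType` members, the regex rules, and the list-of-strings trajectory are all assumptions made for illustration, and it uses the standard library's `asyncio.to_thread` in place of whatever threading helper the project actually uses.

```python
import asyncio
import re
from enum import Enum


class FailureType(Enum):
    """Hypothetical subset; the real enum has 10 values whose names may differ."""
    SCHEMA_MISMATCH = "schema_mismatch"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


# Ordered (pattern, failure type) rules; the first match wins.
RULES = [
    (re.compile(r"JSONDecodeError|validation error", re.IGNORECASE), FailureType.SCHEMA_MISMATCH),
    (re.compile(r"timed out|timeout", re.IGNORECASE), FailureType.TIMEOUT),
]


class ToyRulesClassifier:
    """Illustrative classifier: plain `def`, no network calls."""

    def classify(self, trajectory: list[str]) -> FailureType:
        # Scan step outputs from newest to oldest and return the first rule hit.
        for output in reversed(trajectory):
            for pattern, failure_type in RULES:
                if pattern.search(output):
                    return failure_type
        return FailureType.UNKNOWN


async def classify_off_loop(classifier: ToyRulesClassifier, trajectory: list[str]) -> FailureType:
    # Run the synchronous classifier in a worker thread so the event loop stays free.
    return await asyncio.to_thread(classifier.classify, trajectory)
```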
Step 3: Dispatch to a strategy
The FailurePolicy maps each FailureType to a strategy callable. The strategy receives a FailureContext (trajectory, failure type, attempt history, checkpoint ID) and returns a RecoveryAction.
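One way to picture the dispatch step is a dictionary from failure type to strategy. Everything below is a sketch under assumptions: `ToyPolicy`, the two strategies, the `FailureType` members, and the exact `FailureContext` fields are illustrative stand-ins, and strategies are assumed to be async callables. Only the six action names are taken from this page.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Awaitable, Callable, Optional


class FailureType(Enum):
    """Hypothetical subset of the real enum."""
    LOOP = "loop"
    UNKNOWN = "unknown"


class RecoveryAction(Enum):
    RETRY = "retry"
    REPLAN = "replan"
    ROLLBACK = "rollback"
    RESUME = "resume"
    ESCALATE = "escalate"
    ABORT = "abort"


@dataclass
class FailureContext:
    """Stand-in context; field names are assumptions."""
    failure_type: FailureType
    trajectory: list
    attempt_history: list = field(default_factory=list)
    checkpoint_id: Optional[str] = None


Strategy = Callable[[FailureContext], Awaitable[RecoveryAction]]


async def replan_then_escalate(ctx: FailureContext) -> RecoveryAction:
    # Try a new plan twice, then hand the failure upward.
    if len(ctx.attempt_history) < 2:
        return RecoveryAction.REPLAN
    return RecoveryAction.ESCALATE


async def always_abort(ctx: FailureContext) -> RecoveryAction:
    return RecoveryAction.ABORT


class ToyPolicy:
    def __init__(self, strategies: dict[FailureType, Strategy], default: Strategy) -> None:
        self.strategies = strategies
        self.default = default

    async def dispatch(self, ctx: FailureContext) -> RecoveryAction:
        # Look up the strategy for this failure type, falling back to the default.
        strategy = self.strategies.get(ctx.failure_type, self.default)
        return await strategy(ctx)
```

Because a strategy only returns an action, it can be unit-tested without running an agent.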
Step 4: Execute the recovery
triage executes the action and re-runs your agent with injected context:
| Action | What happens |
|---|---|
| RETRY | Re-runs the agent; injects _triage_hint |
| REPLAN | Re-runs the agent; injects _triage_hint with new plan instruction |
| ROLLBACK | Restores trajectory + state from checkpoint; injects _triage_hint and _triage_state |
| RESUME | Re-runs the agent; injects _triage_subgoal |
| ESCALATE | Raises TriageEscalationError |
| ABORT | Raises TriageAbortError |
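The table can be read as a function from action to injected keyword arguments. The sketch below is a simplified illustration, not the library's code: `kwargs_for_action` and its parameters are hypothetical, the two exception classes are local stand-ins sharing the names above, and restoring the trajectory from a checkpoint is left out.

```python
from enum import Enum
from typing import Any, Optional


class RecoveryAction(Enum):
    RETRY = "retry"
    REPLAN = "replan"
    ROLLBACK = "rollback"
    RESUME = "resume"
    ESCALATE = "escalate"
    ABORT = "abort"


class TriageEscalationError(Exception):
    """Local stand-in for the library's exception of the same name."""


class TriageAbortError(Exception):
    """Local stand-in for the library's exception of the same name."""


def kwargs_for_action(
    action: RecoveryAction,
    hint: str = "",
    checkpoint_state: Optional[dict] = None,
    subgoal: str = "",
) -> dict[str, Any]:
    """Return the extra kwargs for the next agent run, or raise for terminal actions."""
    if action is RecoveryAction.ESCALATE:
        raise TriageEscalationError("recovery escalated")
    if action is RecoveryAction.ABORT:
        raise TriageAbortError("recovery aborted")
    if action in (RecoveryAction.RETRY, RecoveryAction.REPLAN):
        return {"_triage_hint": hint}
    if action is RecoveryAction.ROLLBACK:
        # Copy the checkpointed state so the agent cannot mutate the checkpoint.
        return {"_triage_hint": hint, "_triage_state": dict(checkpoint_state or {})}
    if action is RecoveryAction.RESUME:
        return {"_triage_subgoal": subgoal}
    raise ValueError(f"unhandled action: {action!r}")
```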
Separation of concerns
- Strategies declare intent: a RecoveryAction.ROLLBACK says that the agent should roll back; it does not perform the rollback
- agent.py executes it: all state mutation, checkpoint loading, and kwargs injection happen in one place
- Classifiers are synchronous: classify() is a plain def, not async def. It runs in a thread when called from the async loop
- No framework imports in core: triage/ only imports anyio, pydantic, and stdlib. Adapters live in triage/adapters/