How It Works

The problem with if/else

With a normal program, exceptions carry enough information to branch on:

try:
    response = call_api(user_input)  # placeholder for any ordinary call
except ValueError:
    handle_bad_input()
except HTTPError as e:
    if e.status == 429:
        retry()

With an LLM agent, the same RuntimeError("something went wrong") can mean ten different things depending on what the agent was doing, what it called, and what it said before failing. The failure lives in the trajectory, not the exception:

  • A loop isn't an exception — it's three identical steps
  • A hallucination isn't an exception — it's the LLM asserting something the tools didn't return
  • A schema mismatch might surface as a ValueError, a JSONDecodeError, or a model-specific error string

if/else on the exception string misses most of this. It also collapses distinct failures into one code path and has to be reimplemented in every agent.
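To make the first bullet concrete, here is a minimal sketch of detecting a loop from the trajectory instead of from the exception. The Step fields and the looks_like_loop helper are illustrative assumptions, not triage's actual types or rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    # Hypothetical stand-in for a recorded step; these field names are assumptions.
    tool: str
    args: str
    output: str


def looks_like_loop(trajectory: list[Step], window: int = 3) -> bool:
    """Return True when the last `window` recorded steps are identical."""
    if len(trajectory) < window:
        return False
    tail = trajectory[-window:]
    return all(step == tail[0] for step in tail)


trajectory = [
    Step("search", "q=weather", "no results"),
    Step("search", "q=weather", "no results"),
    Step("search", "q=weather", "no results"),
]
print(looks_like_loop(trajectory))
```

No exception type or message appears anywhere in the check: the signal is entirely in what the agent did.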

The triage loop

┌─────────────────────────────────────────────────────┐
│                    Agent.run(task)                   │
│                                                      │
│  while True:                                         │
│    try:                                              │
│      result = await fn(task, record_step=...,  ─────┼──▶ your agent fn
│                         update_state=...)            │    calls record_step()
│      await drain_checkpoints()                       │    and update_state()
│      return result          ◀── success              │
│                                                      │
│    except Exception:                                 │
│      failure_type = classifier.classify(trajectory)  │
│      ctx = FailureContext(failure_type, trajectory,  │
│                           attempt_history, ...)      │
│      action = await policy.dispatch(ctx)             │
│      kwargs = execute_action(action)  ───────────────┼──▶ RETRY / REPLAN /
│      attempt += 1                                    │    ROLLBACK / RESUME /
└─────────────────────────────────────────────────────┘    ESCALATE / ABORT

Step 1 — Record steps

Your agent calls record_step(Step(...)) for each observable action. triage injects the callback — nothing to import. Optionally call update_state(dict) to persist data into checkpoints.
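A rough sketch of what an agent function might look like from the inside. It assumes that both callbacks are plain synchronous callables passed as keyword arguments (as the diagram suggests) and uses a stand-in Step class, since the real Step fields are not listed on this page. The small harness at the bottom only imitates the injection so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Step:
    # Stand-in for the library's Step; the real fields may differ.
    name: str
    detail: str


async def my_agent(task, record_step, update_state, **kwargs):
    # Record each observable action as it happens.
    record_step(Step("plan", f"planning for: {task}"))
    # Persist data that a later checkpoint should carry.
    update_state({"plan": ["search", "summarize"]})
    record_step(Step("search", "called the search tool"))
    return "done"


# Imitation of the injection, for illustration only.
trajectory = []
state = {}
result = asyncio.run(
    my_agent("find docs", record_step=trajectory.append, update_state=state.update)
)
print(result, len(trajectory), state)
```

Accepting **kwargs leaves room for the recovery keys that later attempts may inject.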

Step 2 — Classify the failure

When your agent raises, triage runs the classifier over the recorded trajectory (in a thread, so the event loop is never blocked) and returns one FailureType from 10 possible values.

The default RulesClassifier is pattern-based and makes zero API calls. For ambiguous failures, LLMClassifier or HybridClassifier can be used.
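The sketch below illustrates the shape of this step: a synchronous, pattern-based classify() called from async code through a worker thread. TinyRulesClassifier, its single rule, and the two FailureType members are assumptions made for the example. It uses the standard library's asyncio.to_thread to stay dependency-free; triage's own threading call may differ.

```python
import asyncio
from enum import Enum


class FailureType(Enum):
    # Two illustrative members only; the member names are assumptions.
    LOOP = "loop"
    UNKNOWN = "unknown"


class TinyRulesClassifier:
    """Hypothetical rules-based classifier: a plain def, no API calls."""

    def classify(self, trajectory: list[str]) -> FailureType:
        # Rule: the last three steps are identical.
        if len(trajectory) >= 3 and len(set(trajectory[-3:])) == 1:
            return FailureType.LOOP
        return FailureType.UNKNOWN


async def classify_off_loop(classifier, trajectory):
    # Run the synchronous classifier in a worker thread so the
    # event loop stays free while it works.
    return await asyncio.to_thread(classifier.classify, trajectory)


failure = asyncio.run(
    classify_off_loop(TinyRulesClassifier(), ["search", "search", "search"])
)
print(failure)
```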

Step 3 — Dispatch to a strategy

The FailurePolicy maps each FailureType to a strategy callable. The strategy receives a FailureContext (trajectory, failure type, attempt history, checkpoint ID) and returns a RecoveryAction.
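As an illustration of the mapping, the following sketch dispatches on failure type with a dictionary of strategy callables. SketchPolicy, the strategy functions, the FailureType member names, and the choice to make strategies async are all assumptions. Only the six RecoveryAction names and the FailureContext contents come from this page.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class FailureType(Enum):
    LOOP = auto()        # assumed member name
    TOOL_ERROR = auto()  # assumed member name


class RecoveryAction(Enum):
    RETRY = auto()
    REPLAN = auto()
    ROLLBACK = auto()
    RESUME = auto()
    ESCALATE = auto()
    ABORT = auto()


@dataclass
class FailureContext:
    failure_type: FailureType
    trajectory: list = field(default_factory=list)
    attempt_history: list = field(default_factory=list)
    checkpoint_id: Optional[str] = None


async def replan_on_loop(ctx: FailureContext) -> RecoveryAction:
    return RecoveryAction.REPLAN


async def retry_then_escalate(ctx: FailureContext) -> RecoveryAction:
    # Retry twice, then hand the problem to a human.
    if len(ctx.attempt_history) < 2:
        return RecoveryAction.RETRY
    return RecoveryAction.ESCALATE


class SketchPolicy:
    """Hypothetical policy: look up the strategy for a failure type and call it."""

    def __init__(self, strategies, default):
        self.strategies = strategies
        self.default = default

    async def dispatch(self, ctx: FailureContext) -> RecoveryAction:
        strategy = self.strategies.get(ctx.failure_type, self.default)
        return await strategy(ctx)


policy = SketchPolicy({FailureType.LOOP: replan_on_loop}, default=retry_then_escalate)
action = asyncio.run(policy.dispatch(FailureContext(FailureType.LOOP)))
print(action)
```

The strategies only return a value; nothing in them touches state or re-runs the agent.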

Step 4 — Execute the recovery

triage executes the action and re-runs your agent with injected context:

Action     What happens
RETRY      Re-runs the agent; injects _triage_hint
REPLAN     Re-runs the agent; injects _triage_hint with a new plan instruction
ROLLBACK   Restores trajectory + state from checkpoint; injects _triage_hint and _triage_state
RESUME     Re-runs the agent; injects _triage_subgoal
ESCALATE   Raises TriageEscalationError
ABORT      Raises TriageAbortError
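The action semantics listed above can be sketched as a single function that turns an action into the keyword arguments for the next run. This is an approximation under stated assumptions: the function signature and the way the hint, subgoal, and checkpoint state are supplied are hypothetical, and the two exception classes are local stand-ins named after the ones in the table.

```python
from enum import Enum, auto


class RecoveryAction(Enum):
    RETRY = auto()
    REPLAN = auto()
    ROLLBACK = auto()
    RESUME = auto()
    ESCALATE = auto()
    ABORT = auto()


class TriageEscalationError(Exception):
    """Local stand-in for the library's exception of the same name."""


class TriageAbortError(Exception):
    """Local stand-in for the library's exception of the same name."""


def execute_action(action, hint="", subgoal="", checkpoint_state=None):
    """Return the kwargs to inject into the next run, or raise to stop the loop."""
    if action is RecoveryAction.ESCALATE:
        raise TriageEscalationError(hint)
    if action is RecoveryAction.ABORT:
        raise TriageAbortError(hint)
    if action is RecoveryAction.RESUME:
        return {"_triage_subgoal": subgoal}
    # RETRY, REPLAN, and ROLLBACK all carry a hint.
    kwargs = {"_triage_hint": hint}
    if action is RecoveryAction.ROLLBACK:
        # Hand the restored checkpoint state back to the agent.
        kwargs["_triage_state"] = dict(checkpoint_state or {})
    return kwargs


print(execute_action(RecoveryAction.ROLLBACK, hint="tool failed", checkpoint_state={"page": 2}))
```

An agent function that accepts **kwargs can then read whichever of these keys is present on the next attempt.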

Separation of concerns

  • Strategies declare intent: returning RecoveryAction.ROLLBACK requests a rollback; the strategy itself does not perform it
  • agent.py executes it: all state mutation, checkpoint loading, and kwargs injection happen in one place
  • Classifiers are synchronous: classify() is a plain def, not async def. It runs in a thread when called from the async loop
  • No framework imports in core: triage/ only imports anyio, pydantic, and the stdlib. Adapters live in triage/adapters/