Policies & Recovery Actions
A FailurePolicy maps each FailureType to a strategy — an async callable that decides what to do next. The strategy returns a RecoveryAction, and Agent executes it.
Declaring a policy
import triage
from triage.strategies.retry import retry_with_tool_manifest, backoff_and_retry
from triage.strategies.replan import replan, resume_from_subgoal
from triage.strategies.rollback import rollback_to_checkpoint

policy = triage.FailurePolicy(
    WRONG_TOOL_CALLED=retry_with_tool_manifest(max_attempts=3),
    SCHEMA_MISMATCH=retry_with_tool_manifest(max_attempts=2),
    EXTERNAL_FAULT=backoff_and_retry(max_attempts=5),
    LOOP_DETECTED=replan(hint="Try a different approach."),
    CONSTRAINT_IGNORED=replan(hint="Re-read the constraints carefully."),
    HALLUCINATED_STATE=rollback_to_checkpoint(),
    PLAN_INCOMPLETE=resume_from_subgoal(),
    default=triage.FailurePolicy.escalate_by_default(),
)
Any FailureType not listed falls through to default. If default is unset, triage escalates automatically.
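The fall-through rule can be pictured as a dictionary lookup with two fallbacks. The sketch below is a stand-alone model and not triage's actual implementation: resolve_strategy, the string keys, and the placeholder strategies are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-ins: real strategies are async callables that
# return a RecoveryAction. Plain functions keep the sketch short.
Strategy = Callable[[], str]


def escalate() -> str:
    return "ESCALATE"


def retry() -> str:
    return "RETRY"


def abort() -> str:
    return "ABORT"


def resolve_strategy(
    strategies: Dict[str, Strategy],
    failure_type: str,
    default: Optional[Strategy] = None,
) -> Strategy:
    """Pick the listed strategy, else the default, else escalate."""
    if failure_type in strategies:
        return strategies[failure_type]
    if default is not None:
        return default
    return escalate  # no default configured: escalate automatically


strategies = {"EXTERNAL_FAULT": retry}
print(resolve_strategy(strategies, "EXTERNAL_FAULT")())        # listed type
print(resolve_strategy(strategies, "LOOP_DETECTED")())         # unlisted, no default
print(resolve_strategy(strategies, "LOOP_DETECTED", abort)())  # unlisted, with default
```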
RecoveryAction
RecoveryAction is the return type of every strategy. It declares intent — Agent executes it.
RETRY
Re-runs the agent. Optionally injects a hint or delays before retrying.
RecoveryAction.RETRY()
RecoveryAction.RETRY(hint="The correct tool name is 'search_web'.")
RecoveryAction.RETRY(delay=2.0)
| Parameter | Type | Description |
|---|---|---|
| hint | str \| None | Injected as _triage_hint kwarg on the next call |
| delay | float | Seconds to wait before retrying (default 0.0) |
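To make the two parameters concrete, here is a rough model of what executing a RETRY could look like: wait for delay seconds, add the hint to the kwargs, and call the agent again. The function execute_retry and the echo_agent used to exercise it are hypothetical; how Agent does this internally is not shown in this guide.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional


async def execute_retry(
    agent: Callable[..., Awaitable[Any]],
    kwargs: Dict[str, Any],
    hint: Optional[str] = None,
    delay: float = 0.0,
) -> Any:
    """Hypothetical model of a RETRY: optional wait, optional hint, re-run."""
    if delay > 0:
        await asyncio.sleep(delay)
    call_kwargs = dict(kwargs)  # copy so the caller's kwargs stay untouched
    if hint is not None:
        call_kwargs["_triage_hint"] = hint
    return await agent(**call_kwargs)


async def echo_agent(**kwargs: Any) -> Dict[str, Any]:
    # Stand-in agent that simply reports the kwargs it received.
    return kwargs


result = asyncio.run(
    execute_retry(echo_agent, {"task": "search"}, hint="Use 'search_web'.")
)
print(result)
```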
REPLAN
Aborts the current plan branch and restarts with a new planning instruction.
| Parameter | Type | Description |
|---|---|---|
| hint | str \| None | Injected as _triage_hint; defaults to "Generate a new plan." |
ROLLBACK
Restores trajectory and state from a checkpoint, then re-runs.
RecoveryAction.ROLLBACK() # uses latest checkpoint
RecoveryAction.ROLLBACK(checkpoint_id="abc-123") # specific checkpoint
On execution, Agent loads the checkpoint and injects:
- _triage_hint — "Rolled back to checkpoint '<id>'."
- _triage_state — the checkpoint's state dict (only if non-empty)
Requires a checkpoint_store to be configured. If no checkpoint is available, triage escalates.
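The injection rule above (hint always, state only when non-empty) can be sketched as a small helper. rollback_kwargs is a hypothetical name for illustration; it models the described behaviour and is not taken from triage's source.

```python
from typing import Any, Dict, Optional


def rollback_kwargs(
    checkpoint_id: str, state: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Build the extra kwargs a ROLLBACK re-run would receive."""
    injected: Dict[str, Any] = {
        "_triage_hint": f"Rolled back to checkpoint '{checkpoint_id}'.",
    }
    if state:  # an empty or missing state dict is not injected
        injected["_triage_state"] = state
    return injected


print(rollback_kwargs("abc-123", {"step": 4}))
print(rollback_kwargs("abc-123", {}))
```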
RESUME
Continues execution from a named sub-goal.
| Parameter | Type | Description |
|---|---|---|
| from_subgoal | str \| None | Injected as _triage_subgoal kwarg |
ESCALATE
Halts autonomous execution and raises TriageEscalationError.
The calling code can catch TriageEscalationError and route to a human review queue.
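A minimal sketch of that pattern follows. The exception class is redefined locally as a stand-in because its import path is not shown here, and run_task, run_with_review, and review_queue are hypothetical names; swap in your real agent call and review system.

```python
import queue
from typing import Optional


class TriageEscalationError(Exception):
    """Local stand-in; in real code, use the exception raised by triage."""


# Hypothetical in-process queue standing in for a human review system.
review_queue: "queue.Queue[dict]" = queue.Queue()


def run_task(task_id: str) -> str:
    # Hypothetical agent call that always escalates, for demonstration.
    raise TriageEscalationError(f"Task {task_id} needs human review")


def run_with_review(task_id: str) -> Optional[str]:
    """Run the task; on escalation, enqueue it for a human and return None."""
    try:
        return run_task(task_id)
    except TriageEscalationError as exc:
        review_queue.put({"task_id": task_id, "reason": str(exc)})
        return None
```

Returning None here is one design choice among several; re-raising after enqueueing would work equally well if callers need to know that the run stopped.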
ABORT
Hard stop. Raises TriageAbortError immediately with no further recovery.
Built-in strategies
Import from triage.strategies.*:
retry strategies
retry_with_tool_manifest(max_attempts=3)
Retries and injects a hint reminding the agent to use tools from the manifest. Escalates after max_attempts.
backoff_and_retry(max_attempts=5, base_delay=1.0)
Exponential backoff retry. Delay doubles with each attempt: base_delay * 2^n. Escalates after max_attempts.
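As a worked example of the formula, the helper below lists the delay before each retry. It assumes n starts at 0 for the first retry, which the text above does not state, and backoff_delays is a hypothetical name rather than part of triage.

```python
from typing import List


def backoff_delays(max_attempts: int = 5, base_delay: float = 1.0) -> List[float]:
    """Delay in seconds before each retry: base_delay * 2**n, n from 0 (assumed)."""
    return [base_delay * 2 ** n for n in range(max_attempts)]


print(backoff_delays())                # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(3, base_delay=0.5))  # [0.5, 1.0, 2.0]
```

Under that assumption, the defaults add up to 31 seconds of total waiting across five attempts.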
replan strategies
replan(hint=None, max_replans=3)
Returns REPLAN with the given hint. Escalates if the same failure type has already been replanned max_replans times (checked via attempt_history).
resume_from_subgoal()
Returns RESUME with the last incomplete sub-goal from the trajectory.
rollback strategies
rollback_to_checkpoint()
Returns ROLLBACK pointing to the latest available checkpoint.
Writing a custom strategy
A strategy is any async function that takes a FailureContext and returns a RecoveryAction:
import triage
from triage.taxonomy import FailureContext, FailureType
from triage.policy import RecoveryAction

async def smart_external_fault(ctx: FailureContext) -> RecoveryAction:
    # Count earlier EXTERNAL_FAULT entries in the attempt history.
    external_faults = sum(
        1 for ft, _ in ctx.attempt_history
        if ft == FailureType.EXTERNAL_FAULT
    )
    if external_faults >= 3:
        return RecoveryAction.ESCALATE(
            message="External service unavailable after 3 retries."
        )
    # Exponential delay based on the total number of recorded attempts.
    delay = 2.0 ** len(ctx.attempt_history)
    return RecoveryAction.RETRY(delay=delay)

policy = triage.FailurePolicy(
    EXTERNAL_FAULT=smart_external_fault,
    default=triage.FailurePolicy.escalate_by_default(),
)
Policy defaults
FailurePolicy provides two convenience defaults:
# Escalate to human on any unhandled failure
default=triage.FailurePolicy.escalate_by_default()
# Hard stop on any unhandled failure
default=triage.FailurePolicy.abort_by_default()
If default is None and a failure type has no strategy, triage escalates automatically.
Kwargs injected by Agent
When Agent executes a recovery action, it injects context into the next call's kwargs:
| Key | Set by | Value |
|---|---|---|
| _triage_hint | RETRY, REPLAN, ROLLBACK | Instruction string for the agent |
| _triage_subgoal | RESUME | Sub-goal string to resume from |
| _triage_state | ROLLBACK (non-empty state) | Dict restored from the checkpoint |
Your agent should accept **kwargs and check for these:
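A minimal sketch of such an agent, assuming a plain synchronous function for brevity. The name my_agent, its task parameter, and the way the values are folded into a prompt are all hypothetical; only the three _triage_* keys come from the table above.

```python
from typing import Any


def my_agent(task: str, **kwargs: Any) -> str:
    """Hypothetical agent that folds triage's injected kwargs into its prompt."""
    hint = kwargs.get("_triage_hint")        # set by RETRY, REPLAN, ROLLBACK
    subgoal = kwargs.get("_triage_subgoal")  # set by RESUME
    state = kwargs.get("_triage_state")      # set by ROLLBACK with non-empty state

    parts = [task]
    if hint:
        parts.append(f"Hint: {hint}")
    if subgoal:
        parts.append(f"Resume from: {subgoal}")
    if state:
        parts.append(f"Restored state keys: {sorted(state)}")
    return "\n".join(parts)


# First run: no recovery context has been injected.
print(my_agent("Summarise the report"))
# After a RETRY carrying a hint.
print(my_agent("Summarise the report", _triage_hint="Use 'search_web'."))
```

Using kwargs.get keeps the agent working on the first run, when none of the keys are present.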