Checkpoints
Checkpoints let triage restore your agent to a known-good state when a ROLLBACK action fires, before the agent is retried. Each checkpoint is a snapshot of the agent's trajectory plus any state explicitly saved via update_state().
What a checkpoint contains
@dataclass
class Checkpoint:
    id: str                          # UUID
    timestamp: float                 # Unix timestamp
    state: dict[str, Any]            # data saved by update_state()
    trajectory_snapshot: list[Step]  # steps recorded up to this point
Enabling checkpoints
Manual checkpoints
Call update_state() to persist data, and configure a checkpoint_store on the agent. With auto_checkpoint=True, triage saves a checkpoint to the store automatically each time record_step() is called.
Auto-checkpoint
import triage
from triage.checkpoint import InMemoryCheckpointStore

store = InMemoryCheckpointStore()
agent = triage.Agent(
    my_agent,
    policy=policy,
    checkpoint_store=store,
    auto_checkpoint=True,  # saves a checkpoint after every record_step() call
)
With auto_checkpoint=True, triage queues a checkpoint save after each record_step() call and drains the queue before executing any recovery action. This guarantees a checkpoint is always available when ROLLBACK runs.
Saving agent state
Call update_state() from your agent to persist data into checkpoints:
async def my_agent(task: str, *, record_step, update_state, **kwargs):
    data = fetch_data(task)
    record_step(Step(index=0, action="fetch", tool_output=data))
    update_state({"data": data, "step": 0})  # saved into the next checkpoint
    return process(data)
On ROLLBACK, triage restores _current_state and injects _triage_state into the next call's kwargs:
async def my_agent(task: str, *, record_step, update_state, **kwargs):
    state = kwargs.get("_triage_state", {})
    if state:
        # skip re-fetching: triage restored the data from the checkpoint
        data = state["data"]
    else:
        data = fetch_data(task)
    return process(data)
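The restored-state branch can be exercised without running triage at all, by calling the agent function directly and passing _triage_state yourself. The sketch below is a condensed illustration only: fetch_data, process, and run_agent are hypothetical stand-ins, and the record_step and update_state stubs are not triage's real callbacks.

```python
import asyncio

calls = {"fetch": 0}  # counts how often the (hypothetical) fetch runs


def fetch_data(task: str) -> str:
    # Stand-in for an expensive fetch; not part of triage.
    calls["fetch"] += 1
    return f"payload for {task}"


def process(data: str) -> str:
    # Stand-in for the agent's real work.
    return data.upper()


async def my_agent(task: str, *, record_step, update_state, **kwargs):
    state = kwargs.get("_triage_state", {})
    if state:
        data = state["data"]  # restored state: skip the fetch
    else:
        data = fetch_data(task)
        update_state({"data": data, "step": 0})
    return process(data)


def run_agent(task: str, **extra):
    # Stub callbacks: collect what the agent saves instead of persisting it.
    saved = []
    result = asyncio.run(
        my_agent(task, record_step=lambda step: None,
                 update_state=saved.append, **extra)
    )
    return result, saved


# First run fetches; second run is handed the saved state, as after a ROLLBACK.
fresh_result, fresh_saved = run_agent("report")
resumed_result, resumed_saved = run_agent("report", _triage_state=fresh_saved[0])
```

Both runs should produce the same result, with fetch_data called only once, which is the property the _triage_state check is meant to provide.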
Storage backends
InMemoryCheckpointStore
Default. Holds checkpoints in a Python list. Lost when the process exits.
No extra dependencies. Not concurrency-safe across multiple Agent.run() calls on the same store instance.
SQLiteCheckpointStore
Persists checkpoints to a SQLite database file. Survives process restarts.
from triage.checkpoint.sqlite import SQLiteCheckpointStore
store = SQLiteCheckpointStore("checkpoints.db")
agent = triage.Agent(my_agent, policy=policy, checkpoint_store=store, auto_checkpoint=True)
| Detail | Value |
|---|---|
| Dependency | aiosqlite>=0.19 |
| Table | checkpoints (id TEXT PK, timestamp REAL, state TEXT, trajectory TEXT) |
| latest() | SELECT ... ORDER BY timestamp DESC LIMIT 1 |
| Connection | New connection per operation — simple and safe |
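Do not use ":memory:" as the path. The store opens a new connection for every operation, and each aiosqlite.connect(":memory:") call opens a fresh, empty database, so a checkpoint saved through one connection is invisible to the next. Use a real file path.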
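The effect of connection-per-operation on ":memory:" can be reproduced with the standard library alone. The sketch below uses sqlite3 rather than aiosqlite, and the save and latest_id helpers are hypothetical; it borrows only the table layout and the ORDER BY timestamp DESC LIMIT 1 query from the table above, and is not the store's actual code.

```python
import json
import os
import sqlite3
import tempfile

SCHEMA = (
    "CREATE TABLE IF NOT EXISTS checkpoints "
    "(id TEXT PRIMARY KEY, timestamp REAL, state TEXT, trajectory TEXT)"
)


def save(path: str, cp_id: str, timestamp: float, state: dict) -> None:
    # One connection per operation, as described for the SQLite backend.
    conn = sqlite3.connect(path)
    try:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?, ?)",
            (cp_id, timestamp, json.dumps(state), "[]"),
        )
        conn.commit()
    finally:
        conn.close()


def latest_id(path: str):
    # Returns the id of the newest checkpoint, or None if the table is empty.
    conn = sqlite3.connect(path)
    try:
        conn.execute(SCHEMA)
        row = conn.execute(
            "SELECT id FROM checkpoints ORDER BY timestamp DESC LIMIT 1"
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "checkpoints.db")
    save(db_path, "a", 1.0, {"step": 0})
    save(db_path, "b", 2.0, {"step": 1})
    file_latest = latest_id(db_path)  # rows persist across connections

save(":memory:", "a", 1.0, {"step": 0})
memory_latest = latest_id(":memory:")  # new connection, new empty database
```

With a file path the second connection sees the rows written by the first; with ":memory:" the saved row is gone as soon as its connection closes.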
RedisCheckpointStore
Persists checkpoints to Redis. Good for distributed agents or when you need TTL/expiration.
import redis.asyncio as aioredis
from triage.checkpoint.redis import RedisCheckpointStore
redis_client = aioredis.Redis(host="localhost", port=6379, decode_responses=True)
store = RedisCheckpointStore(redis=redis_client)
agent = triage.Agent(my_agent, policy=policy, checkpoint_store=store, auto_checkpoint=True)
| Detail | Value |
|---|---|
| Dependency | redis[asyncio]>=5.0 |
| Key scheme | triage:checkpoint:<id> for data; triage:checkpoint:index sorted set |
| save() | Pipeline transaction: SET key + ZADD index (atomic) |
| latest() | ZREVRANGE index 0 0 → load(id) |
RedisCheckpointStore uses dependency injection — pass a pre-configured Redis instance. This makes it testable with fakeredis.
Implementing a custom store
Any object satisfying the CheckpointStore protocol works:
from triage.checkpoint import CheckpointStore, Checkpoint

class MyStore:
    async def save(self, checkpoint: Checkpoint) -> None: ...
    async def load(self, id: str) -> Checkpoint: ...
    async def latest(self) -> Checkpoint | None: ...
All three methods must be async def. latest() returns None if no checkpoints have been saved.
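As a concrete illustration, here is a minimal dict-backed store with the three methods filled in. It is a sketch under assumptions: DictStore is a hypothetical name, and the Checkpoint class is a local stand-in that mirrors the documented fields so the example runs without triage installed. In real code, import Checkpoint from triage.checkpoint instead.

```python
import asyncio
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Checkpoint:
    # Local stand-in mirroring the documented fields (assumption).
    id: str
    timestamp: float
    state: dict[str, Any]
    trajectory_snapshot: list = field(default_factory=list)


class DictStore:
    """Keeps checkpoints in a dict keyed by id; latest() picks the highest timestamp."""

    def __init__(self) -> None:
        self._items: dict[str, Checkpoint] = {}

    async def save(self, checkpoint: Checkpoint) -> None:
        self._items[checkpoint.id] = checkpoint

    async def load(self, id: str) -> Checkpoint:
        return self._items[id]  # raises KeyError for an unknown id

    async def latest(self) -> Optional[Checkpoint]:
        if not self._items:
            return None
        return max(self._items.values(), key=lambda cp: cp.timestamp)


async def demo():
    store = DictStore()
    empty = await store.latest()  # nothing saved yet
    first = Checkpoint(id=str(uuid.uuid4()), timestamp=1.0, state={"step": 0})
    second = Checkpoint(id=str(uuid.uuid4()), timestamp=2.0, state={"step": 1})
    await store.save(first)
    await store.save(second)
    return empty, await store.latest(), await store.load(first.id)


empty, newest, loaded = asyncio.run(demo())
```

How load() should behave for an unknown id is not specified above; raising KeyError here is simply one choice.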
Serialization
SQLite and Redis backends serialize checkpoints to JSON. The serializer uses a _safe_json() fallback that converts non-serializable values to their repr() string. This means complex objects in state or tool_output are stored lossily. For full fidelity, store only JSON-native types (strings, numbers, lists, dicts) in update_state().
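The lossy behaviour is easy to see with the standard library's json.dumps(..., default=repr), which applies repr() only to values JSON cannot encode. This is a sketch of the general technique, not triage's code: safe_json is a hypothetical name and the real _safe_json() may differ in detail.

```python
import json
from datetime import date


def safe_json(value) -> str:
    # default= is invoked only for objects json cannot encode natively.
    return json.dumps(value, default=repr)


state = {"count": 3, "when": date(2024, 1, 2), "tags": {"a"}}
restored = json.loads(safe_json(state))

# JSON-native values round-trip unchanged; everything else comes back as a string.
count_ok = restored["count"] == 3
when_text = restored["when"]
tags_text = restored["tags"]
```

After the round trip, the date and the set are plain strings containing their repr(), so code that reads restored state cannot treat them as the original objects.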
The _pending_checkpoints drain pattern
Under auto_checkpoint=True, record_step() queues a coroutine rather than immediately awaiting it. triage drains the queue at three points in Agent.run():
- After a successful agent return, before returning the result
- Before classifying a failure, so the checkpoint is available for ROLLBACK
- Before re-raising TriageEscalationError or TriageAbortError
This keeps record_step() synchronous (no async def) while guaranteeing checkpoints are always committed before any policy action runs.
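The queue-then-drain idea can be sketched in a few lines of asyncio. Everything here is illustrative: Recorder, pending, saved, and drain are hypothetical names, and the list stands in for a real checkpoint store. It shows only how a synchronous record_step() can defer async saves until a later drain point.

```python
import asyncio


class Recorder:
    def __init__(self) -> None:
        self.pending = []  # coroutines queued by record_step()
        self.saved = []    # stands in for the checkpoint store

    async def _save(self, step: str) -> None:
        await asyncio.sleep(0)  # placeholder for real store I/O
        self.saved.append(step)

    def record_step(self, step: str) -> None:
        # Synchronous: create the coroutine but do not await it.
        self.pending.append(self._save(step))

    async def drain(self) -> None:
        # Swap out the queue, then await the queued saves in order.
        pending, self.pending = self.pending, []
        for coro in pending:
            await coro


async def run():
    rec = Recorder()
    rec.record_step("fetch")
    rec.record_step("parse")
    before = list(rec.saved)  # nothing committed until drain() runs
    await rec.drain()
    return before, rec.saved, rec.pending


before, after, leftover = asyncio.run(run())
```

Awaiting the saves sequentially preserves step order; a real implementation could make a different trade-off.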