Aetheris
An AI agent execution runtime with event sourcing, checkpoint recovery, and an at-most-once execution guarantee. Written in Go.
⭐ The Missing Layer for Production-Ready AI Agents
Aetheris is a durable, replayable execution runtime — the "Temporal for Agents" that your production AI systems desperately need.
🤔 Why Aetheris?
Your AI agent worked in testing. But production is different.
❌ Worker crashed → Restart from beginning
❌ Tool called twice → Duplicate payments
❌ Need to audit AI decisions → No trace
❌ Agent waiting for approval → Wastes resources
❌ Need to replay failed run → Impossible
Go agent frameworks (LangChainGo, LangGraphGo, ADK) build agents. Aetheris runs them in production.
🎯 What is Aetheris?
Kubernetes for AI Agents
Aetheris manages agents — providing durability, reliability, and observability for production systems.
It's not:
- ❌ Chatbot framework
- ❌ Prompt library
- ❌ RAG system
- ❌ Another way to write agents
It is:
- ✅ Agent execution runtime — host LangChainGo/LangGraphGo/ADK agents
- ✅ Durable execution — survive crashes, resume from checkpoints
- ✅ Reliable orchestrator — at-most-once tool execution
- ✅ Auditable system — full decision history
✨ Key Features
| Feature | What It Means |
|---|---|
| At-Most-Once | Tool calls never repeat, even after crashes |
| Crash Recovery | Resume from checkpoints, not from scratch |
| Deterministic Replay | Reproduce any run for debugging |
| Human-in-the-Loop | Pause for approval, resume without waste |
| Full Audit Trail | Every decision traced |
| Multi-Framework | Plug in LangChainGo/LangGraphGo/ADK |
📊 Use Cases
| Use Case | Description |
|---|---|
| Human-in-the-Loop | Agents pause for approval, resume with full context |
| Compliance & Audit | Event-sourced traceability, replayable evidence |
| Local-First | Private cloud, air-gapped environments |
🚀 Quick Start
# Install
go install github.com/Colin4k1024/Aetheris/cmd/cli@latest
# Or use Docker
./scripts/local-2.0-stack.sh start
# Initialize
aetheris init my-agent
cd my-agent
aetheris run
# Monitor
aetheris jobs list
aetheris trace <job_id>
See Getting Started Guide for details.
🔗 Authoring Strategy
Build agents in Eino, run them on Aetheris for durability, replay, and audit.
🏗️ Architecture
flowchart LR
subgraph authoring["Authoring Layer (Eino-first)"]
einoBuild["Eino Agent Construction"]
otherFrameworks["Other Frameworks (Optional Legacy)"]
end
subgraph control["Aetheris Control Plane"]
api["API / CLI / SDK Facade"]
auth["Auth / RBAC / Audit Policy"]
end
subgraph data["Aetheris Data Plane (Runtime Core)"]
scheduler["Lease Scheduler / Worker Coordinator"]
runner["Durable Runner / Step Executor"]
toolPlane["Tool Plane (Native + MCP Host)"]
replay["Replay / Verify / Trace"]
end
subgraph storage["Durable Stores"]
eventStore["Event Store (Append-only)"]
checkpointStore["Checkpoint Store"]
effectStore["Effect + Invocation Store"]
jobStore["Job Metadata Store"]
end
authoring --> api
api --> scheduler
scheduler --> runner
runner --> toolPlane
runner --> eventStore
runner --> checkpointStore
runner --> effectStore
scheduler --> jobStore
replay --> eventStore
auth --> api
The flow: Eino authoring → Aetheris runtime submission → scheduler/runner execution → durable events/checkpoints/effects → replay/verify/audit.
Core Components
| Component | Path | Responsibility |
|---|---|---|
| API Server | cmd/api/ | HTTP server (Hertz), creates and interacts with agents |
| Worker | cmd/worker/ | Background execution worker, schedules and executes jobs |
| CLI | cmd/cli/ | Command-line tool (init, chat, jobs, trace, replay, etc.) |
| AgentFactory | internal/runtime/eino/agent_factory.go | Config-driven Eino ADK agent creation (recommended entry point) |
| Tool Bridge | internal/runtime/eino/tool_bridge.go | Converts Aetheris RuntimeTool → Eino InvokableTool |
| Eino Engine | internal/runtime/eino/engine.go | Workflow compilation, runner management |
| Agent Runtime | internal/agent/runtime/ | Core execution engine (DAG compiler + runner) |
| Job Store | internal/agent/runtime/job/ | Event-sourced durable execution history (PostgreSQL) |
| Scheduler | internal/agent/runtime/job/scheduler.go | Leases and retries tasks with lease fencing |
| Runner | internal/agent/runtime/runner/ | Step-level execution with checkpointing |
| Planner | internal/agent/planner/ | Produces TaskGraph from goals |
| Executor | internal/agent/runtime/executor/ | Executes DAG nodes using the Eino framework |
| Effects | internal/agent/effects/ | At-most-once tool execution guarantee via Ledger |
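The Scheduler row mentions lease fencing. As a rough illustration of the idea (not the Aetheris implementation), the sketch below issues a strictly increasing token with each lease and rejects writes that carry an older token. `LeaseTable`, `Acquire`, and `Commit` are hypothetical names, and the in-memory map stands in for a durable job store.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleLease is returned when a write carries an outdated fencing token.
var ErrStaleLease = errors.New("stale lease token")

// LeaseTable is a hypothetical in-memory stand-in for a job store that
// hands out leases. Each new lease on a job gets a strictly larger token.
type LeaseTable struct {
	mu     sync.Mutex
	tokens map[string]uint64 // jobID -> latest fencing token issued
}

func NewLeaseTable() *LeaseTable {
	return &LeaseTable{tokens: make(map[string]uint64)}
}

// Acquire issues a new lease and returns its fencing token.
// Any token issued earlier for the same job becomes stale.
func (t *LeaseTable) Acquire(jobID string) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tokens[jobID]++
	return t.tokens[jobID]
}

// Commit accepts a write only if it carries the latest token for the job.
func (t *LeaseTable) Commit(jobID string, token uint64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if token == 0 || token != t.tokens[jobID] {
		return ErrStaleLease
	}
	return nil
}

func main() {
	leases := NewLeaseTable()
	oldToken := leases.Acquire("job-1") // worker A takes the job
	newToken := leases.Acquire("job-1") // lease expires, worker B takes over

	fmt.Println(leases.Commit("job-1", oldToken)) // worker A is fenced off
	fmt.Println(leases.Commit("job-1", newToken)) // worker B may write
}
```

A worker that stalls past its lease and then wakes up cannot overwrite progress made by its replacement, because its token no longer matches.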
Execution Flow
User Message → API creates Job (dual-write: event stream + stateful Job)
→ Scheduler picks up pending Job
→ Runner.RunForJob: if Job.Cursor exists, restore from Checkpoint;
otherwise PlanGoal → TaskGraph → Compiler → DAG
→ Steppable executes nodes one by one
→ Each node writes Checkpoint, updates Session.LastCheckpoint and Job.Cursor
→ Recovery resumes from next node
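The resume behavior in this flow can be sketched in a few lines. This is a simplified illustration under stated assumptions: the DAG is already compiled and linearized, the `Job`, `Node`, and `RunForJob` definitions here are hypothetical stand-ins rather than the real types, and advancing `Cursor` in memory stands in for a durable checkpoint write.

```go
package main

import (
	"errors"
	"fmt"
)

// errCrash simulates a worker failure in the demo below.
var errCrash = errors.New("worker crashed")

// Job is a hypothetical, simplified job record. Cursor is the index of the
// next node to run; 0 means nothing has been checkpointed yet.
type Job struct {
	Cursor int
}

// Node is one step of an already compiled, linearized DAG.
type Node struct {
	Name string
	Run  func() error
}

// RunForJob executes nodes starting at job.Cursor and advances the cursor
// after each successful node, which stands in for writing a checkpoint.
func RunForJob(job *Job, nodes []Node) error {
	for i := job.Cursor; i < len(nodes); i++ {
		if err := nodes[i].Run(); err != nil {
			return fmt.Errorf("node %s: %w", nodes[i].Name, err)
		}
		job.Cursor = i + 1 // recovery resumes from the next node
	}
	return nil
}

func main() {
	var executed []string
	crashOnce := true
	step := func(name string) Node {
		return Node{Name: name, Run: func() error {
			if name == "act" && crashOnce {
				crashOnce = false
				return errCrash
			}
			executed = append(executed, name)
			return nil
		}}
	}
	nodes := []Node{step("plan"), step("act"), step("report")}
	job := &Job{}

	fmt.Println(RunForJob(job, nodes)) // first attempt fails at "act"
	fmt.Println(RunForJob(job, nodes)) // second attempt resumes at "act"
	fmt.Println(executed, job.Cursor)  // "plan" ran only once
}
```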
Key Concepts
| Concept | Description |
|---|---|
| Job | Durable task unit, survives worker crashes |
| Step | Single execution unit within a Job |
| Checkpoint | State snapshot after step completion, enables resume |
| Effect | External side effect record (API calls, DB writes) |
| Ledger | Tool invocation authorization ledger (guarantees at-most-once) |
| TaskGraph | Directed acyclic graph of step dependencies |
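To make the Effect and Ledger concepts concrete, here is one way an at-most-once ledger could work. It is a sketch, not the code in internal/agent/effects/: the `Ledger` type, `Invoke` method, and key format are assumptions, and a real ledger would persist entries instead of keeping them in a map. The key point is that the intent is recorded before the tool runs, so an invocation with an unknown result is never retried automatically.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrInDoubt means the tool was started earlier but no result was recorded.
// At-most-once semantics forbid running it again automatically.
var ErrInDoubt = errors.New("invocation in doubt; not re-executing")

type entry struct {
	done   bool
	result string
}

// Ledger is a hypothetical in-memory invocation ledger keyed by an
// idempotency key such as "jobID/stepID".
type Ledger struct {
	mu      sync.Mutex
	entries map[string]entry
}

func NewLedger() *Ledger {
	return &Ledger{entries: make(map[string]entry)}
}

// Invoke runs tool at most once per key. A repeated call returns the
// recorded result, or ErrInDoubt if the first attempt never finished.
func (l *Ledger) Invoke(key string, tool func() (string, error)) (string, error) {
	l.mu.Lock()
	if e, seen := l.entries[key]; seen {
		l.mu.Unlock()
		if e.done {
			return e.result, nil
		}
		return "", ErrInDoubt
	}
	l.entries[key] = entry{} // record intent before the side effect
	l.mu.Unlock()

	res, err := tool()
	if err != nil {
		return "", err // entry stays in doubt: the tool may have partly run
	}

	l.mu.Lock()
	l.entries[key] = entry{done: true, result: res}
	l.mu.Unlock()
	return res, nil
}

func main() {
	ledger := NewLedger()
	charges := 0
	charge := func() (string, error) {
		charges++ // the side effect that must not repeat
		return "receipt-001", nil
	}

	key := "job-42/step-3"
	first, _ := ledger.Invoke(key, charge)
	second, _ := ledger.Invoke(key, charge) // served from the ledger
	fmt.Println(first, second, charges)
}
```

The trade-off of at-most-once is visible in `ErrInDoubt`: a crash between recording the intent and recording the result leaves the step for an operator or a compensation path to resolve, rather than risking a duplicate call.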
StepOutcome Semantics
Each step produces exactly one outcome:
| Outcome | Meaning |
|---|---|
| Pure | No side effects; safe to replay |
| SideEffectCommitted | World changed; must not re-execute |
| Retryable | Failure, world unchanged; retry allowed |
| PermanentFailure | Failure; job cannot continue |
| Compensated | Rollback applied; terminal state |
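The table maps naturally onto a small enum with helper predicates. The sketch below is an assumption about how a runner might consume these outcomes, not the actual type definition. In particular, treating `PermanentFailure` and `Compensated` as not re-executable is inferred from "job cannot continue" and "terminal state".

```go
package main

import "fmt"

// StepOutcome mirrors the five outcomes in the table above.
type StepOutcome int

const (
	Pure StepOutcome = iota
	SideEffectCommitted
	Retryable
	PermanentFailure
	Compensated
)

var outcomeNames = [...]string{
	"Pure", "SideEffectCommitted", "Retryable", "PermanentFailure", "Compensated",
}

func (o StepOutcome) String() string {
	if o < 0 || int(o) >= len(outcomeNames) {
		return fmt.Sprintf("StepOutcome(%d)", int(o))
	}
	return outcomeNames[o]
}

// CanReExecute reports whether running the step again is safe:
// either nothing changed (Retryable) or there are no side effects (Pure).
func (o StepOutcome) CanReExecute() bool {
	return o == Pure || o == Retryable
}

// IsTerminal reports whether the job stops at this step.
func (o StepOutcome) IsTerminal() bool {
	return o == PermanentFailure || o == Compensated
}

func main() {
	for o := Pure; o <= Compensated; o++ {
		fmt.Printf("%-20s reexecute=%-5t terminal=%t\n", o, o.CanReExecute(), o.IsTerminal())
	}
}
```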
Execution Guarantees
| Guarantee | Description |
|---|---|
| At-Most-Once | Tool calls never repeat, even after crashes |
| Crash Recovery | Agents resume from checkpoints, not from scratch |
| Deterministic Replay | Reproduce any run for debugging or auditing |
| Event Sourcing | Full execution history as append-only event stream |
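Event sourcing and deterministic replay fit together: if state is only ever derived by folding over an append-only log, replaying the same log always yields the same state. The following sketch shows that relationship with hypothetical event types (`job_created`, `step_completed`, and so on) and a hypothetical `Replay` function; the real event schema is not described in this README.

```go
package main

import "fmt"

// Event is one immutable record in the log. The Type values used below
// are illustrative, not the actual Aetheris event names.
type Event struct {
	Seq    int
	Type   string
	Step   string
	Output string
}

// EventLog is an append-only in-memory log.
type EventLog struct{ events []Event }

// Append adds an event with the next sequence number.
func (l *EventLog) Append(typ, step, output string) {
	l.events = append(l.events, Event{Seq: len(l.events) + 1, Type: typ, Step: step, Output: output})
}

// Events returns a copy so callers cannot rewrite history.
func (l *EventLog) Events() []Event {
	return append([]Event(nil), l.events...)
}

// State is the view rebuilt from the log.
type State struct {
	Status    string
	Completed []string
	Outputs   map[string]string
}

// Replay folds events into a State. It reads nothing but its input,
// so the same events always produce the same State.
func Replay(events []Event) State {
	s := State{Status: "unknown", Outputs: map[string]string{}}
	for _, e := range events {
		switch e.Type {
		case "job_created", "job_resumed":
			s.Status = "running"
		case "step_completed":
			s.Completed = append(s.Completed, e.Step)
			s.Outputs[e.Step] = e.Output
		case "job_parked":
			s.Status = "parked"
		case "job_completed":
			s.Status = "completed"
		}
	}
	return s
}

func main() {
	log := &EventLog{}
	log.Append("job_created", "", "")
	log.Append("step_completed", "plan", "3 tasks")
	log.Append("job_parked", "", "")

	state := Replay(log.Events())
	fmt.Println(state.Status, state.Completed, state.Outputs["plan"])
}
```

Because recorded step outputs live in the log, a replay can reuse them instead of calling the LLM or tools again, which is what makes the run reproducible for debugging and audit.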
📈 Why This Matters
LLMs made agents possible.
Aetheris makes agents production-ready.
| Problem | Without Aetheris | With Aetheris |
|---|---|---|
| Worker crash | Restart from beginning | Resume from checkpoint |
| Duplicate calls | Possible ($$$ loss) | Guaranteed at-most-once |
| Debug | Guess what happened | Deterministic replay |
| Audit | Impossible | Full evidence chain |
| Human approval | Wastes resources | Job parks (StatusParked) and resumes on approval |
🧩 Templates & Ecosystem
Start fast with production-ready templates:
| Template | Description |
|---|---|
| Customer Service Agent | Multi-turn support with human approval |
| RAG Assistant | Vector search + LLM with citations |
| Autonomous Researcher | Self-directed research agent |
| Multi-Agent Debate | Multi-agent discussion & consensus |
MCP Gateway — Pre-built tools: GitHub, Filesystem, Web Search, Database
VSCode Extension — Code snippets & syntax highlighting
🌍 Community
Discord • Discussions • Docs
⭐ Star us on GitHub!
📄 License
Apache License 2.0 — free for commercial use.
🙏 Thanks
⭐ Star us. Build production agents. Ship with confidence.