
AgentSkeptic

Verifies AI/agent workflows by checking database state after execution, comparing expected vs observed outcomes with read-only SQL.


AgentSkeptic — state vs trace

Trust reality, not traces.

Declared tool effects vs read-only store facts.

Traces say success. Your data often disagrees. Read-only checks at verify time compare tool claims to stored state—before you ship or bill.

Bundled terminal proof

### Success (`wf_complete`)

workflow_id: wf_complete
workflow_status: complete
trust: TRUSTED: Every step matched the database under the configured verification rules.
steps:
  - seq=0 tool=crm.upsert_contact result=Matched the database.

{
  "schemaVersion": 15,
  "workflowId": "wf_complete",
  "status": "complete",
  "steps": [{ "seq": 0, "toolId": "crm.upsert_contact", "status": "verified" }]
}

### Failure (`wf_missing`)

workflow_id: wf_missing
workflow_status: inconsistent
steps:
  - seq=0 tool=crm.upsert_contact result:Expected row is missing from the database (the log implies a write that is not present).
    reference_code: ROW_ABSENT

{
  "schemaVersion": 15,
  "workflowId": "wf_missing",
  "status": "inconsistent",
  "steps": [
    {
      "seq": 0,
      "toolId": "crm.upsert_contact",
      "status": "missing",
      "reasons": [{ "code": "ROW_ABSENT" }]
    }
  ]
}

How it works

Default path: DecisionGate before you act

Canonical activation takes a BootstrapPackInput v1 file plus a readable database URL. agentskeptic activate writes proof/ under --out on exit codes 0–2; agentskeptic bootstrap is the legacy verb with the same kernel but no proof subtree (see docs/bootstrap-pack-normative.md):

npx agentskeptic activate --input ./path/to/workflow-bootstrap-input.json \
  --db ./path/to/readable.sqlite \
  --out ./path/to/agent-pack

Lifecycle

  1. Keep agentskeptic/tools.json in version control; update when toolId → SQL mapping changes.
  2. Emit observations via the canonical SDK emitter, then append emitted rows to the gate buffer. Optionally mirror the same JSON lines to agentskeptic/events.ndjson for CI replay.
  3. Immediately before any irreversible side effect (ship, bill, ticket close), call await gate.assertSafeForIrreversibleAction() so unsafe or unknown trust never reaches customers.
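Step 3 can be wrapped in a small guard so the irreversible action is only ever started after the gate assertion resolves. The sketch below is illustrative, not part of the package: `IrreversibleGate`, `runIfSafe`, the stub gates, and the error message are hypothetical, and it assumes that `assertSafeForIrreversibleAction()` rejects when trust is unsafe or unknown.

```typescript
// Hypothetical minimal shape; in real use the object comes from skeptic.gate(...).
interface IrreversibleGate {
  assertSafeForIrreversibleAction(): Promise<void>;
}

type GuardedResult<T> =
  | { ran: true; value: T }
  | { ran: false; reason: string };

// Run `action` only if the gate assertion resolves.
// Assumption: the assertion rejects (throws) when trust is unsafe or unknown.
async function runIfSafe<T>(
  gate: IrreversibleGate,
  action: () => Promise<T>,
): Promise<GuardedResult<T>> {
  try {
    await gate.assertSafeForIrreversibleAction();
  } catch (err) {
    // Fail closed: the irreversible action is never started.
    return { ran: false, reason: err instanceof Error ? err.message : String(err) };
  }
  return { ran: true, value: await action() };
}

// Stub gates for illustration only; they stand in for a real gate.
const trustedGate: IrreversibleGate = {
  assertSafeForIrreversibleAction: async () => {},
};
const untrustedGate: IrreversibleGate = {
  assertSafeForIrreversibleAction: async () => {
    throw new Error("stub: trust not established"); // made-up message
  },
};

// Stand-in for a ship / bill / ticket-close side effect.
const billed: string[] = [];
const billCustomer = async (): Promise<string> => {
  billed.push("invoice-1");
  return "invoice-1";
};

const guardedRuns = Promise.all([
  runIfSafe(trustedGate, billCustomer),
  runIfSafe(untrustedGate, billCustomer),
]);
guardedRuns.then(([ok, blocked]) => console.log(ok, blocked, billed));
```

Returning a tagged result instead of rethrowing keeps the "do not act" branch explicit at the call site; rethrowing would work equally well if your caller already handles errors.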

Install

npm install agentskeptic

Code

npx agentskeptic init --framework none --database sqlite --yes

import { join } from "node:path";
import { AgentSkeptic, BufferSink } from "agentskeptic";

const skeptic = new AgentSkeptic({
  registryPath: join("agentskeptic", "tools.json"),
  databaseUrl: join(process.cwd(), "demo.db"),
});

const gate = skeptic.gate({ workflowId: "wf_complete" });
const sink = new BufferSink();
const emitter = skeptic.createEmitter({
  workflowId: "wf_complete",
  sink,
  defaultToolObservedSchemaVersion: 2,
});
await emitter.emitToolObserved({
  toolId: "crm.upsert_contact",
  params: { recordId: "c_ok", fields: { name: "Alice", status: "active" } },
});
await emitter.finalizeRun();
for (const ev of sink.snapshot()) gate.appendRunEvent(ev);
gate.assertEmissionQuality();
// Before an irreversible action:
await gate.assertSafeForIrreversibleAction();

See docs/integrate.md (v2 integrator SSOT) and docs/migrate-2.md for 1.x → 2.0 renames.

Buy vs build: why not only SQL checks

The scar (one pattern, over and over): the trace says the tool succeeded—here crm.upsert_contact / contacts—but the row is missing or wrong. The repo demo names it wf_missing / ROW_ABSENT; the same failure shape applies whenever your registry maps tool activity to SQL state (ledgers, orders, tickets—not only CRM). That is not a logging problem—it is a money and risk problem the moment you ship, bill, close, or treat the run as audit evidence.

Why “we’ll just write SQL checks” stops working

  • Drift: Scripts rot when schemas and workflows change; nobody keeps them current.
  • No ownership: The author leaves; the checks become folklore.
  • Not an org contract: Expectations live in heads and one-off files—not in a shared tools.json + NDJSON contract everyone replays.
  • CI and audit: Ad-hoc checks are skipped locally and rarely ship as repeatable artifacts; when the issue is cross-team or compliance, scripts do not hold. Use CI lock / enforcement when you need pins (docs/ci-enforcement.md).

What you standardize on instead: when the row backs revenue or customer promises, you stop betting the business on whoever wrote the last script. AgentSkeptic is how the org owns the check: one verifier, one replayable contract, Quick → Contract when stakes go up—explore with Quick Verify (docs/quick-verify-normative.md), lock with contract mode and a tools.json registry when “we ran a query” is not evidence (docs/agentskeptic.md). That is the responsible default once the failure mode hurts.

Core mechanism: read-only SQL verifies that your database, at verification time, matches expectations derived from structured tool activity. It does not check whether a trace step “succeeded.”

Read-only checks at verify time—not color.

Advanced

Canonical runnable (same API as README ### Code): after npm run build, run node examples/decision-gate-canonical.mjs.

Try it (about one minute)

This is the fastest way to see ROW_ABSENT versus verified on the same screen—the concrete failure mode the section above is about (bundled CRM-style demo, not your production incident yet).

Prerequisite: Node.js ≥ 22.13 (built-in node:sqlite), or use Docker below.

Fast first run on your own DB (canonical local truth loop): after npm install and npm run build, run:

agentskeptic loop --workflow-id <id> --events <path> --registry <path> --db <sqlitePath>

This single command verifies state, emits TRUSTED / NOT TRUSTED / UNKNOWN, shows a next action when non-trusted, persists local run history, and auto-compares against your latest compatible prior run. Normative operator contract: docs/local-feedback-loop.md.

Advanced compatibility paths: agentskeptic quick, agentskeptic crossing, and agentskeptic verify-integrator-owned remain supported for specialized workflows and CI parity; they are no longer the default local operator path.

npm install
npm start

What you should see: npm start builds, seeds examples/demo.db, and runs two workflows from examples/events.ndjson with examples/tools.json. The first case ends complete / verified; the second inconsistent / missing with reason ROW_ABSENT. That contrast is the product on one screen.

npm install does not compile TypeScript. To run the CLI without npm start, run npm run build first so dist/ exists.

Docker quickstart (optional)

Use this when you want the bundled demo without Node 22.13+ on the host. The repo is bind-mounted so examples/demo.db stays on your machine.

Bash / macOS / Linux (repo root):

docker run --rm -it -v "$PWD:/work" -w /work node:22-bookworm bash -lc "npm install && npm start"

PowerShell (repo root):

docker run --rm -it -v "${PWD}:/work" -w /work node:22-bookworm bash -lc "npm install && npm start"

Minimal model (event → registry → result)

One structured observation (NDJSON line; full schema in Event line schema):

{"schemaVersion":1,"workflowId":"wf_complete","seq":0,"type":"tool_observed","toolId":"crm.upsert_contact","params":{"recordId":"c_ok","fields":{"name":"Alice","status":"active"}}}
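If you replay events yourself, reading such a file amounts to parsing one JSON object per line. The sketch below is a minimal illustration, not the package's ingest code: `ToolObservedEvent` lists only the fields visible in the sample line, `parseToolObserved` is a hypothetical helper, and the second sample line uses a made-up event type purely to show that other lines are skipped.

```typescript
// Hypothetical type: only the fields visible in the sample line above.
interface ToolObservedEvent {
  schemaVersion: number;
  workflowId: string;
  seq: number;
  type: "tool_observed";
  toolId: string;
  params: Record<string, unknown>;
}

// Narrow an arbitrary parsed value to the event shape used here.
function isToolObserved(value: unknown): value is ToolObservedEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.type === "tool_observed" &&
    typeof v.workflowId === "string" &&
    typeof v.toolId === "string" &&
    typeof v.seq === "number" &&
    typeof v.params === "object" &&
    v.params !== null
  );
}

// Parse NDJSON text: one JSON object per non-empty line.
// Lines that are not tool_observed events are skipped, not rejected.
function parseToolObserved(ndjson: string): ToolObservedEvent[] {
  const found: ToolObservedEvent[] = [];
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const parsed: unknown = JSON.parse(trimmed);
    if (isToolObserved(parsed)) found.push(parsed);
  }
  // Order by seq so later checks see steps in emission order.
  return found.sort((a, b) => a.seq - b.seq);
}

const sampleNdjson = [
  '{"schemaVersion":1,"workflowId":"wf_complete","seq":0,"type":"tool_observed","toolId":"crm.upsert_contact","params":{"recordId":"c_ok","fields":{"name":"Alice","status":"active"}}}',
  "",
  '{"type":"made_up_other_event"}', // hypothetical line, only here to be skipped
].join("\n");

const parsedEvents = parseToolObserved(sampleNdjson);
console.log(parsedEvents.map((e) => `${e.seq} ${e.toolId}`)); // [ '0 crm.upsert_contact' ]
```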

Registry entry (excerpt; full file is examples/tools.json) telling the engine how that toolId maps to a row check:

{
  "toolId": "crm.upsert_contact",
  "verification": {
    "kind": "sql_row",
    "table": { "const": "contacts" },
    "identityEq": [{ "column": { "const": "id" }, "value": { "pointer": "/recordId" } }],
    "requiredFields": { "pointer": "/fields" }
  }
}
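One way to read this entry: each `const` is a literal, and each `pointer` is a JSON Pointer (RFC 6901) resolved against the event's `params`. The sketch below shows how such an entry could be turned into a parameterized, read-only SELECT. It is an assumption about the semantics, not the engine's implementation; `ValueSpec`, `resolvePointer`, and `buildRowCheck` are hypothetical names.

```typescript
// Hypothetical mirror of the registry excerpt: a value is a constant
// or a JSON Pointer into the event's params.
type ValueSpec = { const: string } | { pointer: string };

interface SqlRowVerification {
  kind: "sql_row";
  table: ValueSpec;
  identityEq: { column: ValueSpec; value: ValueSpec }[];
  requiredFields: ValueSpec;
}

// Resolve an RFC 6901 JSON Pointer ("/a/b") against a parsed JSON value.
function resolvePointer(doc: unknown, pointer: string): unknown {
  if (pointer === "") return doc;
  let current: unknown = doc;
  for (const raw of pointer.slice(1).split("/")) {
    // Unescape per RFC 6901: "~1" is "/", "~0" is "~".
    const key = raw.replace(/~1/g, "/").replace(/~0/g, "~");
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function resolveSpec(spec: ValueSpec, params: unknown): unknown {
  return "const" in spec ? spec.const : resolvePointer(params, spec.pointer);
}

// Build a parameterized SELECT. Identifiers are checked against a
// conservative pattern; values are bound, never inlined into the SQL.
function buildRowCheck(
  v: SqlRowVerification,
  params: unknown,
): { sql: string; bind: unknown[]; requiredFields: unknown } {
  const ident = (x: unknown): string => {
    if (typeof x !== "string" || !/^[A-Za-z_][A-Za-z0-9_]*$/.test(x)) {
      throw new Error(`unsafe identifier: ${String(x)}`);
    }
    return `"${x}"`;
  };
  const table = ident(resolveSpec(v.table, params));
  const where = v.identityEq.map((eq) => `${ident(resolveSpec(eq.column, params))} = ?`);
  const bind = v.identityEq.map((eq) => resolveSpec(eq.value, params));
  return {
    sql: `SELECT * FROM ${table} WHERE ${where.join(" AND ")}`,
    bind,
    requiredFields: resolveSpec(v.requiredFields, params),
  };
}

const registryEntry: SqlRowVerification = {
  kind: "sql_row",
  table: { const: "contacts" },
  identityEq: [{ column: { const: "id" }, value: { pointer: "/recordId" } }],
  requiredFields: { pointer: "/fields" },
};
const rowCheck = buildRowCheck(registryEntry, {
  recordId: "c_ok",
  fields: { name: "Alice", status: "active" },
});
console.log(rowCheck.sql); // SELECT * FROM "contacts" WHERE "id" = ?
console.log(rowCheck.bind); // [ 'c_ok' ]
```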

When the row matches: workflow result (excerpt; demo prints full JSON to stdout):

{
  "workflowId": "wf_complete",
  "status": "complete",
  "steps": [{ "seq": 0, "toolId": "crm.upsert_contact", "status": "verified" }]
}

When the row is missing or fields disagree, you get inconsistent / missing and reason codes such as ROW_ABSENT.

What this is (and is not)

Retries, partial failures, and race conditions mean a success flag in a trace is not proof the intended row exists with the right values. The engine derives expected state from your registry and events and compares it to observed state with read-only SELECTs.

| This is | This is not |
| --- | --- |
| A SQL ground-truth state check against expectations from structured tool activity | Generic observability, log search, or arbitrary unstructured logs |
| A verifier for persisted state after agent or automation workflows | A test runner for application code |
| Proof that observed DB state matched expectations at verification time | Proof that a tool executed, wrote, or caused that state |

This is for you if you need persisted-row SQL truth after agent or automation runs when the trace looks fine but the DB might not.

This is not for you if you need proof a tool executed, log search as verification, or a model where read-only SQL against your app DB is not the right check. Homepage “for you / not for you” copy lives in website/src/content/productCopy.ts (single source with the site).

Trust boundary (once): a green trace does not prove the row exists with the right values. Verification establishes only whether read-only SELECTs matched expected rows under your rules, not deep causality.

Declared → expected → observed (how reports reason about runs):

  1. Declared — what the captured tool activity encodes (toolId, parameters).
  2. Expected — what should hold in SQL under the rules (in Quick Verify, inferred; in contract mode, registry-driven from events).
  3. Observed — what read-only SQL returned at verification time.
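The three stages above can be sketched end to end with an in-memory table standing in for the database. This is a simplified illustration, not the engine's logic: `reconcile` and its types are hypothetical, a `Map` lookup plays the role of the read-only SELECT, the record id `c_gone` is invented, and the `mismatch` status with its `HYPOTHETICAL_FIELD_MISMATCH` code is a placeholder, since only `verified`, `missing`, and `ROW_ABSENT` appear in the demo output.

```typescript
type Row = Record<string, unknown>;

interface StepOutcome {
  status: "verified" | "missing" | "mismatch"; // "mismatch" is a hypothetical label
  reasons: { code: string }[];
}

// Declared: what the captured tool activity encodes.
interface Declared {
  toolId: string;
  params: { recordId: string; fields: Row };
}

// Expected: derived from declared params (a row with this id and these values).
function expectedFrom(declared: Declared): { id: string; fields: Row } {
  return { id: declared.params.recordId, fields: declared.params.fields };
}

// Observed: stand-in for a read-only SELECT by primary key.
function observe(table: ReadonlyMap<string, Row>, id: string): Row | undefined {
  return table.get(id);
}

function reconcile(declared: Declared, table: ReadonlyMap<string, Row>): StepOutcome {
  const expected = expectedFrom(declared);
  const observed = observe(table, expected.id);
  if (observed === undefined) {
    return { status: "missing", reasons: [{ code: "ROW_ABSENT" }] };
  }
  // Only the fields named in the declaration are compared.
  const differing = Object.keys(expected.fields).filter(
    (key) => observed[key] !== expected.fields[key],
  );
  if (differing.length > 0) {
    return {
      status: "mismatch",
      reasons: differing.map((key) => ({ code: `HYPOTHETICAL_FIELD_MISMATCH:${key}` })),
    };
  }
  return { status: "verified", reasons: [] };
}

const contacts = new Map<string, Row>([
  ["c_ok", { id: "c_ok", name: "Alice", status: "active" }],
]);
const fields: Row = { name: "Alice", status: "active" };

const presentStep = reconcile(
  { toolId: "crm.upsert_contact", params: { recordId: "c_ok", fields } },
  contacts,
);
const absentStep = reconcile(
  { toolId: "crm.upsert_contact", params: { recordId: "c_gone", fields } },
  contacts,
);
console.log(presentStep.status, absentStep.status, absentStep.reasons);
```

Note that the "declared" side never enters the verdict as a success flag; only the comparison between expected and observed does.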

Contract path (registry + events)

CLI: after npm install and npm run build, use agentskeptic loop as the default local command (or node dist/cli.js loop). Postgres: --postgres-url instead of --db (exactly one).

Typical integration:

  1. Emit one NDJSON line per tool observation (see Event line schema).
  2. Add a registry entry per toolId (start from examples/templates/).
  3. Run the local truth loop:

npm run build
agentskeptic loop --workflow-id <id> --events <path> --registry <path> --db <sqlitePath>

Replay the bundled files: wf_complete / examples/events.ndjson / examples/tools.json / examples/demo.db (same flags as above).

From source without agentskeptic on PATH: node dist/cli.js with the same flags.

Why SQLite in the demo: file-backed ground truth with no extra services. The demo (re)creates examples/demo.db; verification still uses read-only SQL.

Quick Verify and assurance (optional)

Quick Verify (agentskeptic quick): inferred checks, no registry file; provisional, not audit-final—graduate to contract mode for explicit per-tool expectations. Full contract: docs/quick-verify-normative.md.

Input contract: We only accept structured tool activity—JSON or NDJSON that describes tool calls and parameters our ingest model can extract—not arbitrary logs, traces, or unstructured observability text. Verification uses read-only SQL against your database; API-only or non-SQL systems are out of scope for this tool.

npm run build
agentskeptic quick --input test/fixtures/quick-verify/pass-line.ndjson --db examples/demo.db --export-registry ./quick-export.json

For Postgres, use --postgres-url instead of --db. Passing - as --input reads from stdin.

Assurance (assurance run / assurance stale): multi-scenario sweeps and staleness over saved reports; success paths emit one AssuranceOutputV1 JSON line on stdout (embedded runReport). See Assurance subsystem and examples/assurance/manifest.json.

Sample output (contract demo)

The npm start driver prints the human report and the workflow JSON to stdout (one stream for the demo). The normal CLI writes machine JSON to stdout and the human report to stderr (see Human truth report). Full success/failure transcripts (same strings as below) are in the acquisition fold at the top of this README.

Success (wf_complete)

Interpretation: Under the configured rules, expected state matched observed SQL for this step—state alignment, not proof of execution.

Failure (wf_missing)

Interpretation: Expected state from the tool activity implied a row observed SQL did not find—inconsistent—a gap traces alone often miss. Still not proof a write was attempted or rolled back.

How this differs from logs, tests, and observability

| Approach | What it tells you |
| --- | --- |
| Logs / traces | A step ran, duration, errors; not “row X has columns Y.” |
| Unit / integration tests | Code paths in your repo, not production agent runs against live DB state. |
| Metrics / APM | Health and latency, not semantic equality of persisted records. |
| Ad-hoc SQL checks / one-off scripts | Same failure mode as Buy vs build: drift, weak ownership, not a durable contract. |
| agentskeptic | Whether observed SQL matches expectations from declared tool parameters (contract mode), via read-only SQL; not proof the tool executed. |

When to run it

Run after a workflow (or CI replay of its log), before you treat the outcome as safe for customer-facing or regulated actions.

Inputs: NDJSON observations, registry JSON, read-only SQLite or Postgres. Semantics: docs/relational-verification.md.

Typical uses: block a release, trigger human review, open an incident, or attach a verification artifact to an audit trail.
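For uses like blocking a release, a wrapper can read the workflow result JSON and fail closed on anything other than a fully verified run. The sketch below assumes only the fields shown in the sample results in this README; `WorkflowResult`, `Decision`, and `decide` are hypothetical names, not part of the package.

```typescript
// Hypothetical type covering only the fields shown in the sample results.
interface WorkflowResult {
  workflowId: string;
  status: string;
  steps: {
    seq: number;
    toolId: string;
    status: string;
    reasons?: { code: string }[];
  }[];
}

type Decision =
  | { action: "proceed" }
  | { action: "block"; summary: string };

// Fail closed: proceed only when the workflow is complete and every step
// is verified; any other or unrecognized status blocks.
function decide(result: WorkflowResult): Decision {
  const allVerified = result.steps.every((s) => s.status === "verified");
  if (result.status === "complete" && allVerified) {
    return { action: "proceed" };
  }
  const failing = result.steps
    .filter((s) => s.status !== "verified")
    .map((s) => {
      const codes = (s.reasons ?? []).map((r) => r.code).join(",");
      return `seq=${s.seq} ${s.toolId}: ${s.status} [${codes}]`;
    });
  return {
    action: "block",
    summary: `${result.workflowId} is ${result.status}; ${failing.join("; ")}`,
  };
}

const completeRun: WorkflowResult = {
  workflowId: "wf_complete",
  status: "complete",
  steps: [{ seq: 0, toolId: "crm.upsert_contact", status: "verified" }],
};
const missingRun: WorkflowResult = {
  workflowId: "wf_missing",
  status: "inconsistent",
  steps: [
    {
      seq: 0,
      toolId: "crm.upsert_contact",
      status: "missing",
      reasons: [{ code: "ROW_ABSENT" }],
    },
  ],
};

const proceedDecision = decide(completeRun);
const blockDecision = decide(missingRun);
console.log(proceedDecision);
console.log(blockDecision);
// blockDecision.summary:
// "wf_missing is inconsistent; seq=0 crm.upsert_contact: missing [ROW_ABSENT]"
```

In CI, the `block` branch would typically map to a nonzero exit code and the summary to the job log or an incident note.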

CI with over-time guarantees: use stateful agentskeptic enforce baseline/check/accept lifecycle—docs/ci-enforcement.md.

Further capabilities (reference)

Everything beyond core contract verification lives in docs/agentskeptic.md—subcommands, hooks, bundles, debug, plan transition, human report layout, exit codes.

Documentation map

| Doc | Purpose |
| --- | --- |
| docs/contract.md | Verification Contract Manifest SSOT: names, hashes, and versions the event/registry/registry-export schemas; one URL, one CI gate |
| docs/epistemic-contract.md | Normative epistemic contract (grounded output vs funnel; ranking limits; telemetry proxies). Single authored source; other docs link or generate from here |
| README (Buy vs build) | Canonical buy vs build narrative (failure mode, scripts limits, Quick → Contract) |
| docs/agentskeptic.md | Authoritative CLI and behavior reference (SSOT) |
| docs/quick-verify-normative.md | Quick Verify normative contract |
| docs/verification-product.md | Product intent, trust boundary, authority matrix |
| docs/reconciliation-vocabulary.md | Reconciliation dimension IDs and UI mapping |
| docs/verification-operational-notes.md | First-run runbooks, TTFV, export vs replay coverage |
| docs/langgraph-reference-boundaries.md | LangGraph reference path: emitter/CLI boundaries and test chain |
| docs/langgraph-checkpoint-trust.md | LangGraph checkpoint trust: v3 wire, terminal contract, shared kernel, production gate |
| docs/relational-verification.md | Relational verification semantics |
| docs/ci-enforcement.md | CI enforcement and lock fixtures |
| docs/correctness-definition-normative.md | Correctness and limits (normative) |

Development and testing

Why SQLite: same note as under Contract path (file-backed demo DB; read-only verification SQL).

npm test runs npm run verification:truth (regeneration + contract gate, Postgres distribution, then full journey suite). Requires DATABASE_URL and TELEMETRY_DATABASE_URL (see website/.env.example). Ordering: docs/testing.md.

Full CI parity (Postgres + Playwright for Debug Console): set POSTGRES_ADMIN_URL and POSTGRES_VERIFICATION_URL, then npm run test:ci. See docs/testing.md, .github/workflows/ci.yml, and: docker run -d --name etl-pg -p 5432:5432 -e POSTGRES_PASSWORD=postgres postgres:16.

Commercial CLI (npm) vs OSS (this repo)

Commercial metering (published npm) uses AGENTSKEPTIC_API_KEY + POST /api/v1/usage/reserve as documented in docs/commercial.md — account-pooled quota per billing month.

OSS/unmetered CLI for single-run verification: clone this repo and use the OSS build (WF_BUILD_PROFILE=oss / default npm run build artifact). Stateful over-time enforcement (agentskeptic enforce) needs the commercial CLI and a paid entitlement.

Canonical write-up: docs/commercial.md (npm package, Stripe, keys, telemetry, validation, entitlements; operator metrics in docs/funnel-observability.md—disable with AGENTSKEPTIC_TELEMETRY=0). OSS builds in this repo run contract verify / quick without a license server for stateless runs. Stateful agentskeptic enforce and over-time guarantees require a commercial build per docs/commercial-enforce-gate-normative.md. Example workflow: examples/github-actions/agentskeptic-commercial.yml.

Status, contributing, security

Maturity: 0.x (package.json). APIs, CLI flags, and JSON schemas may evolve; rely on tests and docs for current contracts.

Contributing: see CONTRIBUTING.md.

Security: see SECURITY.md.

License

Released under the MIT License (see LICENSE).
