Live Evals

Live evals are how the framework learns whether its guidance is actually helping.

They do not need to carry every proof burden alone. When a team wants a tighter claim about timely context surfacing, pair live eval with a deterministic marker benchmark.

The framework already knows:

That evidence gets stronger when the source is explicit about what it measured:

Live eval adds the missing layer:

The Simple Idea

For every AI-guided run, keep two records:

  1. an evidence record of what happened during the run
  2. an eval record of how that run turned out afterward

That second record is what lets the framework measure effectiveness instead of only intent.
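
A minimal sketch of the two per-run records, assuming a flat schema; the field names are illustrative, not the framework's actual schema.

    # A minimal sketch of the two per-run records, assuming a flat schema.
    # Field names are illustrative, not the framework's actual schema.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class EvidenceRecord:
        """What happened during the AI-guided run, captured at run time."""
        run_id: str
        guidance_shown: list[str]     # which rules or recommendations fired
        artifacts: list[str]          # paths or hashes of what the run produced
        started_at: str

    @dataclass
    class EvalRecord:
        """How the run turned out, captured afterward and tied to the evidence."""
        run_id: str                      # must point back at an EvidenceRecord
        outcome: str                     # e.g. "helped", "neutral", "hurt"
        reviewer: Optional[str] = None   # judgment fields can be filled in later
        notes: str = ""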

The first operational workflow should stay explicit:

  1. run report
  2. review the evidence artifact
  3. run eval draft against that artifact
  4. run eval record --draft ...
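
A thin wrapper over those four steps might look like the sketch below. The subcommand names follow the steps above; the `fw` binary name, the artifact and draft arguments, and the review prompt are assumptions, not the framework's actual interface.

    # Sketch of the explicit workflow. Subcommand names follow the steps above;
    # the "fw" binary name and the artifact/draft arguments are assumptions.
    import subprocess

    def run_explicit_eval(artifact_path: str, draft_path: str) -> None:
        # 1. run report: produce the evidence artifact for this run
        subprocess.run(["fw", "report"], check=True)

        # 2. human step: review the evidence artifact before drafting the eval
        input("Review the evidence artifact, then press Enter to continue...")

        # 3. run eval draft against that artifact
        subprocess.run(["fw", "eval", "draft", artifact_path], check=True)

        # 4. run eval record --draft ... to finish the eval
        subprocess.run(["fw", "eval", "record", "--draft", draft_path], check=True)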

The first passive automation layer should still use the same primitives:

  1. run shadow run
  2. let it handle proof, report, and draft creation
  3. only confirm the remaining judgment fields if you want to finish the eval immediately
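
In the same spirit, the shadow layer could be little more than the sketch below: it gathers the proof and report material, writes an eval draft, and leaves the judgment fields empty for a human. The function and field names are hypothetical.

    # Sketch of the passive layer: shadow run handles proof, report, and draft
    # creation, and leaves only the judgment fields for a human to confirm.
    # Function and field names are hypothetical.
    import json

    def shadow_run(run_id: str, capture) -> dict:
        evidence = capture(run_id)      # proof and report material from the run
        draft = {
            "run_id": run_id,
            "evidence": evidence,
            "outcome": None,            # judgment field, left for a human
            "reviewer": None,           # judgment field, left for a human
        }
        with open(f"{run_id}.eval-draft.json", "w") as fh:
            json.dump(draft, fh, indent=2)
        return draft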

The first installable adapter for that layer should stay reviewable:

That capture path should enforce provenance, not merely suggest it:
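
One concrete reading of "enforce" is a capture path that refuses any eval record it cannot tie back to a real evidence artifact. The field names and the hashing choice in the sketch below are assumptions.

    # Sketch of provenance enforcement: reject any eval record that does not
    # match the evidence artifact it claims to describe. Field names and the
    # use of a SHA-256 digest are assumptions.
    import hashlib
    import os

    class ProvenanceError(Exception):
        pass

    def require_provenance(eval_record: dict, evidence_path: str) -> None:
        if not os.path.exists(evidence_path):
            raise ProvenanceError("eval record points at no evidence artifact")
        with open(evidence_path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        if eval_record.get("evidence_sha256") != digest:
            raise ProvenanceError("eval record does not match the evidence it cites")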

What Should Be Measured

The first live-eval version should stay practical.

Measure:

This is enough to tell whether the framework is reducing work or only moving it around.
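
Whatever the first measure list settles on, the computation over eval records can stay this simple. The outcome labels and rate names below are placeholders, not the agreed measures.

    # Sketch of turning eval records into a summary, whatever the final measure
    # list becomes. The outcome labels and rate names are placeholders.
    from collections import Counter

    def summarize(eval_records: list[dict]) -> dict:
        outcomes = Counter(r["outcome"] for r in eval_records if r.get("outcome"))
        total = sum(outcomes.values())
        if total == 0:
            return {"evaluated_runs": 0}
        return {
            "evaluated_runs": total,
            "helped_rate": outcomes["helped"] / total,
            "hurt_rate": outcomes["hurt"] / total,
        }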

Why This Matters

Without live eval, teams tend to argue from anecdotes:

With live eval, those become measurable questions.

And when the question is narrower, the framework should support narrower proof.

For example:

That is why a marker benchmark belongs next to live eval rather than inside it:

That is the bridge from framework intuition to framework proof.
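
A deterministic marker benchmark for timely context surfacing can be as small as the sketch below: plant a known marker, then check whether it is surfaced before a fixed step. The guidance_fn interface and the marker value are assumptions.

    # Sketch of a deterministic marker benchmark for timely context surfacing:
    # plant a known marker and check that it is surfaced before a fixed step.
    # The guidance_fn interface and the marker value are assumptions.
    def marker_surfaced_in_time(guidance_fn, deadline_step: int = 3) -> bool:
        marker = "MARKER-7f3a2c"
        context = ["routine note", marker, "another routine note"]
        for step in range(deadline_step):
            surfaced = guidance_fn(context, step)   # whatever is surfaced at this step
            if marker in surfaced:
                return True                         # surfaced in time: pass
        return False                                # never surfaced before the deadline: fail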

Team Tuning

The goal is not to make every team behave the same way.

The goal is to let each team tune the framework while keeping the structure stable.

That is why live eval pairs with a team profile:

Examples of team-level tuning:

Keep the rollout staged.
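
A team profile can hold that tuning explicitly while the structure underneath stays fixed. The knobs shown below are illustrative placeholders, not the framework's actual tuning set.

    # Sketch of a team profile: explicit, reviewable tuning over a stable
    # structure. The knobs shown here are illustrative placeholders.
    from dataclasses import dataclass, field

    @dataclass
    class TeamProfile:
        team: str
        mode: str = "shadow"                                   # staged: "shadow" -> "assist" -> "gate"
        warn_thresholds: dict = field(default_factory=dict)    # calibrated from eval data
        required_eval_fields: tuple = ("outcome",)

    backend = TeamProfile(team="backend", mode="assist",
                          warn_thresholds={"missing-evidence": 0.2})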

Phase 1: Shadow Mode

Collect eval records without changing enforcement.

Use this phase to learn:

The first shipped capture path should stay lightweight and manual enough that teams can trust it:

Phase 2: Assist Mode

Use the eval data to calibrate warnings and recommendations.

This phase should help teams:

Phase 3: Gate Mode

Only after a rule has proven useful in practice should it become a stronger gate.

That keeps the framework from hardening around assumptions that were never validated.
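
Put together, the three phases can hang off a single enforcement switch, and a rule only moves up a level once its eval records support it. The mode names and return values below are illustrative.

    # Sketch of the staged rollout as a single enforcement switch. A rule is
    # promoted to "gate" only after its eval records show it earns that strength.
    # Mode names and return values are illustrative.
    def enforce(rule_id: str, violated: bool, mode: str) -> str:
        if not violated:
            return "pass"
        if mode == "shadow":
            return "record"     # Phase 1: collect eval records, change nothing
        if mode == "assist":
            return "warn"       # Phase 2: calibrated warning, still no blocking
        return "block"          # Phase 3: the rule has proven useful, now it gates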

Differentiator

Most AI tooling stops at generation, orchestration, or pass/fail checks.

This framework can set itself apart because it can answer:

That combination of focus, auditability, and measurable tuning is the real differentiator.