# Live Evals
Live evals are how the framework learns whether its guidance is actually helping.
They do not need to carry every proof burden alone. When a team wants a tighter claim about timely context surfacing, pair live eval with a deterministic marker benchmark.
The framework already knows:
- what part of the repo changed
- what policy applied
- what proof lane ran
That evidence gets stronger when the source is explicit about what it measured:
- explicit files
- a branch diff
- the current working tree
Live eval adds the missing layer:
- was the guidance useful
- did the human accept the result
- which rules created friction
- which misses still slipped through
## The Simple Idea
For every AI-guided run, keep two records:
- an evidence record for what happened during the run
- an eval record for how that run turned out afterward
That second record is what lets the framework measure effectiveness instead of only intent.
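A minimal sketch of that pairing, assuming hypothetical field names (this document does not fix a schema); the two records share a run id so outcomes can be joined back to evidence:

```python
from dataclasses import dataclass

# Illustrative only: every field name here is an assumption, not the
# framework's actual schema.

@dataclass
class EvidenceRecord:
    """What happened during the run."""
    run_id: str
    changed_paths: list[str]   # what part of the repo changed
    policy_applied: str        # what policy applied
    proof_lane: str            # what proof lane ran
    source: str                # "files", "branch-diff", or "working-tree"

@dataclass
class EvalRecord:
    """How that run turned out afterward."""
    run_id: str                # joins back to the evidence record
    accepted: bool             # did the human accept the result?
    notes: str = ""
```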
The first operational workflow should stay explicit:
- run `report`
- review the evidence artifact
- run `eval draft` against that artifact
- run `eval record --draft ...`
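As a sketch, that workflow can be driven by a small script. The subcommands come from this document; the `fw` binary name and the draft path are stand-ins, since the document does not name the CLI entry point:

```python
import subprocess

# Hypothetical driver for the explicit workflow.
def run_explicit_workflow(draft_path: str) -> None:
    subprocess.run(["fw", "report"], check=True)         # produce the evidence artifact
    # ... human reviews the evidence artifact here ...
    subprocess.run(["fw", "eval", "draft"], check=True)  # draft an eval against it
    subprocess.run(["fw", "eval", "record", "--draft", draft_path], check=True)
```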
The first passive automation layer should still use the same primitives:
- run `shadow run`
- let it handle proof, report, and draft creation
- only confirm the remaining judgment fields if you want to finish the eval immediately
The first installable adapter for that layer should stay reviewable:
- tracked `.githooks/*` files
- explicit `core.hooksPath` wiring
- no hidden global git mutation by default
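A sketch of what explicit wiring can mean in practice: `core.hooksPath` is a real git setting, and pointing it at a tracked directory keeps the adapter reviewable. The installer function itself is hypothetical:

```python
import subprocess

# Hypothetical installer: point this repository, and only this repository,
# at the tracked hooks directory. No global git state is touched.
def install_hooks_adapter(repo_root: str = ".") -> None:
    subprocess.run(
        ["git", "-C", repo_root, "config", "core.hooksPath", ".githooks"],
        check=True,
    )
```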
That capture path should enforce provenance, not merely suggest it:
- repo-local evidence artifact
- repo-local eval draft artifact
- copied source metadata and immutable digest
- team-profile-aware reviewer confidence
- explicit overwrite only with `--force`
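A sketch of the digest and overwrite guard; JSON artifacts and SHA-256 are assumptions here, not anything this document specifies:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical writer: compute an immutable content digest and refuse to
# overwrite an existing artifact unless the caller explicitly forces it.
def write_eval_draft(path: Path, draft: dict, force: bool = False) -> str:
    if path.exists() and not force:
        raise FileExistsError(f"{path} exists; rerun with --force to overwrite")
    payload = json.dumps(draft, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    path.write_bytes(payload)
    return digest
```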
## What Should Be Measured
The first live-eval version should stay practical.
Measure:
- `accepted_without_major_rewrite`: did the team mostly keep the AI output?
- `time_to_green_minutes`: how long did it take to reach an accepted, verified state?
- `override_count`: how often did humans bypass or waive framework guidance?
- `false_positive_rules`: which rules were too strict for this run?
- `missed_issues`: what did the framework fail to catch?
- `reviewer_confidence`: did the evidence artifact make review easier?
This is enough to tell whether the framework is reducing work or only moving it around.
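As one possible shape, those fields could sit on the eval record like this; the field names follow the list above, and the types are assumptions:

```python
from dataclasses import dataclass

# Hypothetical shape for the measured fields.
@dataclass
class EvalMetrics:
    accepted_without_major_rewrite: bool  # did the team mostly keep the output?
    time_to_green_minutes: float          # time to an accepted, verified state
    override_count: int                   # bypassed or waived guidance
    false_positive_rules: list[str]       # rules too strict for this run
    missed_issues: list[str]              # what the framework failed to catch
    reviewer_confidence: int              # e.g. a 1-5 reviewer rating
```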
## Why This Matters
Without live eval, teams tend to argue from anecdotes:
- “this rule feels noisy”
- “the agent seems better now”
- “review still takes too long”
With live eval, those become measurable questions.
- Which rule produces the most waivers?
- Which lane reaches green fastest?
- Which misses keep showing up in human review?
- Which guidance patterns correlate with faster acceptance?
And when the question is narrower, the framework should support narrower proof.
For example:
- did a repo-specific migration warning surface on the first relevant response?
- did it appear only after the trigger evidence arrived?
- did it land on the tagged relevant response itself, rather than merely anywhere before a looser scoring window closed?
- did Veritas improve that timing compared with the same run shape without Veritas context?
That is why a marker benchmark belongs next to live eval rather than inside it:
- live eval measures team usefulness and acceptance
- marker benchmarks measure timely surfacing accuracy with deterministic scoring
- marker suites measure whether those wins hold across multiple marker classes and repeated trials
That is the bridge from framework intuition to framework proof.
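A deterministic scorer for a single marker trial could look like this sketch; the transcript format, with `trigger` and `relevant` tags on responses, is an assumption:

```python
# Hypothetical transcript: an ordered list of responses, one tagged as the
# trigger evidence and one tagged as the relevant response.
def score_marker(responses: list[dict], marker: str) -> dict:
    first_hit = next((i for i, r in enumerate(responses) if marker in r["text"]), None)
    trigger = next((i for i, r in enumerate(responses) if r.get("trigger")), None)
    tagged = next((i for i, r in enumerate(responses) if r.get("relevant")), None)
    return {
        "surfaced": first_hit is not None,
        # the marker only counts once the trigger evidence has arrived
        "after_trigger": first_hit is not None and trigger is not None and first_hit >= trigger,
        # strict form: it must land on the tagged relevant response itself
        "on_tagged_response": first_hit is not None and first_hit == tagged,
    }
```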
## Team Tuning
The goal is not to make every team behave the same way.
The goal is to let each team tune the framework while keeping the structure stable.
That is why live eval pairs with a team profile:
- the eval record captures what happened
- the team profile captures how a team wants the framework to behave
Examples of team-level tuning:
- how strict new rules should be by default
- whether warnings should block in CI
- what counts as a major rewrite
- when human signoff is required
- which proof lanes are mandatory before promotion
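A team profile might be as small as this sketch; every field name here is illustrative, not part of any stated schema:

```python
from dataclasses import dataclass, field

# Hypothetical team profile: stable structure, team-tunable values.
@dataclass
class TeamProfile:
    new_rule_strictness: str = "warn"     # how strict new rules start out
    warnings_block_ci: bool = False       # should warnings block in CI?
    major_rewrite_threshold: float = 0.5  # fraction changed that counts as major
    signoff_required_for: list[str] = field(default_factory=lambda: ["promotion"])
    mandatory_proof_lanes: list[str] = field(default_factory=list)
```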
## Recommended Rollout
Keep the rollout staged.
### Phase 1: Shadow Mode
Collect eval records without changing enforcement.
Use this phase to learn:
- which rules are noisy
- which evidence fields matter to reviewers
- where the framework is still missing coverage
The first shipped capture path should stay lightweight and manual enough that teams can trust it:
- one evidence artifact in
- one eval artifact out
- no hidden blocking behavior
### Phase 2: Assist Mode
Use the eval data to calibrate warnings and recommendations.
This phase should help teams:
- soften noisy rules
- promote reliable rules
- improve evidence summaries
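One way that calibration could work, sketched over eval records stored as plain dicts (the storage format is an assumption):

```python
from collections import Counter

# Hypothetical calibration pass: rank rules by how often eval records flagged
# them as false positives, so the noisiest ones can be softened first.
def noisy_rules(eval_records: list[dict], min_flags: int = 5) -> list[str]:
    flags = Counter()
    for record in eval_records:
        flags.update(record.get("false_positive_rules", []))
    return [rule for rule, count in flags.items() if count >= min_flags]
```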
### Phase 3: Gate Mode
Only after a rule has proven useful in practice should it become a stronger gate.
That keeps the framework from hardening around assumptions that were never validated.
## Differentiator
Most AI tooling stops at generation, orchestration, or pass/fail checks.
This framework can stand apart because it can answer:
- what did the AI do?
- what policy applied?
- what proof ran?
- how did that guidance perform for this team over time?
That combination of focus, auditability, and measurable tuning is the real differentiator.