Benchmarking Methodology

The benchmark examples under examples/benchmarks/ are deterministic comparisons built to answer one question:

Does Veritas surface the right repo-specific warning or context at the right time?

Core Terms

How A Trial Works

One trial consists of:

  1. a scenario JSON file
  2. one baseline session log
  3. one treatment session log
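
The three inputs listed above can be pictured as a small record. The sketch below is illustrative only: the Trial class, its field names, and the missing_inputs helper are hypothetical and are not part of Veritas. It shows one way to group the files of a trial and check that all three are present before a comparison is run.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Trial:
    """Hypothetical container for the three inputs of one trial (not a Veritas API)."""

    scenario: Path   # the scenario JSON file
    baseline: Path   # the baseline session log
    treatment: Path  # the treatment session log

    def missing_inputs(self) -> list[str]:
        """Return the names of inputs whose files do not exist on disk."""
        return [
            name
            for name in ("scenario", "baseline", "treatment")
            if not getattr(self, name).is_file()
        ]
```

A caller could build a Trial from three paths and refuse to run the comparison when missing_inputs() returns a non-empty list, so an incomplete trial is reported by name instead of failing partway through.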

veritas feedback marker compares those two session logs against the same scenario and reports:

What “Without Veritas” Means

It means the model answered without the repo-local Veritas context that would normally surface the repo-specific concern.

What “With Veritas” Means

It means the model answered with the relevant Veritas context in place, so the benchmark can test whether that context improved timing and correctness.

Suite Metrics

veritas feedback marker-suite aggregates multiple trials and benchmark groups. The suite report includes:

Interpreting The Examples

The shipped examples are not claims about universal model behavior. They are meant to show:

The suite now also includes governance-specific benchmark classes: