Benchmarking Methodology

The benchmark fixtures under examples/benchmarks/ are deterministic comparisons designed to answer one question:

Does Veritas surface the right repo-specific warning or context at the right time?

Core Terms

How A Trial Works

One trial consists of:

  1. a scenario JSON file
  2. one baseline transcript
  3. one treatment transcript
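The three inputs listed above can be pictured as one small record. The following is a minimal sketch, not the Veritas implementation: the Trial class, the load_trial helper, and the assumption that transcripts are plain UTF-8 text files are all hypothetical and are used only to illustrate the shape of a trial.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Trial:
    """Hypothetical container for the three inputs of one trial."""

    scenario: dict   # parsed scenario JSON (schema not assumed here)
    baseline: str    # transcript produced without Veritas
    treatment: str   # transcript produced with Veritas


def load_trial(scenario_path, baseline_path, treatment_path) -> Trial:
    """Read all three files; a missing file raises FileNotFoundError."""
    scenario = json.loads(Path(scenario_path).read_text(encoding="utf-8"))
    baseline = Path(baseline_path).read_text(encoding="utf-8")
    treatment = Path(treatment_path).read_text(encoding="utf-8")
    return Trial(scenario=scenario, baseline=baseline, treatment=treatment)
```

Keeping both transcripts next to the same scenario in one record reflects the point of the design: the baseline and the treatment are only comparable because they answer the same scenario.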

veritas eval marker compares those two transcripts against the same scenario and reports:

What “Without Veritas” Means

It means the model answered without the repo-local adapter, policy-pack, or related Veritas framing that would normally surface the repo-specific concern.

What “With Veritas” Means

It means the model answered with the relevant Veritas context in place, so the benchmark can test whether that context improved timing and correctness.
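To make the baseline and treatment distinction concrete, here is a minimal sketch of one way a marker comparison could be expressed. It is an illustration under stated assumptions, not how veritas eval marker is implemented: the function names, the idea of a single expected marker string, and the use of line position as a stand-in for timing are all hypothetical.

```python
from typing import Optional


def first_marker_line(transcript: str, marker: str) -> Optional[int]:
    """Return the 1-based line where the marker first appears, or None."""
    needle = marker.lower()
    for number, line in enumerate(transcript.splitlines(), start=1):
        if needle in line.lower():
            return number
    return None


def compare(baseline: str, treatment: str, marker: str) -> dict:
    """Compare where (and whether) each transcript surfaces the marker."""
    base = first_marker_line(baseline, marker)
    treat = first_marker_line(treatment, marker)
    return {
        "baseline_line": base,
        "treatment_line": treat,
        # True when the treatment surfaces the marker and the baseline
        # either never does or does so on a later line.
        "treatment_surfaced_earlier": treat is not None
        and (base is None or treat < base),
    }


if __name__ == "__main__":
    without = "Sure, here is the change.\nDone."
    with_veritas = "Warning: migrations must be reviewed.\nHere is the change."
    print(compare(without, with_veritas, "migrations must be reviewed"))
```

In this sketch the baseline never mentions the marker, while the treatment raises it on its first line, which is the kind of difference in timing and correctness the benchmark is meant to expose.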

Suite Metrics

veritas eval marker-suite aggregates multiple trials and benchmark groups. The suite report includes:

Interpreting The Fixtures

The shipped fixtures are examples, not claims about universal model behavior. They are meant to show:

The suite also includes governance-specific benchmark classes: