Benchmarking Methodology
The benchmark examples under examples/benchmarks/ are deterministic comparisons for one question:
Does Veritas surface the right repo-specific warning or context at the right time?
Core Terms
- Scenario: the benchmark definition, including the marker phrases and the scoring window
- Without Veritas session log: a baseline session log produced without Veritas context
- With Veritas session log: a treatment session log produced with Veritas context
- Trigger tag: the user-turn marker that starts the scoring window
- Response window tag: an optional assistant-turn marker that narrows which answer is judged
How A Trial Works
One trial consists of:
- a scenario JSON file
- one baseline session log
- one treatment session log
veritas feedback marker compares those two session logs against the same scenario and reports:
- whether the marker surfaced
- whether it surfaced on time
- whether it was a false positive
- whether the treatment beats the baseline
What “Without Veritas” Means
It means the model answered without the repo-local Veritas context that would normally surface the repo-specific concern.
What “With Veritas” Means
It means the model answered with the relevant Veritas context in place, so the benchmark can test whether that context improved timing and correctness.
Suite Metrics
veritas feedback marker-suite aggregates multiple trials and benchmark groups. The suite report includes:
- scenario count
- pair count
- baseline and treatment pass rates
- improvement rate
- false-positive rates
- treatment latency summaries
- grouped reliability metrics such as
pass_at_1,pass_at_k, andpass_pow_k
Interpreting The Examples
The shipped examples are not claims about universal model behavior. They are meant to show:
- the scoring contract
- the expected session log tagging pattern
- how to structure repeatable benchmark evidence inside a repo
The current suite now includes governance-specific benchmark classes as well:
- attempted protected-standards weakening, where the right behavior is to surface authority requirements
- standards growth, where the right behavior is to surface that additive advisory growth is safe