Benchmark suite

Every methodology version is run against a versioned suite of SM&CR scenarios before it ships. A regression blocks the version bump. This page documents the suite, the metrics, and the scope of what it tests.

Last updated: 31 May 2026

Why publish this

A pre-merge regression test is the most concrete defence against silent methodology drift. We publish the existence and structure of that suite so a CCO evaluating CoverProof can verify “we benchmark our prompts” as a structural commitment rather than a marketing claim.

We keep the raw fixture content internal so the suite remains a reliable regression detector and not part of any model’s training surface.

The suite at a glance

Three synthetic SM&CR fixtures covering distinct firm archetypes: fully-covered baseline (all Low expected), high-gap boutique (mixed High/Medium/Low expected), grey-area edge cases (deliberately ambiguous roles).
Per-row legal rationale. Each fixture row documents the rationale for its expected verdict, stating which statutory test the expected classification reflects.
Versioned outputs. Every benchmark run is written to a JSON file under scripts/benchmark-results/{methodologyVersion}-{date}.json recording the methodology version, timestamp, and per-row result.
Diff tool. Two result files can be diffed to surface REGRESSION, IMPROVEMENT, DRIFT, and UNCHANGED rows. Regressions block a methodology-version bump.
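
As an illustration of the diff step, here is a minimal Python sketch. It is not the actual tool: the RowResult fields, the verdict strings, and the exact meaning of each label are assumptions made for the example. In particular, DRIFT is assumed to mean "the verdicts changed but the pass/fail outcome did not".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RowResult:
    # Hypothetical per-row record; the field names are assumptions,
    # not the real result-file schema.
    row_id: str
    s250_status: str
    governance_coverage: str
    passed: bool  # True when both verdicts match the expected result


def classify_row(before: RowResult, after: RowResult) -> str:
    """Label one row by comparing the same scenario across two runs."""
    if before.passed and not after.passed:
        return "REGRESSION"
    if not before.passed and after.passed:
        return "IMPROVEMENT"
    # Assumed meaning of DRIFT: verdicts moved, pass/fail outcome did not.
    before_verdicts = (before.s250_status, before.governance_coverage)
    after_verdicts = (after.s250_status, after.governance_coverage)
    return "DRIFT" if before_verdicts != after_verdicts else "UNCHANGED"


def diff_runs(before: List[RowResult], after: List[RowResult]) -> Dict[str, str]:
    """Label every row that appears in both runs, keyed by row_id."""
    after_by_id = {row.row_id: row for row in after}
    return {
        row.row_id: classify_row(row, after_by_id[row.row_id])
        for row in before
        if row.row_id in after_by_id
    }


def blocks_version_bump(labels: Dict[str, str]) -> bool:
    """A single regression is enough to block the methodology-version bump."""
    return "REGRESSION" in labels.values()
```

Keeping the classification in a pure function of two row records means the gate can be exercised without reading any result files.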

Metrics computed

Pass rate per fixture. Share of rows in each fixture where both AI verdicts (s.250 status, governance coverage) match the documented expected result.
Disagreement breakdown. Where the AI disagrees with the expected result, the disagreement is classified by kind: s.250 status, governance coverage, confidence delta > 30pp.
Methodology version delta. Comparison of pass rate and disagreement breakdown across two methodology versions.
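
The three metrics above could be computed roughly as follows. This is a hedged sketch, not the production code: the row dictionary keys are hypothetical, and because this page does not say what the confidence delta is measured against, the example assumes each row records an expected confidence in percentage points.

```python
from collections import Counter
from typing import Dict, List

CONFIDENCE_DELTA_THRESHOLD_PP = 30  # the "> 30pp" rule from the metric list


def row_passes(row: dict) -> bool:
    """A row passes only when both verdicts match the expected result."""
    return (
        row["s250_status"] == row["expected_s250_status"]
        and row["governance_coverage"] == row["expected_governance_coverage"]
    )


def disagreement_kinds(row: dict) -> List[str]:
    """List every kind of disagreement on one row (key names are assumed)."""
    kinds = []
    if row["s250_status"] != row["expected_s250_status"]:
        kinds.append("s250_status")
    if row["governance_coverage"] != row["expected_governance_coverage"]:
        kinds.append("governance_coverage")
    # Assumption: the delta is model confidence vs. a recorded expected confidence.
    delta = abs(row["confidence_pp"] - row["expected_confidence_pp"])
    if delta > CONFIDENCE_DELTA_THRESHOLD_PP:
        kinds.append("confidence_delta")
    return kinds


def fixture_metrics(rows: List[dict]) -> dict:
    """Pass rate and disagreement breakdown for one fixture."""
    passed = sum(1 for row in rows if row_passes(row))
    breakdown = Counter(kind for row in rows for kind in disagreement_kinds(row))
    return {
        "rows": len(rows),
        "passed": passed,
        "pass_rate": passed / len(rows) if rows else 0.0,
        "breakdown": dict(breakdown),
    }


def version_delta(old: dict, new: dict) -> Dict[str, object]:
    """Compare the metrics of two methodology versions on the same fixture."""
    kinds = sorted(set(old["breakdown"]) | set(new["breakdown"]))
    return {
        "pass_rate_delta": new["pass_rate"] - old["pass_rate"],
        "breakdown_delta": {
            kind: new["breakdown"].get(kind, 0) - old["breakdown"].get(kind, 0)
            for kind in kinds
        },
    }
```

Note that in this sketch a large confidence delta is counted in the breakdown but does not by itself fail a row, which follows the pass-rate definition above.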

Scope of the regression suite

The benchmark is a regression detector — its job is to catch a methodology version producing worse verdicts than the prior version on known scenarios. Read in that frame:

Targeted, not exhaustive. The current suite is approximately 19 named scenarios across three fixtures, chosen to cover the SM&CR role taxonomy and Section 250 edge cases. A passing run means no regression against those scenarios; it is a quality gate, not a generalisation guarantee.
Synthetic by design. Fixtures are constructed from public role descriptions, which keeps them reproducible, shareable internally, and free of firm-specific data. Real-firm registers add noise, abbreviations, and legacy role labels — the human reviewer queue is where those get caught.
Anchored to the statute. Section 250 has not yet been judicially interpreted. Expected verdicts reflect the verbatim s.250(3) functional test; if a court ever reads the test differently, the methodology and the suite update together.
Internal regression detector. Expected verdicts and prompts are maintained by the same team, which is by design for a fast pre-merge gate. Independent review of the methodology itself is documented separately on the Methodology page.
Designed to sit alongside your review. Every CoverProof classification routes through your compliance team’s reviewer queue before any declaration is sent. The benchmark catches drift in the model; your team catches anything the model missed on your specific register.

What ships publicly vs stays internal

Public: suite structure, metric definitions, the scope above, and the commitment that regressions block a methodology version bump.
Public from launch: aggregate pass rate per methodology version, with the scope statement linked alongside every published number.
Internal only: raw fixture rows, per-row diffs between methodology versions, and the specific edge-case wording — protecting the suite’s value as a regression detector.

Published benchmark caveats

Source limits

Benchmark context is calculated only from coverage-asset rows with complete provenance. It is source context, not complete market coverage or an FCA dataset.

Sample-size caveat

Published benchmark context reflects the current qualified sample only. Low-count, stale, or mixed-provenance samples are suppressed instead of being stretched into market-wide claims.

Freshness

Freshness is tied to the source date range and retrieval time. A new retrieval does not make the underlying register data newer.

Methodology and hashes

Renderable stats must carry a methodology version and output hash prefix. Missing or mixed provenance suppresses the benchmark output.
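
A minimal Python sketch of this suppression rule is shown below. The field names (methodology_version, output_hash_prefix) and the reading of "mixed provenance" as "source rows that do not all share the stat's methodology version" are assumptions for illustration, not the actual schema.

```python
from typing import List, Optional


def renderable(stat: dict, source_rows: List[dict]) -> Optional[dict]:
    """Return the stat if it may be rendered, or None if it must be suppressed."""
    version = stat.get("methodology_version")
    hash_prefix = stat.get("output_hash_prefix")

    # Missing provenance on the stat itself suppresses the output.
    if not version or not hash_prefix:
        return None

    # Assumed reading of "mixed provenance": every source row must carry
    # the same methodology version as the stat. An empty or partially
    # labelled sample is suppressed as well.
    row_versions = {row.get("methodology_version") for row in source_rows}
    if row_versions != {version}:
        return None

    return stat
```

Returning None instead of a partially labelled stat keeps the default outcome "suppress" whenever any provenance field is absent.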

Claim boundary

Published figures are asset-derived aggregate benchmark context only, and the sample-size caveat applies. They offer no certainty and are not an FCA endorsement, complete market coverage, a market benchmark, or a legal conclusion.

No-certainty caveat

Benchmark figures are directional context for interpreting source-role coverage. They do not certify a firm, predict a regulator or court outcome, or prove Section 250 compliance.