CoverProof

Benchmark suite

Every methodology version is run against a versioned suite of SM&CR scenarios before it ships. A regression blocks the version bump. This page documents the suite, the metrics, and the scope of what it tests.

Last updated: 31 May 2026

Why publish this

A pre-merge regression test is the most concrete defence against silent methodology drift. We publish the existence and structure of that suite so a CCO evaluating CoverProof can verify “we benchmark our prompts” as a structural commitment rather than a marketing claim.

We keep the raw fixture content internal so the suite remains a reliable regression detector and not part of any model’s training surface.

The suite at a glance

  • Three synthetic SM&CR fixtures covering distinct firm archetypes: fully-covered baseline (all Low expected), high-gap boutique (mixed High/Medium/Low expected), grey-area edge cases (deliberately ambiguous roles).
  • Per-row legal rationale. The rationale for each expected verdict is documented alongside its fixture row, stating explicitly which statutory test the expected classification reflects.
  • Versioned outputs. Every benchmark run is written to a JSON file under scripts/benchmark-results/{methodologyVersion}-{date}.json, recording the methodology version, timestamp, and per-row result.
  • Diff tool. Two result files can be diffed to surface REGRESSION, IMPROVEMENT, DRIFT, and UNCHANGED rows. Regressions block a methodology-version bump.
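
The diff step described above can be illustrated with a minimal sketch. It is not the CoverProof diff tool. The row shape (id, passed, verdicts), the ISO date format in the file name, the verdict values, and the rules mapping a pair of rows to each label are all assumptions made for illustration; only the four label names and the path template come from this page.

```python
from datetime import date


def result_path(methodology_version: str, run_date: date) -> str:
    # Path template is from the suite description; the ISO date format is assumed.
    return f"scripts/benchmark-results/{methodology_version}-{run_date.isoformat()}.json"


def classify_row(old_row: dict, new_row: dict) -> str:
    # Assumed rules: the pass/fail transition decides REGRESSION and IMPROVEMENT;
    # a changed verdict with the same pass/fail status counts as DRIFT.
    if old_row["passed"] and not new_row["passed"]:
        return "REGRESSION"
    if not old_row["passed"] and new_row["passed"]:
        return "IMPROVEMENT"
    if old_row["verdicts"] != new_row["verdicts"]:
        return "DRIFT"
    return "UNCHANGED"


def diff_runs(old_run: dict, new_run: dict) -> dict:
    # Compare only rows present in both runs (a simplifying assumption).
    old_rows = {row["id"]: row for row in old_run["rows"]}
    return {
        row["id"]: classify_row(old_rows[row["id"]], row)
        for row in new_run["rows"]
        if row["id"] in old_rows
    }


def blocks_version_bump(labels: dict) -> bool:
    # Any regression blocks the methodology-version bump.
    return "REGRESSION" in labels.values()


# Hypothetical rows; the ids and verdict values are placeholders.
old_run = {"methodologyVersion": "1.0.0", "rows": [
    {"id": "row-a", "passed": True, "verdicts": {"s250": "in_scope", "coverage": "Low"}},
    {"id": "row-b", "passed": False, "verdicts": {"s250": "in_scope", "coverage": "High"}},
]}
new_run = {"methodologyVersion": "1.1.0", "rows": [
    {"id": "row-a", "passed": False, "verdicts": {"s250": "out_of_scope", "coverage": "Low"}},
    {"id": "row-b", "passed": False, "verdicts": {"s250": "in_scope", "coverage": "Medium"}},
]}

labels = diff_runs(old_run, new_run)
print(labels, blocks_version_bump(labels))
```

In this sketch a single REGRESSION label is enough to block the bump, which mirrors the gate described above; how the real tool treats rows added or removed between versions is not stated on this page.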

Metrics computed

  • Pass rate per fixture. Number of rows, out of the fixture’s total, where both AI verdicts (s.250 status, governance coverage) match the documented expected result.
  • Disagreement breakdown. Where the AI disagrees with expected, classified by kind: s.250 status, governance coverage, confidence delta > 30pp.
  • Methodology version delta. Comparison of pass rate and disagreement breakdown across two methodology versions.
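
The three metrics above could be computed along these lines. This is a sketch under stated assumptions, not the suite's implementation: the field names (s250_status, governance_coverage, confidence), the idea that each row carries an expected confidence to compare against, and the example values are all hypothetical. The pass rule (both verdicts match) and the 30-percentage-point threshold follow the definitions on this page.

```python
from collections import Counter

CONFIDENCE_DELTA_PP = 30  # threshold from the metric definition, in percentage points


def disagreement_kinds(row: dict) -> list:
    # Field names are assumed; "expected" and "ai" each hold two verdicts and a confidence.
    expected, ai = row["expected"], row["ai"]
    kinds = []
    if ai["s250_status"] != expected["s250_status"]:
        kinds.append("s250_status")
    if ai["governance_coverage"] != expected["governance_coverage"]:
        kinds.append("governance_coverage")
    if abs(ai["confidence"] - expected["confidence"]) > CONFIDENCE_DELTA_PP:
        kinds.append("confidence_delta")
    return kinds


def fixture_metrics(rows: list) -> dict:
    passed = 0
    breakdown = Counter()
    for row in rows:
        kinds = disagreement_kinds(row)
        breakdown.update(kinds)
        # A row passes when both verdicts match; confidence alone does not fail it.
        if "s250_status" not in kinds and "governance_coverage" not in kinds:
            passed += 1
    total = len(rows)
    return {
        "passed": passed,
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "breakdown": dict(breakdown),
    }


def version_delta(old: dict, new: dict) -> dict:
    # Compare pass rate and disagreement breakdown across two methodology versions.
    kinds = set(old["breakdown"]) | set(new["breakdown"])
    return {
        "pass_rate_delta": new["pass_rate"] - old["pass_rate"],
        "breakdown_delta": {
            kind: new["breakdown"].get(kind, 0) - old["breakdown"].get(kind, 0)
            for kind in sorted(kinds)
        },
    }
```

Keeping the confidence check out of the pass rule matches the pass-rate definition, which mentions only the two verdicts; whether the real suite treats a large confidence delta as a failure is not stated here.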

Scope of the regression suite

The benchmark is a regression detector — its job is to catch a methodology version producing worse verdicts than the prior version on known scenarios. Read in that frame:

  • Targeted, not exhaustive. The current suite is approximately 19 named scenarios across three fixtures, chosen to cover the SM&CR role taxonomy and Section 250 edge cases. A passing run means no regression against those scenarios; it is a quality gate, not a generalisation guarantee.
  • Synthetic by design. Fixtures are constructed from public role descriptions, which keeps them reproducible, shareable internally, and free of firm-specific data. Real-firm registers add noise, abbreviations, and legacy role labels — the human reviewer queue is where those get caught.
  • Anchored to the statute. Section 250 has not yet been judicially interpreted. Expected verdicts reflect the verbatim s.250(3) functional test; if a court ever reads the test differently, the methodology and the suite update together.
  • Internal regression detector. Expected verdicts and prompts are maintained by the same team, which is by design for a fast pre-merge gate. Independent review of the methodology itself is documented separately on the Methodology page.
  • Designed to sit alongside your review. Every CoverProof classification routes through your compliance team’s reviewer queue before any declaration is sent. The benchmark catches drift in the model; your team catches anything the model missed on your specific register.

What ships publicly vs stays internal

  • Public: suite structure, metric definitions, the scope above, and the commitment that regressions block a methodology version bump.
  • Public from launch: aggregate pass rate per methodology version, with the scope statement linked alongside every published number.
  • Internal only: raw fixture rows, per-row diffs between methodology versions, and the specific edge-case wording — protecting the suite’s value as a regression detector.
