Trevin Chow d1d347d54b refactor(ce-doc-review): migrate confidence scoring to anchored rubric
Replace the continuous `confidence` field (0.0-1.0 with per-severity
numeric gates) with a 5-anchor integer enum (0 | 25 | 50 | 75 | 100),
each tied to a behavioral criterion the persona can honestly self-apply.
Adopts the structural techniques from Anthropic's official code-review
plugin (anchored rubric, verbatim persona text, explicit false-positive
catalog) while tuning the filter threshold (>= 50) to document-review
economics.

The rubric is embedded verbatim in subagent-template.md so every persona
sees the same behavioral anchors:

- 0: false positive or pre-existing issue (drop)
- 25: might be real but couldn't be verified, or purely stylistic (drop)
- 50: verified but nitpick / advisory (route to FYI subsection)
- 75: double-checked, will hit in practice (actionable tier)
- 100: evidence directly confirms, will happen frequently
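
As a rough illustration of the rubric above, a minimal Python sketch of the anchor enum and the >= 50 filter. The names `Anchor`, `parse_anchor`, and `disposition` are hypothetical and do not come from the plugin. Treating anchor 100 as part of the actionable tier is an assumption, since the rubric only labels 75 that way.

```python
from enum import IntEnum


class Anchor(IntEnum):
    """The five allowed confidence anchors (member names are illustrative)."""
    FALSE_POSITIVE = 0
    UNVERIFIED = 25
    ADVISORY = 50
    ACTIONABLE = 75
    CONFIRMED = 100


# Filter threshold from the commit message: keep findings at anchor >= 50.
FILTER_THRESHOLD = Anchor.ADVISORY


def parse_anchor(value: object) -> Anchor:
    """Accept only the five integer anchors; reject legacy floats such as 0.8."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"confidence must be an integer anchor, got {value!r}")
    return Anchor(value)  # raises ValueError for integers like 60


def disposition(anchor: Anchor) -> str:
    """Map an anchor to its coarse handling per the rubric."""
    if anchor < FILTER_THRESHOLD:
        return "drop"
    if anchor == Anchor.ADVISORY:
        return "fyi"
    # Assumption: 100 is handled in the same actionable tier as 75.
    return "actionable"
```

An integer enum makes off-rubric values such as 60 or 0.8 fail at parse time, so they cannot slip through a numeric comparison.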

Synthesis pipeline changes:

- 3.2 gate: replaced per-severity numeric thresholds (0.50/0.60/0.65/0.75)
  plus 0.40 FYI floor with a single anchor-based rule
- 3.3 dedup: anchor-based tiebreak, falling back to document order
- 3.4 cross-persona: promotes one anchor step (50 to 75, 75 to 100)
  instead of the prior +0.10 magic-number boost
- 3.7 routing: explicit anchor + autofix_class matrix; anchor 75 with
  safe_auto demotes to gated_auto (silent apply reserved for anchor 100)
- 3.8 sort: anchor descending, document order as final tiebreak
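
The promotion, routing, and sort rules above can be sketched as follows. This is a simplified illustration, not the plugin's implementation: the `Finding` shape and the function names are hypothetical, only the routing cells named in this message are modeled (other `autofix_class` values pass through unchanged as an assumption), and any sort keys between anchor and document order are omitted.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    title: str
    anchor: int          # one of 0, 25, 50, 75, 100
    autofix_class: str   # e.g. "safe_auto" or "gated_auto"
    doc_order: int       # position of the finding in the document


def promote(anchor: int) -> int:
    """Cross-persona agreement moves a finding up one anchor step."""
    if anchor in (50, 75):
        return anchor + 25  # 50 -> 75, 75 -> 100
    return anchor           # other anchors are left unchanged


def route(finding: Finding) -> str:
    """Anchor + autofix_class routing, limited to the cells described above."""
    if finding.anchor < 50:
        return "drop"
    if finding.anchor == 50:
        return "fyi"
    if finding.autofix_class == "safe_auto":
        # Silent apply is reserved for anchor 100; anchor 75 is demoted.
        return "safe_auto" if finding.anchor == 100 else "gated_auto"
    return finding.autofix_class  # assumption: other classes pass through


def sort_findings(findings: list[Finding]) -> list[Finding]:
    """Anchor descending, with document order as the tiebreak."""
    return sorted(findings, key=lambda f: (-f.anchor, f.doc_order))
```

Stepping along fixed anchors keeps a promoted score on the rubric, whereas the old +0.10 boost could land anywhere between gates.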

Each persona file carries a calibration section mapping its domain
criteria to the shared anchors. Product-lens and adversarial explicitly
note that their working ceiling is anchor 75 because premise challenges
cannot be verified against ground truth; this is the nature of the
work, not a calibration problem.

Why the threshold differs from Anthropic's >= 80: doc review has no
linter backstop, premise-level concerns resist verification, and the
routing menu makes dismissal cheap. Code review's economics are the opposite
(linter backstop, public PR comments, ground-truth verifiable), so its
threshold should stay higher. ce-code-review migration is deferred.

The reasoning is documented at
docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md
so future contributors can see why the threshold diverges.

Tests: 828 pass (2 new contract assertions added for the schema and
rubric structure). Fixtures updated; no behavioral regressions.
2026-04-21 10:52:24 -07:00