d1d347d54b
Replace the continuous `confidence` field (0.0-1.0 with per-severity numeric gates) with a 5-anchor integer enum (0 | 25 | 50 | 75 | 100), each tied to a behavioral criterion the persona can honestly self-apply. Adopts the structural techniques from Anthropic's official code-review plugin (anchored rubric, verbatim persona text, explicit false-positive catalog) while tuning the filter threshold (>= 50) to document-review economics.

The rubric is embedded verbatim in subagent-template.md so every persona sees the same behavioral anchors:

- 0: false positive or pre-existing issue (drop)
- 25: might be real, couldn't verify, stylistic (drop)
- 50: verified but nitpick / advisory (route to FYI subsection)
- 75: double-checked, will hit in practice (actionable tier)
- 100: evidence directly confirms, will happen frequently

Synthesis pipeline changes:

- 3.2 gate: replaced per-severity numeric thresholds (0.50/0.60/0.65/0.75) plus 0.40 FYI floor with a single anchor-based rule
- 3.3 dedup: anchor-based tiebreak, falling to document order
- 3.4 cross-persona: promotes one anchor step (50 to 75, 75 to 100) instead of the prior +0.10 magic-number boost
- 3.7 routing: explicit anchor + autofix_class matrix; anchor 75 with safe_auto demotes to gated_auto (silent apply reserved for anchor 100)
- 3.8 sort: anchor descending, document order as final tiebreak

Each persona file carries a calibration section mapping its domain criteria to the shared anchors. Product-lens and adversarial explicitly note that their working ceiling is anchor 75 because premise challenges cannot be verified against ground truth; this is the nature of the work, not a calibration problem.

Why the threshold differs from Anthropic's >= 80: doc review has no linter backstop, premise-level concerns resist verification, and the routing menu makes dismissal cheap. Code review's economics are opposite (linter backstop, public PR comments, ground-truth verifiable), so its threshold should stay higher.
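
The gate, promotion, routing, and sort rules described above can be sketched as a small pipeline. This is an illustrative reading of the commit message, not the plugin's actual code: the `Finding` class, its `order` and `personas` fields, the function names, and the `"fyi"` / `"actionable"` tier labels are all hypothetical, and the 3.3 dedup step is omitted. Only the anchor values, the >= 50 threshold, the one-step promotion, and the `safe_auto` to `gated_auto` demotion at anchor 75 come from the text.

```python
from dataclasses import dataclass, replace

ANCHORS = (0, 25, 50, 75, 100)   # the 5-anchor enum from the commit
FILTER_THRESHOLD = 50            # findings below this anchor are dropped


@dataclass(frozen=True)
class Finding:
    title: str
    confidence: int       # must be one of ANCHORS
    autofix_class: str    # "safe_auto" and "gated_auto" are named in the commit
    order: int            # hypothetical: position in the reviewed document
    personas: int = 1     # hypothetical: how many personas raised the finding


def passes_gate(f: Finding) -> bool:
    """3.2 gate: a single anchor-based rule instead of per-severity floats."""
    if f.confidence not in ANCHORS:
        raise ValueError(f"confidence must be one of {ANCHORS}, got {f.confidence}")
    return f.confidence >= FILTER_THRESHOLD


def promote(f: Finding) -> Finding:
    """3.4 cross-persona: move up exactly one anchor step (50 to 75, 75 to 100)."""
    if f.personas > 1 and f.confidence in (50, 75):
        return replace(f, confidence=f.confidence + 25)
    return f


def route(f: Finding) -> tuple[str, str]:
    """3.7 routing: return (tier, effective autofix_class)."""
    if f.confidence == 50:
        return ("fyi", f.autofix_class)
    if f.confidence == 75 and f.autofix_class == "safe_auto":
        # Silent apply is reserved for anchor 100, so demote.
        return ("actionable", "gated_auto")
    return ("actionable", f.autofix_class)


def synthesize(findings: list[Finding]) -> list[tuple[str, int, str, str]]:
    kept = [promote(f) for f in findings if passes_gate(f)]
    # 3.8 sort: anchor descending, document order as the final tiebreak.
    kept.sort(key=lambda f: (-f.confidence, f.order))
    return [(f.title, f.confidence, *route(f)) for f in kept]
```

In this sketch a finding raised by two personas at anchor 50 lands in the actionable tier at 75, and because promotion runs after the gate, a finding at anchor 25 is dropped regardless of how many personas raised it. Whether the real pipeline orders those two steps the same way is an assumption based on the 3.2 / 3.4 numbering.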
ce-code-review migration is deferred. The reasoning is documented at docs/solutions/skill-design/confidence-anchored-scoring-2026-04-21.md so future contributors can see why the threshold diverges.

Tests: 828 pass (2 new contract assertions added for the schema and rubric structure). Fixtures updated; no behavioral regressions.
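
For readers wondering what contract assertions on the schema and rubric structure might look like, here is a minimal sketch. It assumes a JSON-Schema-style dict with a `properties.confidence.enum` entry and a rubric written as hyphen bullets of the form `- <anchor>: <criterion>`; both shapes, and the function names, are hypothetical and not taken from the repository's test suite.

```python
import re

EXPECTED_ANCHORS = [0, 25, 50, 75, 100]


def check_schema(schema: dict) -> list[str]:
    """Return a list of problems with the confidence field (empty if none)."""
    field = schema.get("properties", {}).get("confidence", {})
    if field.get("enum") != EXPECTED_ANCHORS:
        return [f"confidence enum is {field.get('enum')}, expected {EXPECTED_ANCHORS}"]
    return []


def check_rubric(text: str) -> list[str]:
    """Check that the rubric lists every anchor exactly once, in ascending order."""
    # Matches bullet lines such as "- 75: double-checked, will hit in practice".
    found = [int(m) for m in re.findall(r"^\s*-\s*(\d+):", text, flags=re.MULTILINE)]
    if found != EXPECTED_ANCHORS:
        return [f"rubric anchors are {found}, expected {EXPECTED_ANCHORS}"]
    return []
```

Checking the rubric text as well as the schema guards against the two drifting apart, for example if an anchor is added to the enum but never given a behavioral criterion.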