Commit Graph

6 Commits

Author SHA1 Message Date
Kazuki Yamada 54c6a3d238 fix(website): Address claude third-pass review on siteverify metric
Six items from claude's incremental review (`12:48:43Z`):

- monitoring/dashboard.json: Group the outcomes widget by both
  `metric.label.outcome` and `metric.label.reason`. Previously all
  failures collapsed into a single `turnstile_failed` series, which
  contradicted the README claim that the `reason` label drives the
  breakdown.
- monitoring/metrics/*.yaml: Narrow the metric filter to
  `jsonPayload.event=("turnstile_siteverify" OR "pack_completed")`.
  Without this anchor, any future code path attaching
  `siteverifyDurationMs` to an unrelated log silently joins the
  distribution and creates new metric label values.
- usePackRequest.ts: Mirror `progressMessage.value = null` alongside
  the `progressStage.value = null` clear on token-acquisition aborted /
  error branches. Prevents a future edit setting a verifying message
  from leaking prior-run state.
- turnstile.test.ts: Add a focused `describe` block with five tests
  asserting `siteverifyDurationMs` is attached to every post-siteverify
  log (one success path + four reject branches). The metric YAML
  filters on field presence, so a refactor that drops the field on any
  branch would silently break the metric without other tests failing.
  Uses the existing `vi.spyOn(logger, ...)` pattern; no clock injection
  needed.
- monitoring/README.md: Note that the metric filter pins
  `service_name="repomix-server-us"`, so future regions (`-eu`,
  `-asia`) silently drop out until the filter is broadened or
  per-region counterparts applied.
- monitoring/README.md: Add a `gcloud logging metrics describe` snippet
  for verifying a YAML edit was actually applied (gcloud update is
  silent on no-op vs effective change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:59:53 +09:00
Kazuki Yamada fa06e5059c fix(website): Address PR review feedback on siteverify metric
Six items from gemini, claude initial review, and claude follow-up:

- turnstile.ts: Update misleading comment that claimed the metric filters
  on `event=turnstile_siteverify` and `outcome=success`. The actual
  Cloud Monitoring metrics in `monitoring/metrics/` filter on
  `siteverifyDurationMs` field presence, which uniformly captures both
  the parallel success log (event=turnstile_siteverify) and the four
  rejectAndLog failure paths (event=pack_completed). The comment
  contradicted README and YAML and would mislead future readers.
- turnstile.ts: Wrap rejectAndLog in a local `rejectWithDuration` helper
  so every post-siteverify branch automatically carries
  `siteverifyDurationMs`. Prevents drift if a fifth reject reason gets
  added later.
- client.ts: Split the wire-protocol `PackProgressStage` (server-emitted
  SSE values) from the display-only `DisplayProgressStage` superset that
  adds `verifying`. Keeping the synthetic stage out of the wire type
  prevents silent divergence with the server's `PackProgressStage`.
- usePackRequest.ts, TryItLoading.vue, TryItResult.vue: Switch the
  display-side type to `DisplayProgressStage`. `onProgress` callbacks
  still take the wire `PackProgressStage`.
- usePackRequest.ts: Clear `progressStage` on token-acquisition failure
  branches (aborted / error). Functionally invisible since loading=false
  hides the loading UI, but prevents the next submit's verifying flash
  from briefly showing the previous run's stale state.
- monitoring/metrics/turnstile_siteverify_duration.yaml: Retune the
  exponential bucket layout for the 100ms-1s SLO band where decisions
  get made. Doubling buckets only placed ~3 boundaries between 100ms
  and 1s; growthFactor=1.5 with scale=10 places ~8 boundaries there.
  18 finite buckets cover 10ms to ~9.85s, comfortably above the 5s
  siteverify timeout so timeouts don't land in overflow.
- monitoring/README.md: Document that pre-network rejections
  (secret_missing, missing_token, token_too_long) intentionally don't
  carry siteverifyDurationMs, so they're excluded from both metrics
  but still appear in the existing pack_requests metric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:42:21 +09:00
Kazuki Yamada 76522fc13e chore(monitoring): Check in Turnstile siteverify metric definitions
Move the metric setup from prose-only README instructions to checked-in
YAML files under `monitoring/metrics/` so the dashboard, the metrics it
depends on, and the apply commands all live next to each other.

- `turnstile_siteverify_duration.yaml`: distribution metric on
  `jsonPayload.siteverifyDurationMs`, exponential buckets 1ms-32s.
- `turnstile_siteverify_outcomes.yaml`: counter metric with `outcome`
  and `reason` labels for the success-vs-failure breakdown widget.

README updated with the gcloud commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:24:48 +09:00
Kazuki Yamada aea2401bfc chore(monitoring): Add Turnstile siteverify dashboard widgets
Wire up two new tiles to surface the `siteverifyDurationMs` field added
in the previous commit:

- "Turnstile siteverify latency P50 / P95 / P99" — line chart with a
  1s threshold marker so a steady regression jumps off the chart.
- "Turnstile siteverify outcomes (by outcome)" — stacked area
  breaking down success vs turnstile_failed counts over time.

Both depend on log-based metrics `turnstile_siteverify_duration`
(distribution) and `turnstile_siteverify_outcomes` (counter) that need
to be created once in the GCP Console — README documents the filter,
field, and label extractors so the setup is reproducible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:21:47 +09:00
Kazuki Yamada c95034bafc chore(monitoring): Keep dashboard only, drop metric YAMLs
intent(scope-reduction): this repo operates a single GCP project — IaC
  for log-based metrics adds maintenance cost (Console drift, stale
  YAMLs) without payoff. Metrics are recreated ad-hoc from the Console
  when needed
decision(keep-dashboard): the dashboard definition is still worth
  committing because it's the user-facing layout, easy to damage by
  accidental Console edit, and trivial to restore via
  `gcloud monitoring dashboards create --config-from-file=dashboard.json`
decision(readme-slim): strip README to just the restore command —
  longer prose belonged to the earlier IaC-lite layout that's now gone

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:50:32 +09:00
Kazuki Yamada bd8e069440 chore(monitoring): Add log-based metric definitions and apply script
intent(observability): make the new pack_completed / source / durationMs
  log fields queryable as time series in Cloud Monitoring without
  re-running ad-hoc gcloud logging read pipelines during incidents
decision(iac-lite): store metric definitions as standalone YAMLs under
  website/server/monitoring/metrics/ + an idempotent bash apply script,
  rather than Terraform. The project has no existing IaC for monitoring
  and adding a full Terraform setup would dwarf the actual change
decision(labels): constrain pack_requests labels to
  outcome/source/input_type/cached/format (max 96 time series). Omitted
  repoHost because its cardinality is unbounded — raw value remains in
  the log jsonPayload for ad-hoc queries
decision(oom-split): separate oom_terminations and container_killed
  counters. They fire on distinct textPayloads from Cloud Run's memory
  enforcement paths; keeping them separate makes alert tuning easier
rejected(repoHost-as-label): too many unique hostnames — would create a
  time series per repo host and cost scale with traffic diversity

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:38:52 +09:00