mirror of
https://github.com/yamadashy/repomix.git
synced 2026-05-30 11:18:53 +02:00
68a47b9149
When `tokenCountTree` is enabled `calculateSelectiveFileMetrics` already
tokenizes every file individually on the primary worker pool. The original
`calculateOutputMetrics` then re-tokenized the full output a second time, split
into 200 KB chunks, to compute `totalTokens`. On large repos with the tree
display enabled, this second pass was the single longest task in the
`calculateMetrics` `Promise.all`, consuming roughly 1 second of worker time
that duplicated work already done for the per-file counts.
This change introduces a fast path for the common case (xml / markdown / plain
output, non-parsable, single-part): walk the generated output with
`indexOf(file.content, cursor)` once per file to splice file contents out of
the output, tokenize only the remaining "wrapper" (template boilerplate +
directory tree + git diff/log + per-file headers), and compute
`totalTokens = Σ per-file tokens + wrapper tokens`.
The accuracy delta versus the old 200 KB-chunk approach is bounded by BPE
merges across file↔wrapper boundaries; on the repomix repository itself the
measured error was 309 / 1,284,067 tokens ≈ 0.024 %, comparable to the chunk
boundary error the existing approach already accepts.
## Implementation
- `src/core/metrics/calculateMetrics.ts`
- Add `extractOutputWrapper(output, processedFilesInOutputOrder)` which
walks the output with a single forward cursor. Returns `null` and
triggers a fall back to `calculateOutputMetrics` if any file content is
not found (e.g., template escaped it, output was split, order mismatch).
- Add `canUseFastOutputTokenPath(config)` gate: only enabled when
`tokenCountTree` is truthy, `splitOutput` is undefined, `parsableStyle`
is false, and the style is `xml` / `markdown` / `plain`. JSON output
and parsable XML go through `JSON.stringify` / `fast-xml-builder` which
escape file contents, so `indexOf(content)` would miss them.
- In `calculateMetrics`, when the fast path is available and wrapper
extraction succeeds, replace `outputMetricsPromise` with a promise that
awaits the already-running `selectiveFileMetricsPromise`, sums the
per-file token counts, and dispatches a single `runTokenCount` on the
extracted wrapper string. The rest of the `Promise.all` is unchanged.
- `src/core/packager.ts`
- Call `sortOutputFiles(filteredProcessedFiles, config)` once in `pack`
immediately after suspicious-file filtering and use its result as
`processedFiles` downstream (for `produceOutput`, `calculateMetrics`,
and the final result object). `generateOutput` internally calls
`sortOutputFiles` as well, which is stable and memoized via
`fileChangeCountsCache`, so the two now share the single git-log
subprocess result and consumers see files in the exact order they
appear in the output. This is a precondition for the fast path's
forward-walk extraction.
- Expose `sortOutputFiles` on `defaultDeps` so existing packager unit
tests can inject their own implementation.
- `tests/core/packager/diffsFunctionality.test.ts`
- Extend the `gitRepositoryHandle.js` `vi.mock` to also stub
`isGitInstalled` and `getFileChangeCount`, since `sortOutputFiles`
resolves its default dependencies from that module at module load time.
All 1102 existing tests pass unchanged; lint is clean.
## Benchmark
Interleaved 30-run benchmark against the repomix repo itself (1018 files,
~4 MB xml output, `tokenCountTree: 50000`, `sortByChanges: true`, `includeDiffs`
and `includeLogs` enabled via the repo's own `repomix.config.json`):
base median: 2735.2 ms [2389 - 3528] IQR=367 ms
opt median: 2373.6 ms [2125 - 2653] IQR=293 ms
delta: -361.6 ms (-13.22%)
Verbose trace before/after (single run, representative):
before:
Selective metrics calculation completed in 639 ms
Output token count completed in 1046 ms
Calculate Metrics wall: 1296 ms
after:
Selective metrics calculation completed in 579 ms
Fast-path output tokens: files=1017293, wrapper=33678 (126996 chars)
Calculate Metrics wall: ~580 ms
The savings are concentrated in the `calculateMetrics` phase, which was the
dominant critical path in the final `Promise.all` for tokenCountTree runs on
large repos.