40 Commits

Author SHA1 Message Date
Claude cf013a0c8f perf(shared): Raise TASKS_PER_THREAD from 100 to 200 to reduce worker contention
Background
----------
On a typical CLI run (`node bin/repomix.cjs --include 'src,tests' --quiet`,
258 files, 4-vCPU host), the metrics worker pool was sized as
`ceil(258 / 100) = 3 workers`. Combined with the security pool's hard cap
of 2 workers (securityCheck.ts:90) and the main thread, the process held
6 active threads on 4 cores during the overlap of `validateFileSafety`
and `calculateMetrics`.

Each metrics worker independently parses gpt-tokenizer's ~2.2 MB
`o200k_base.js` BPE table on its first task — a ~200-300 ms pure-CPU
operation per worker. Spawning 3 cold metrics workers in the warm-up
phase (calculateMetrics.ts:46-48) therefore drove the security workers
off the CPU during their own (concurrent) cold-start, inflating the
critical-path security phase.

Change
------
Raise `TASKS_PER_THREAD` from 100 to 200 so:

- ≤200 file repos:    1 metrics worker (was 1-2)       — -1 worker for 101-200 files
- 201-400 file repos: 2 metrics workers (was 3-4)      — -1 or -2 workers, the win
- 401-600 file repos: 3 metrics workers (was 4-cap)    — -1 worker
- 601-800 file repos: 4 metrics workers (was 4-cap)    — no change
- 801+ file repos:    4 metrics workers (was 4-cap)    — no change (cap)

For the 258-file benchmark this brings active workers during the
metrics+security overlap to 2 + 2 = 4, matching CPU count, and cuts
the parallel BPE-loading work in the warm-up phase by a third (3 cold
metrics workers down to 2).

Tests for `getWorkerThreadCount` and `createWorkerPool` are updated to
reflect the new ratio.
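
The sizing rule described above can be sketched as follows. This is a minimal illustration, assuming the count is `ceil(numOfTasks / TASKS_PER_THREAD)` clamped between 1 and a cap; `workerCountFor` and its `maxThreads` parameter are hypothetical names, not the actual `getWorkerThreadCount` signature.

```typescript
// Hypothetical sketch of the pool-sizing rule; not the real getWorkerThreadCount.
const TASKS_PER_THREAD_BEFORE = 100;
const TASKS_PER_THREAD_AFTER = 200;

function workerCountFor(numOfTasks: number, tasksPerThread: number, maxThreads: number): number {
  // One worker per `tasksPerThread` tasks, at least 1, never more than the cap.
  const wanted = Math.ceil(numOfTasks / tasksPerThread);
  return Math.max(1, Math.min(maxThreads, wanted));
}

// 258-file benchmark, with the cap of 4 taken from the list above.
const before = workerCountFor(258, TASKS_PER_THREAD_BEFORE, 4); // 3
const after = workerCountFor(258, TASKS_PER_THREAD_AFTER, 4); // 2
console.log({ before, after });
```

Under these assumptions the 1572-file regression workload resolves to 4 workers with either constant, which is why it is expected to be neutral.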

Benchmark
---------
`node bin/repomix.cjs --include 'src,tests' --quiet` (258 files), n=20
paired interleaved (alternating BEFORE-first / AFTER-first ordering):

|        | min     | p25     | median  | p75     | mean    | sd     |
|--------|---------|---------|---------|---------|---------|--------|
| BEFORE | 1045 ms | 1092 ms | 1109 ms | 1122 ms | 1107 ms | 27 ms  |
| AFTER  |  937 ms |  973 ms |  991 ms | 1020 ms |  994 ms | 29 ms  |

Mean paired Δ:   +112.5 ms  (10.17 % wall-clock reduction)
Median paired Δ: +115.4 ms  (10.66 % wall-clock reduction)
Paired-delta SD: 36.2 ms  (paired t = 13.88, p < 0.001)
AFTER faster in 20/20 pairs (100 %)
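
As a sanity check, the paired t-statistic can be recomputed from the rounded summary values above using t = mean(d) / (sd(d) / sqrt(n)). The `pairedT` helper is a hypothetical name used only for this sketch; the original figure was presumably computed from the unrounded per-pair deltas, so a small difference in the last digit is expected.

```typescript
// Paired t-statistic from summary values: t = meanDelta / (sdDelta / sqrt(n)).
function pairedT(meanDelta: number, sdDelta: number, n: number): number {
  const standardError = sdDelta / Math.sqrt(n);
  return meanDelta / standardError;
}

// Rounded inputs from the benchmark summary: mean delta, SD of deltas, number of pairs.
const t = pairedT(112.5, 36.2, 20);
console.log(t.toFixed(1)); // about 13.9, in line with the reported 13.88
```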

Regression check — `node bin/repomix.cjs --quiet` (default, 1572 files),
n=15 paired interleaved:

|        | min     | p25     | median  | p75     | mean    | sd     |
|--------|---------|---------|---------|---------|---------|--------|
| BEFORE | 1933 ms | 1970 ms | 2016 ms | 2102 ms | 2028 ms | 62 ms  |
| AFTER  | 1955 ms | 1966 ms | 2004 ms | 2131 ms | 2034 ms | 74 ms  |

Mean paired Δ:   -6.2 ms (-0.31 %)  (paired t = -0.29, p > 0.05)
Median paired Δ: -12.7 ms (statistically neutral)

No regression on the large workload: both 100 and 200 saturate the
per-CPU cap at 4 workers for repos with more than 600 files, so the
dispatch-time behavior is identical there.

Correctness
-----------
- 1256 / 1256 unit tests pass.
- `npm run lint` clean (only pre-existing warnings unrelated to this change).
- No behavioral change to file processing, tokenization, security checks,
  or output. Pool sizing is the only effect.
2026-05-07 01:11:11 +00:00
Kazuki Yamada f67731056a test: Round-3 PR review feedback
- validateFileSafety: pin the negative path of `if (config.security.enableSecurityCheck)`
  — every other test enabled the check, so a regression that always runs
  the security check would have passed silently.
- unifiedWorker:
  - Add a positive workerData=securityCheck + ambiguous-task case so the
    pair (override + this) distinguishes "inference always wins" from
    "inference wins only when it yields a value".
  - Stop pretending the handler-cache test verifies caching. Both branches
    of `if (cached) return cached;` end with the same Map.set, and Node's
    own module cache makes the dynamic import effectively free, so the
    cache is unobservable from outside without exposing internals.
    Renamed to "repeated calls" with a comment explaining the limitation.
- fileSystemReadDirectoryTool: translate the pre-existing Japanese comment
  to English per CLAUDE.md.
- TokenCounter: extract `LoadEncodingFn` type alias instead of the
  unusual `typeof loadEncoding`, so a signature drift between the local
  function and the deps field would surface at the type level.
2026-04-26 22:47:21 +09:00
Kazuki Yamada e5f7a1f311 fix(shared): Address PR review feedback
- shared/errorHandle: recognize duck-typed OperationCancelledError from
  worker boundaries in isRepomixError (it extends RepomixError but the
  name was missing from the structured-clone fallback comparison).
  Add a regression test for the worker-boundary case.

Test improvements per coderabbit / claude review:
- cliReport: assert skill-directory + relative path on the same log line.
- processConcurrency: restore process.versions.bun by removing the property
  when it didn't originally exist, instead of leaving it defined-as-undefined.
- logger: drop the no-op `process.env.REPOMIX_LOG_LEVEL = undefined` (it
  coerces to the string "undefined" and is overwritten by the next delete).
- unifiedWorker: replace the tautological cache test with one that proves
  cache uniqueness via onWorkerTermination cleanup count; add a test for
  task-based inference overriding workerData (bundled-env reuse).
- calculateMetricsWorker: new direct test for the default export's items
  vs. single-mode dispatch — unifiedWorker mocks this module so the branch
  was otherwise untested.
- packRemoteRepositoryTool: hard-code the expected output path instead of
  expect.any(String) to catch arg-swap regressions.
- memoryUtils: tighten getMemoryStats assertions with sanity bounds
  (heapUsed <= heapTotal, rss > 0, heapUsagePercent <= 100) so a
  unit-conversion regression (bytes vs MB) would fail the test.
2026-04-26 22:20:42 +09:00
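
The worker-boundary case fixed in e5f7a1f311 can be illustrated with a small sketch. An error that is serialized across a worker boundary arrives as a plain object, so `instanceof` no longer matches and the check has to fall back to the `name` field. The class bodies and the exact list of accepted names below are simplified assumptions, not the code in shared/errorHandle.

```typescript
// Hypothetical, simplified versions of the error classes named in the commit.
class RepomixError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'RepomixError';
  }
}

class OperationCancelledError extends RepomixError {
  constructor(message = 'Operation cancelled') {
    super(message);
    this.name = 'OperationCancelledError';
  }
}

function isRepomixError(error: unknown): boolean {
  if (error instanceof RepomixError) return true;
  // After crossing a worker boundary the prototype chain may be lost,
  // so fall back to comparing the `name` field (duck typing).
  if (typeof error !== 'object' || error === null) return false;
  const name = (error as { name?: unknown }).name;
  return name === RepomixError.name || name === OperationCancelledError.name;
}

// A plain object standing in for an error that was serialized by a worker.
const fromWorker = { name: 'OperationCancelledError', message: 'Operation cancelled' };
console.log(isRepomixError(fromWorker), isRepomixError(new Error('x'))); // true false
```

Using the class names rather than hardcoded strings follows the approach described in 681e377361.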
Kazuki Yamada 9aac452504 test: Raise overall coverage from 87.9% to 90.1%
Cover previously-untested paths across the shared, cli, core, and mcp
layers, focusing on branches that represent real user-facing behavior
rather than line-coverage chasing.

Highlights:
- shared/errorHandle: cover handleError (RepomixError, unexpected Error,
  unknown values, duck-typed worker errors, debug-level branches) and
  the three error class constructors.
- shared/logger: cover setLogLevelByWorkerData for env-var, workerData
  (array and object shapes), and invalid/missing inputs.
- shared/memoryUtils: add a fresh test file covering stats, log helpers,
  and withMemoryLogging success/error paths.
- shared/processConcurrency: cover cleanupWorkerPool (Node, Bun-skip,
  swallowed teardown errors) and the run/cleanup delegation.
- shared/unifiedWorker: cover the cache-hit path and the workerData
  (array/object) and REPOMIX_WORKER_TYPE detection branches.
- core/metrics/TokenCounter: cover the catch branch (Error,
  non-Error throws, with/without filePath).
- core/file/fileManipulate: cover removeEmptyLines on inherited base
  and composite manipulators.
- cli/cliReport: cover skill-directory and split-output summary lines.
- mcp/tools/packRemoteRepositoryTool: add tests mirroring the
  packCodebaseTool pattern (success, runCli failure, runCli throw,
  workspace creation failure).
- mcp/tools/fileSystemReadDirectoryTool: switch to mocking
  node:fs/promises so existing mocks actually intercept calls, and
  cover the file-vs-dir, listing, empty-directory, and readdir-error
  paths.

Result:
- Statements 87.29% -> 89.51%
- Branches   76.16% -> 79.31%
- Functions  87.60% -> 89.37%
- Lines      87.89% -> 90.06%
2026-04-26 19:28:09 +09:00
Kazuki Yamada c2059ff90f refactor(shared): Address PR review nitpicks on asyncMap
intent(asyncMap): tighten doc and test based on coderabbitai review feedback
decision(asyncMap-doc): explicitly note that workers are not cooperatively cancelled after a rejection — sibling workers keep claiming indices
decision(asyncMap-test): replace timing-sensitive `peakActive > 1` with exact `=== 4` — workers spawn synchronously via Promise.all so the cap is hit deterministically

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:53:44 +09:00
Kazuki Yamada 3ddc446143 refactor(core): Bound concurrency for empty-directory readdir checks
intent(empty-dir-check): protect very large repos from FD exhaustion that unbounded Promise.all could trigger
rejected(p-limit): user wants to keep dependencies minimal — built a small in-tree helper instead
decision(asyncMap): single mapWithConcurrency helper rather than a p-limit-style limiter object — only call site is array map
decision(concurrency-limit): 20 in flight — well above libuv default thread pool (4) while still bounded for users who tuned UV_THREADPOOL_SIZE

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:41:38 +09:00
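
A bounded-concurrency map along the lines described in 3ddc446143 might look like the sketch below. It assumes a fixed set of async workers that each claim the next unclaimed index; the signature is hypothetical and the in-tree helper may differ. As noted in c2059ff90f, nothing here cancels sibling workers after a rejection: `Promise.all` rejects, but the remaining workers keep claiming indices.

```typescript
// Hypothetical sketch of a bounded-concurrency map; the real helper may differ.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker claims the next unclaimed index until the list is exhausted.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  };

  // Workers start synchronously, so min(limit, items.length) calls are in flight at once.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage: results keep input order regardless of completion order.
mapWithConcurrency([1, 2, 3, 4, 5], 2, async (n) => n * 2).then((doubled) => {
  console.log(doubled); // [2, 4, 6, 8, 10]
});
```

A single function is enough here because the only call site is an array map, which matches the decision recorded in the commit.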
Kazuki Yamada 85b0c7a7fb fix(shared): Clean up root-level validation error formatting
intent(readability): an issue with no `path` (e.g., root-level schema mismatch) previously rendered as `[] message`; emit just `message` when segments are empty. Small quality-of-life for error output
intent(limitation-pin): add an integration test that documents the known ESM unwrap ambiguity — a CJS module shaped like `{ default: {...}, otherKey: ... }` has `otherKey` silently dropped by our heuristic. Non-issue for RepomixConfig today, but worth freezing so the behavior can't drift without someone noticing
2026-04-18 23:23:53 +09:00
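
The path formatting described in 85b0c7a7fb and 7c2e8791a6 can be sketched as below. The `IssueLike` shape and the `formatIssue` and `segmentToString` names are assumptions made for illustration; the real logic lives in `rethrowValidationErrorIfSchemaError` and may differ in detail.

```typescript
// Hypothetical sketch of the issue formatting; names and shapes are assumptions.
type PathItem = string | number | { key?: unknown };

interface IssueLike {
  message: string;
  path?: readonly PathItem[];
}

function segmentToString(segment: PathItem): string {
  if (typeof segment === 'string' || typeof segment === 'number') return String(segment);
  const key = segment.key;
  // Unknown object-shaped items become '' rather than "[object Object]".
  return typeof key === 'string' || typeof key === 'number' ? String(key) : '';
}

function formatIssue(issue: IssueLike): string {
  // Drop empty segments so a malformed item cannot produce "[output..style]".
  const segments = (issue.path ?? []).map(segmentToString).filter((s) => s.length > 0);
  // Root-level issues have no path: emit just the message instead of "[] message".
  return segments.length === 0 ? issue.message : `[${segments.join('.')}] ${issue.message}`;
}

console.log(formatIssue({ message: 'Invalid type', path: [{ key: 'output' }, {}, { key: 'style' }] }));
// [output.style] Invalid type
console.log(formatIssue({ message: 'Expected object' }));
// Expected object
</gr_insert_placeholder>
```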
Kazuki Yamada 7c2e8791a6 fix(config): Address latest PR review feedback
intent(interop-consolidation): drop `interopDefault: true` from the jiti setup in configLoad — the explicit ESM namespace unwrap at the call site already handles every module-format case we test (.ts / .mts / .js / .mjs / .cjs). Having both the jiti flag and the manual unwrap was redundant and made the intent fuzzier
intent(error-path-cleanliness): filter out empty path segments before joining in rethrowValidationErrorIfSchemaError — a malformed ValiError item (object without `key`) would otherwise produce `[output..style]`; dropping the empty entry keeps the path readable. Added a dedicated test covering the filter
2026-04-18 17:19:33 +09:00
Kazuki Yamada 1e8a5e5e1e fix(config): Address PR review feedback
intent(error-handle): drop the instanceof Error guard in rethrowValidationErrorIfSchemaError so ValiError / ZodError round-tripped through a worker (plain { name, message, issues }) is still recognized — aligns with isError / isRepomixError elsewhere in the file
intent(schema-parity): restore splitOutput's upper bound (Number.MAX_SAFE_INTEGER) in the generated JSON schema so editor hints match the previous zod output; also strip the empty required:[] arrays that @valibot/to-json-schema emits on every object node
intent(esm-unwrap): only unwrap jiti's .default when it's an object, preserving a CJS config that legitimately exports { default: 'plain', ...rest }; plain Symbol.toStringTag === 'Module' was too narrow — jiti returns non-Module namespace wrappers for .ts / .mts files
intent(test-coverage): add tests/shared/errorHandle.test.ts covering the Zod + Valibot + worker-serialized paths through rethrowValidationErrorIfSchemaError; tighten the default-schema assertion in configSchema.test.ts from /expected|invalid/i to toThrow(v.ValiError) + targeted message pattern
decision(path-segment-fallback): return '' for unknown object-shaped path items instead of falling into String(segment) → "[object Object]"; defensive, non-breaking
2026-04-18 15:47:11 +09:00
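
The esm-unwrap rule in 1e8a5e5e1e can be sketched as a small function. `unwrapDefault` is a hypothetical name, and the sketch assumes the only condition is "unwrap `.default` when it is a non-null object"; the actual call site in configLoad may apply further checks.

```typescript
// Hypothetical sketch of the unwrap heuristic; the real call site may differ.
function unwrapDefault(loaded: unknown): unknown {
  if (typeof loaded !== 'object' || loaded === null) return loaded;
  const candidate = (loaded as { default?: unknown }).default;
  // Only unwrap when `default` is itself an object; a CJS export such as
  // { default: 'plain', ...rest } is returned untouched.
  if (typeof candidate === 'object' && candidate !== null) return candidate;
  return loaded;
}

const esmLike = { default: { output: { style: 'xml' } } };
const cjsLike = { default: 'plain', include: ['src'] };

console.log(unwrapDefault(esmLike) === esmLike.default); // true
console.log(unwrapDefault(cjsLike) === cjsLike); // true
```

The same heuristic explains the limitation pinned by the integration test in 85b0c7a7fb: for a module shaped like `{ default: {...}, otherKey: ... }`, the inner object is returned and `otherKey` is dropped.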
Kazuki Yamada 2a879425a0 perf(core): Reduce worker thread contention for faster pipeline execution
Add maxWorkerThreads option to WorkerOptions for explicit thread count
capping, then use it to reduce CPU contention when metrics and security
worker pools run concurrently during the pipeline overlap phase.

- Metrics pool: capped at (processConcurrency - 1)
- Security pool: capped at floor(processConcurrency / 2)

On a 4-core machine this reduces concurrent threads from 8 (4+4) to 5
(3+2), avoiding context-switching overhead during gpt-tokenizer warmup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:48:48 +09:00
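
The two caps from 2a879425a0 can be worked through with a short sketch. `poolCaps` is a hypothetical helper, and the floor of one worker per pool is an assumption that the commit message does not state.

```typescript
// Hypothetical helper; the minimum of 1 worker per pool is an assumption.
function poolCaps(processConcurrency: number): { metrics: number; security: number } {
  return {
    // Metrics pool: capped at (processConcurrency - 1).
    metrics: Math.max(1, processConcurrency - 1),
    // Security pool: capped at floor(processConcurrency / 2).
    security: Math.max(1, Math.floor(processConcurrency / 2)),
  };
}

const caps = poolCaps(4);
console.log(caps.metrics, caps.security, caps.metrics + caps.security); // 3 2 5
```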
Kazuki Yamada 1c99f9617b perf(security): Batch security check tasks to reduce IPC overhead 2026-04-03 23:42:52 +09:00
Kazuki Yamada b1980710d4 refactor(cli): Remove unused defaultActionWorker and its references
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:18:06 +09:00
Kazuki Yamada 05f11f46c7 refactor(core): Remove unused fileCollect worker infrastructure
File collection was replaced with a promise pool approach in 96ff05dc,
but the worker-related code remained. This removes the now-unused
fileCollectWorker and all references to it from the worker system.
2026-02-17 23:09:18 +09:00
Kazuki Yamada a94ce0f2ff fix(tests): Update test mocks for vitest v4 compatibility
Vitest v4 changed how vi.fn() and vi.mock() work with class constructors.
Arrow functions in mockImplementation no longer work as constructors
when called with 'new' keyword.

Changes:
- Use regular function syntax instead of arrow functions for constructor mocks
- Use vi.hoisted() to define class mocks that can be used in vi.mock() factories
- Replace vi.fn().mockReturnValue() with vi.fn().mockImplementation() for class mocks
- Update mock instance retrieval to use vi.mocked().mock.results[0].value
2026-01-03 16:28:31 +09:00
Kazuki Yamada f79232f81f test(worker): Add unified worker tests and unify async signatures
- Add comprehensive tests for unifiedWorker.ts covering task inference
  and worker termination cleanup
- Unify onWorkerTermination to async signature across all worker files
  for consistency (fileCollect, securityCheck, calculateMetrics)
2026-01-01 00:41:28 +09:00
Kazuki Yamada f8fedf0c87 refactor(worker): Remove debug logging and unused exports
Remove code that was added for debugging during development:
- Remove unused isTinypoolWorker function from unifiedWorker.ts
- Remove REPOMIX_DEBUG_WORKER logging from unifiedWorker.ts
- Remove debug logging from defaultActionWorker.ts
- Remove unused getUnifiedWorkerPath export
- Update tests to use workerType instead of workerPath
2025-12-31 22:38:55 +09:00
Kazuki Yamada 5715b2e58e feat(worker): Add unified worker entry point for bundling support
Add a unified worker entry point that enables full bundling support by
allowing bundled files to spawn workers using themselves. This is a
prerequisite for bundling the website server to improve Cloud Run cold
start times.

Changes:
- Add src/shared/unifiedWorker.ts as single entry point for all workers
- Support both worker_threads and child_process runtimes
- Add REPOMIX_WORKER_TYPE env var for child_process worker type detection
- Add REPOMIX_WORKER_PATH env var for bundled environment worker path
- Add REPOMIX_WASM_DIR env var for WASM file location override
- Update processConcurrency.ts to use unified worker path
- Add debug logging (REPOMIX_DEBUG_WORKER=1) for worker troubleshooting
- Export unified worker handler from main index.ts

Note: This is work in progress. There's a known issue with child_process
runtime where nested worker pools (created inside a worker) may receive
incorrect REPOMIX_WORKER_TYPE environment variable, causing task routing
issues. Investigation ongoing.
2025-12-31 22:05:06 +09:00
Kazuki Yamada dd7717bb8d feat(shared): Support decimal values in size parsing
Allow decimal size values like '2.5mb' or '1.5kb' in parseHumanSizeToBytes.
This enables more flexible size configuration for split output.
2025-12-21 21:56:39 +09:00
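
Decimal size parsing of the kind described in dd7717bb8d could be sketched as follows. `parseSizeSketch` is a hypothetical stand-in for `parseHumanSizeToBytes`; the accepted units (kb and mb, as mentioned for `--split-output`), the rounding rule, and the error messages are all assumptions.

```typescript
// Hypothetical sketch; accepted units, rounding, and messages are assumptions.
const UNIT_BYTES: Record<string, number> = { kb: 1024, mb: 1024 * 1024 };

function parseSizeSketch(input: string): number {
  // An integer or decimal number followed directly by a unit, e.g. "2.5mb".
  const match = /^(\d+(?:\.\d+)?)(kb|mb)$/i.exec(input.trim());
  if (!match) {
    throw new Error(`Invalid size: '${input}'`);
  }
  const value = Number(match[1]);
  const bytes = Math.floor(value * UNIT_BYTES[match[2].toLowerCase()]);
  // Reject zero and values that overflow the safe integer range.
  if (bytes <= 0 || !Number.isSafeInteger(bytes)) {
    throw new Error(`Size out of range: '${input}'`);
  }
  return bytes;
}

console.log(parseSizeSketch('2.5mb')); // 2621440
console.log(parseSizeSketch('1.5kb')); // 1536
```

Flooring to whole bytes keeps the result an integer even when the decimal value does not divide evenly.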
Kazuki Yamada 222723043a fix(cli): Address additional PR review comments
- Add test for sizeParse overflow case
- Use RepomixProgressCallback type in outputSplit.ts for consistency
- Improve configuration.md description for splitOutput option
2025-12-21 21:56:39 +09:00
Dango233 e51d77a7c6 feat(cli): Add --split-output option
Adds a size-based output splitter via --split-output (kb/mb) and writes numbered parts without splitting within a top-level folder.

Also updates metrics aggregation for multi-part output and adds unit tests.
2025-12-21 21:56:39 +09:00
Kazuki Yamada 681e377361 refactor(shared): improve error handling and cleanup code
- Use class names for RepomixError type checking instead of hardcoded strings
- Remove unused RepomixError import from fileProcess.ts
- Simplify comments in errorHandle.ts and fileProcess.ts
- Clean up constructor-based error checking logic
2025-09-17 00:59:58 +09:00
Kazuki Yamada 09b31e3e7a refactor(core): Prioritize safety with conservative worker runtime selection
Adjust worker runtime configuration to use child_process for all potentially risky operations, prioritizing stability and isolation over performance.

- Change token-related workers to child_process for better memory isolation:
  - calculateGitDiffMetrics: child_process (was worker_threads)
  - calculateGitLogMetrics: child_process (was worker_threads)
  - calculateOutputMetrics: child_process (was worker_threads)
  - calculateSelectiveFileMetrics: child_process (was worker_threads)
- Keep file collection and globby operations as worker_threads (lower risk)
- Remove redundant memory leak risk comments for cleaner code
- Fix test cases to include required runtime parameter and teardown property
- Reorder imports in languageParser.ts for consistency

This conservative approach ensures maximum stability by isolating all token counting operations in separate processes, preventing potential memory leaks from affecting the main process.
2025-09-01 22:39:24 +09:00
Kazuki Yamada 25d65dfe7c refactor(core): Consolidate worker pool arguments into WorkerOptions interface
- Add WorkerOptions interface to combine numOfTasks, workerPath, and optional runtime
- Update createWorkerPool and initTaskRunner functions to accept WorkerOptions object
- Refactor all usage sites across file processing, metrics, and security modules
- Update corresponding test cases to use new interface

This improves type safety and makes the API more maintainable by avoiding parameter order mistakes.
2025-08-31 16:24:38 +09:00
Kazuki Yamada 8f07b63a61 feat(core): Add runtime selection support for worker pools
Add WorkerRuntime type and configurable runtime parameter to createWorkerPool and initTaskRunner functions. This allows choosing between 'worker_threads' and 'child_process' runtimes based on performance requirements.

- Add WorkerRuntime type definition for type safety
- Add optional runtime parameter to createWorkerPool with child_process default
- Add optional runtime parameter to initTaskRunner with child_process default
- Configure fileCollectWorker to use worker_threads for better performance
- Update all test files to use WorkerRuntime type
- Add comprehensive tests for runtime parameter functionality
- Maintain backward compatibility with existing code

The fileCollectWorker now benefits from worker_threads faster startup and shared memory, while other workers continue using child_process for stability.
2025-08-31 16:18:12 +09:00
Kazuki Yamada 7cfe658dca feat(metrics): Create dedicated git diff worker for improved memory efficiency
- Extract git diff token calculation into separate worker and dedicated module
- Parallelize git diff metrics calculation with other metrics computations using Promise.all
- Isolate TokenCounter usage for git diffs within child process worker to prevent memory leaks
- Add comprehensive worker cleanup with exit handler for proper resource management
- Update tests to reflect new worker-based architecture and remove direct TokenCounter mocking

Memory improvements:
- Git diff token calculation now runs in isolated child process
- Enables parallel execution of all three metrics calculations (files, output, git diff)
- Further reduces main process memory footprint by isolating heavy TokenCounter operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-27 23:53:06 +09:00
Kazuki Yamada 748ce7cead refactor(shared): Consolidate initTaskRunner implementations
Add generic initTaskRunner function to processConcurrency.ts to eliminate
duplicate initialization logic across multiple modules. This reduces code
duplication and provides consistent worker pool management with proper
type safety through generic parameters.

- Add TaskRunner<T, R> interface and initTaskRunner function
- Remove duplicate createTaskRunner wrappers from 5 modules
- Update all deps parameters to use shared initTaskRunner directly
- Maintain type safety with explicit generic type parameters
- Update corresponding test mocks to match new signature

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-24 23:47:07 +09:00
Kazuki Yamada 6eb8f27fd5 refactor(core): Replace environment variable with workerData for worker log level
Replace the environment variable approach for passing log levels to workers with Tinypool's workerData mechanism, which is more idiomatic for worker thread configuration.

Changes:
- Add setLogLevelByWorkerData() method to handle workerData-based log level setting
- Update Tinypool configuration to use workerData instead of env variables
- Update all 5 worker files to use setLogLevelByWorkerData()
- Remove unused setLogLevelByEnv function and related test mocks
- Update tests to reflect new workerData configuration

This provides better isolation and follows Node.js worker thread best practices.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-23 00:05:52 +09:00
Kazuki Yamada ae68e51a05 fix(core): optimize worker thread allocation
- Set TASKS_PER_THREAD to 100 for better balance between performance and resource usage
- Add comment explaining that worker initialization is expensive
- Update tests to match new thread allocation logic

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-23 00:05:51 +09:00
Kazuki Yamada c1edfff082 Rename initTinypool to initWorker for better semantic clarity
Updated all references throughout the codebase:
- Import statements in 5 core modules
- Function calls in file processing, metrics, and security modules
- Test mocks and descriptions
- Maintained backward compatibility and functionality

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-05 00:27:34 +09:00
Kazuki Yamada a2c5a15838 feat: migrate from Piscina to Tinypool for worker thread management
Replace Piscina with Tinypool to significantly reduce bundle size (800KB → 38KB) while maintaining full API compatibility and performance. This migration affects all worker thread pools used in file processing, security checks, and metrics calculations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-05 00:23:56 +09:00
Devin AI 59b1bfe70d Remove unnecessary unknown type casting in processConcurrency.test.ts
Co-Authored-By: Kazuki Yamada <koukun0120@gmail.com>
2025-05-12 14:32:02 +00:00
Kazuki Yamada 2f2035eab3 test(core): fix tests for logger loglevel sync in worker processes 2025-04-20 15:01:20 +09:00
pranshugupta01 ed13f1f0ff lint fix 2025-04-07 02:12:15 +05:30
pranshugupta01 6d7e4edc38 test(patternUtils): add unit tests for splitPatterns function 2025-04-07 01:52:47 +05:30
yamadashy 3f8dff6694 test: Increase coverage of some tests 2025-03-02 21:45:12 +09:00
Yamada Dev 0b84b97675 feat(cli): Add quiet mode option 2025-02-11 18:02:03 +09:00
Yamada Dev c04554e59f feat(cli): Control verbose as a log level 2025-02-11 17:48:52 +09:00
Kazuki Yamada 6c9a149eb5 feat(pack): Simplify various processes 2025-01-25 13:55:19 +09:00
Kazuki Yamada c51435e24a refact: Migrate from ESLint and Prettier to Biome 2024-08-26 23:54:23 +09:00
Kazuki Yamada 88dd0c8eb2 refact: Refactor logging and improve test coverage 2024-08-12 16:41:37 +09:00