Multimodal benchmarks with significant delta between the modes can report false performance changes when we gather too few independent samples. This increases the minimal number of independent samples from 5 to 10.
Fix for https://bugs.swift.org/browse/SR-9907
Remove the `get_results` function, which is no longer used after the refactoring that rebased the benchmark measurements on `BenchmarDriver` class in #21684.
Refactored `test_perfomance` function to use existing BenchmarkDriver and TestComparator.
This replaces hand-rolled parser and comparison logic with library functions which already have full unit test coverage.
Now, run_smoke_bench runs the benchmarks, compares performance and code size and reports the results - on stdout and as a markdown file.
No need to run bench_code_size.py and compare_perf_tests.py separately.
This has two benefits:
- It's much easier to run it locally
- It's now more transparent what's happening in '@swiftci benchmark', because now all the logic is in run_smoke_bench rather than in the not visible script on the CI bot.
I also remove the branch-arguments from ReportFormatter in ompare_perf_tests.py. They were not used anyway.
For a smooth rollout in CI, I created a new script rather than changing the existing one. Once everything is setup in CI, I'll delete the old run_smoke_test.py and bench_code_size.py.