We already gather the numbers for this test by performing two runs of
`Benchmark_O $TEST` with num-samples=2 and iters={2,3}. Under the assumption that
the only difference in counter numbers is caused by that extra iteration,
subtracting the iters=2 counts from the iters=3 counts gives us the number of
counts in a single iteration.
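To make the subtraction concrete, here is a minimal sketch in Python; the parameter names are illustrative, not the actual test code:

```python
# A minimal sketch of the delta computation; `counts_iters2` and `counts_iters3`
# stand for the counter dictionaries gathered from the iters=2 and iters=3 runs
# of `Benchmark_O $TEST`.
def counts_per_iteration(counts_iters2, counts_iters3):
    """Counter cost of one iteration, assuming the extra iteration is the
    only difference between the two runs."""
    return {name: counts_iters3[name] - counts_iters2[name]
            for name in counts_iters3}
```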
In certain cases, I have found that a small subset of the benchmarks produces
weird output, and I haven't had the time to look into why. That being said, I do
know what these weird results look like, so this commit adds some extra
validation to decide whether we need to fail a test due to instability.
The specific validation (sketched below) is:
1. We perform another run with num-samples=2, iters=5 and subtract the iters=3
counts from it. Under the assumption that overall work increases linearly with
iteration count in our benchmarks, we check whether these counts are actually 2x
the single-iteration counts.
2. We check whether either `result[iters=3] - result[iters=2]` or
`result[iters=5] - result[iters=3]` is negative. None of the counters we gather
should ever decrease with iteration count.
Otherwise, one can get results that seem to imply more retain/release traffic
when in reality one was simply not tracking {retain,release}_n calls that, as a
result of better optimization, became plain retain and release calls.
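A minimal sketch of the two checks, assuming the counter dictionaries from the three runs have already been gathered (names are illustrative):

```python
def is_stable(counts_iters2, counts_iters3, counts_iters5):
    """Fail the test if the counter deltas look unstable."""
    for name in counts_iters3:
        per_iter = counts_iters3[name] - counts_iters2[name]     # 1 extra iteration
        per_2iters = counts_iters5[name] - counts_iters3[name]   # 2 extra iterations
        # Check 2: none of the gathered counters should ever decrease
        # with iteration count.
        if per_iter < 0 or per_2iters < 0:
            return False
        # Check 1: work grows linearly with iteration count, so the
        # iters=5 - iters=3 delta should be twice the iters=3 - iters=2 delta.
        if per_2iters != 2 * per_iter:
            return False
    return True
```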
Improve inline headers in `single_table` mode to also print labels for the numeric columns.
Sections in the `single_table` format are visually distinguished by a separator row preceding the inline headers.
Separated header label styles for git and markdown modes with UPPERCASE and **Bold** formatting respectively.
Inlined section template definitions.
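As an illustration of the two label styles (a hedged sketch, not the formatter's actual code; the column labels are made up):

```python
def format_inline_header(labels, markdown):
    """Style the inline header labels per output mode: UPPERCASE for git,
    **Bold** for markdown."""
    styled = ['**{0}**'.format(label) if markdown else label.upper()
              for label in labels]
    return (' | ' if markdown else '  ').join(styled)

# format_inline_header(['Old', 'New', 'Delta'], markdown=False) -> 'OLD  NEW  DELTA'
# format_inline_header(['Old', 'New', 'Delta'], markdown=True)
#                                           -> '**Old** | **New** | **Delta**'
```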
This will let me:
1. Add -Osize support easily.
2. Put all of the binaries in the same directory so that Benchmark_Driver can
work with them via the -tools argument.
Multimodal benchmarks with a significant delta between the modes can report false performance changes when we gather too few independent samples. This increases the minimum number of independent samples from 5 to 10.
Fix for https://bugs.swift.org/browse/SR-9907
Remove the `get_results` function, which is no longer used after the refactoring that rebased the benchmark measurements on the `BenchmarkDriver` class in #21684.
Adds a `create_benchmark` script that automates the following three tasks (sketched below):
1. Add a new Swift file (YourTestNameHere.swift), built according to the template, to the `single-source` directory.
2. Add the filename of the new Swift file to CMakeLists.txt.
3. Edit main.swift: import and register your new Swift module.
The process of adding new benchmarks is now automated and a lot less error-prone.
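A rough sketch of the three steps; the paths, template contents, helper names, and registration call are illustrative only, not the script's actual implementation:

```python
import os

TEMPLATE = '// {name} benchmark (template contents elided)\n'

def append_line(path, line):
    """Append `line` to the file at `path` (the real script keeps lists sorted)."""
    with open(path, 'a') as f:
        f.write(line + '\n')

def create_benchmark(name):
    # 1. Add a new Swift file, built from the template, to the single-source directory.
    with open(os.path.join('single-source', name + '.swift'), 'w') as f:
        f.write(TEMPLATE.format(name=name))
    # 2. Add the filename of the new Swift file to CMakeLists.txt.
    append_line('CMakeLists.txt', '    single-source/{0}.swift'.format(name))
    # 3. Edit main.swift: import and register the new benchmark module.
    append_line('main.swift', 'registerBenchmark({0})'.format(name))
```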
For really small runtimes (< 20 μs) this method of setup overhead detection doesn't work: even a 1 μs change in a 20 μs runtime is 5%. In that case, just return no overhead.
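A minimal sketch of the guard; the names and the surrounding measurement logic are only illustrative:

```python
def setup_overhead(runtime_us, estimated_overhead_us):
    """Return the detected setup overhead, or none for very small runtimes
    where the estimate is within measurement noise."""
    if runtime_us < 20:  # even 1 us of jitter is already 5% of a 20 us runtime
        return 0
    return estimated_overhead_us
```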
Refactored the `test_performance` function to use the existing `BenchmarkDriver` and `TestComparator`.
This replaces the hand-rolled parser and comparison logic with library functions that already have full unit test coverage.
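The shape of the refactored flow, as a hedged sketch; the call signatures, attribute names, and import path are approximations of the `BenchmarkDriver` / `TestComparator` APIs, not verbatim code:

```python
from compare_perf_tests import TestComparator  # assumed import path

def compare(driver_old, driver_new, tests, threshold=0.05):
    # The drivers gather results; TestComparator classifies the changes,
    # replacing any hand-rolled log parsing and comparison logic.
    old = {t: driver_old.run_independent_samples(t) for t in tests}
    new = {t: driver_new.run_independent_samples(t) for t in tests}
    comparison = TestComparator(old, new, threshold)
    return comparison.decreased, comparison.increased  # regressions, improvements
```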
Finished support for running all active tests in one batch, returning a dictionary of `PerformanceTestResults`.
Known test names are passed to the harness in a compressed form, as test numbers.
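A hedged sketch of the batching idea; `run_harness` and `parse_results` stand in for the driver's actual invocation and log-parsing steps and are passed in only to keep the sketch self-contained:

```python
def run_all_in_batch(run_harness, parse_results, tests, test_numbers):
    """Run all active tests in a single harness invocation and return a
    dictionary of test name -> PerformanceTestResult."""
    # Known test names are passed in compressed form, as test numbers.
    args = [str(test_numbers.get(name, name)) for name in tests]
    results = parse_results(run_harness(args))
    return {result.name: result for result in results}
```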
Lowered the default sample cap from 2k to 200. (This doesn't affect a manually specified `--num-samples` argument in the driver.)
Swift benchmarks have a pretty constant performance profile over time. It's more beneficial to get multiple independent measurements faster than to get more samples from the same run.