Improve inline headers in `single_table` mode to also print labels for the numeric columns.
Sections in the `single_table` mode are visually distinguished by a separator row preceding the inline headers.
Separated the header label styles for git and markdown modes, using UPPERCASE and **Bold** formatting respectively.
Inlined section template definitions.
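For illustration, the two header styles might render roughly like this (the column labels below are hypothetical, not the actual template definitions):

```python
# Hypothetical header rows, for illustration only:
GIT_STYLE = 'REGRESSION             OLD      NEW    DELTA    RATIO'
MARKDOWN_STYLE = '| **Regression** | **OLD** | **NEW** | **DELTA** | **RATIO** |'
```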
Now, run_smoke_bench runs the benchmarks, compares performance and code size, and reports the results on stdout and as a markdown file.
No need to run bench_code_size.py and compare_perf_tests.py separately.
This has two benefits:
- It's much easier to run it locally
- It's now more transparent what's happening in '@swiftci benchmark', because all the logic now lives in run_smoke_bench rather than in a script on the CI bot that is not visible from the repository.
I also removed the branch arguments from ReportFormatter in compare_perf_tests.py; they were not used anyway.
For a smooth rollout in CI, I created a new script rather than changing the existing one. Once everything is set up in CI, I'll delete the old run_smoke_test.py and bench_code_size.py.
Use the box-plot-inspired technique for filtering out outlier measurements: values higher than the top inner fence (TIF = Q3 + 1.5 * IQR) are excluded from the sample.
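A minimal sketch of this rule, assuming a simplified nearest-index quartile estimate (the function name is mine, not the script's):

```python
def discard_high_outliers(samples):
    """Drop values above the top inner fence, TIF = Q3 + 1.5 * IQR."""
    s = sorted(samples)
    n = len(s)
    q1, q3 = s[(n - 1) // 4], s[3 * (n - 1) // 4]
    tif = q3 + 1.5 * (q3 - q1)
    return [x for x in s if x <= tif]
```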
When num_samples is less than quantile + 1, some of the measurements are repeated in the report summary. Parsed samples should strive to be a true reflection of the measured distribution, so we’ll correct this by discarding the repeated artifacts from quantile estimation.
This avoids introducing a bias from this oversampling into the empirical distribution obtained from merging independent samples.
See also:
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
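To picture the correction, here is an illustrative sketch under the nearest-rank quantile definition (not the parser's actual code): map each reported quantile back to the index of the sample it was estimated from, and keep only one value per index.

```python
import math

def deduplicate_quantiles(quantile_values, num_samples):
    """Keep one value per underlying sample when a report contains
    more quantile estimates than there were measurements."""
    q = len(quantile_values) - 1  # number of inter-quantile intervals
    samples, seen = [], set()
    for i, value in enumerate(quantile_values):
        # nearest-rank index of the sample behind this estimate
        j = 0 if i == 0 else int(math.ceil(num_samples * i / float(q))) - 1
        if j not in seen:
            seen.add(j)
            samples.append(value)
    return samples
```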
It turns out that both the old median computation in `DriverUtils` and the newer quartile computation in `PerformanceTestSamples` had an off-by-one error.
It truly is the 3rd of the 2 hard things in computer science!
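For the record, a worked example of the zero-based indexing that bites here (not the fixed code itself):

```python
def median(samples):
    s = sorted(samples)
    n, mid = len(s), len(s) // 2
    # For even n the median averages indices mid - 1 and mid;
    # the off-by-one version averages mid and mid + 1 instead.
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

assert median([1, 2, 3]) == 2
assert median([1, 2, 3, 4]) == 2.5  # the buggy version returns 3.5
```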
Since the result comparisons are now also used to compare code sizes in addition to runtimes, it makes sense to rename the column label from the old “speedup” to the more neutral “ratio”.
The test number column in the space-justified column format emitted by the Benchmark_Driver to stdout (while logging to a file) is right-aligned, so the parser must handle leading whitespace.
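For example (the pattern and benchmark name below are simplified stand-ins), the match has to tolerate the padding in front of the test number:

```python
import re

# Simplified stand-in for the real pattern: allow leading spaces
# before the right-aligned test number.
RESULT = re.compile(r'^ *(\d+) +(\S+)')

m = RESULT.match('  42 SomeBenchmark 1000 ...')
assert m and m.group(1) == '42'
```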
Replaced the guts of the `run_benchmarks` function with the implementation from `BenchmarkDriver`. There was only a single client that called it with `verbose=True`, so this parameter could be safely removed.
The `instrument_test` function is replaced by running `Benchmark_O` with the `--memory` option, which implements the MAX_RSS measurement while also excluding the overhead of the benchmarking infrastructure. The incorrect computation of standard deviation across measurements of more than one independent sample was simply dropped. The bogus aggregated `Totals` statistics were removed; the report now contains only the total number of executed benchmarks.
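Conceptually (the binary path and benchmark name are placeholders), the measurement now amounts to something like:

```python
import subprocess

# Placeholder invocation: the benchmark binary reports MAX_RSS itself
# when run with `--memory`, keeping harness overhead out of the number.
output = subprocess.check_output(
    ['./Benchmark_O', 'SomeBenchmark', '--memory'],
    universal_newlines=True)
```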
Introduce an algorithm for excluding outliers after collecting all samples, using the interquartile range rule.
The `exclude_outliers` method uses the 1st and 3rd quartiles to compute the interquartile range (IQR), then uses the inner fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR to remove samples that fall outside them.
Based on experiments collecting hundreds to thousands of samples (`num_samples`) per test with a low iteration count (`num_iters`) and ~1 s runtime, this rule is very effective at improving the quality of the sample population. It removes the short environmental fluctuations that were previously averaged into the overall result (by the adaptively determined `num_iters` targeting ~1 s runs), inflating the reported result with these measurement errors. This technique can be used for some benchmarks to get more stable results faster than before.
This outlier filtering is employed when parsing `--verbose` test results.
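In essence (the signature here is assumed, not the actual method):

```python
def exclude_outliers(samples, q1, q3):
    """Remove samples outside the inner fences
    [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [s for s in samples if lo <= s <= hi]
```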
* Moved the functionality to compute median, standard deviation and related statistics from `PerformanceTestResult` into `PerformanceTestSamples`.
* Fixed wrong unit in comments
Measure more of the environment during tests
In addition to measuring maximum resident set size, also extract the number of voluntary and involuntary context switches from the verbose mode.
LogParser doesn’t use `csv.reader` anymore.
Parsing is handled by a finite state machine. Each line is matched against a set of mutually exclusive regular expressions that represent the known states. When a match is found, the corresponding parsing action is taken.
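The shape of the idea, in a heavily simplified sketch (states, patterns, and fields here are illustrative; the real LogParser has more of each):

```python
import re

class LineParser(object):
    """Toy version of a regex-driven line parser."""

    def __init__(self):
        self.results, self.max_rss = [], None
        # Mutually exclusive patterns, one per known line shape.
        self.state_actions = [
            (re.compile(r'^ *(\d+) +(\S+) +(\d+)'), self.parse_result),
            (re.compile(r'^ *MAX_RSS +(\d+)'), self.parse_max_rss),
        ]

    def parse_result(self, match):
        num, name, runtime = match.groups()
        self.results.append((int(num), name, int(runtime)))

    def parse_max_rss(self, match):
        self.max_rss = int(match.group(1))

    def parse(self, lines):
        for line in lines:
            for pattern, action in self.state_actions:
                match = pattern.match(line)
                if match:
                    action(match)
                    break  # at most one state matches a given line
        return self.results
```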
Moved the result formatting methods from `PerformanceTestResult` and `ResultComparison` to `ReportFormatter`, in order to free `PerformanceTestResult` to take on more computational responsibilities in the future.
Coverage at 99% according to coverage.py
* `compare_perf_tests.py` now always writes the same format to stdout as to the `--output` file
* Added integration test for the main() function
* Added tests for console output (and suppressed it leaking during testing)
* Fixed file name in test’s file header