Commit Graph

78 Commits

Author SHA1 Message Date
Pavol Vaskovic
691007b029 [benchmark] LogParser: Accept -?! in bench. names
Extend parser to support benchmark names that include `-?!` in names, to fully support the new Naming Convention from PR #20334.
2019-02-19 23:31:58 +01:00
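As a rough illustration of the parser change in 691007b029, a name pattern that tolerates `-`, `?` and `!` might look like the sketch below. The character class, the `NAME` constant and the `is_valid_name` helper are assumptions made for this example; the actual expression used by LogParser may differ.

```python
import re

# Hypothetical pattern: word characters plus '.', '-', '?' and '!'.
NAME = re.compile(r"[\w.\-?!]+")


def is_valid_name(name):
    """Return True when the whole string matches the assumed name pattern."""
    return NAME.fullmatch(name) is not None
```

With this pattern, a name that contains whitespace is still rejected, while the punctuation listed in the commit message is accepted.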
Erik Eckstein
040aa06fec benchmarks: combine everything which is needed into run_smoke_bench
Now, run_smoke_bench runs the benchmarks, compares performance and code size and reports the results - on stdout and as a markdown file.
No need to run bench_code_size.py and compare_perf_tests.py separately.

This has two benefits:
- It's much easier to run it locally
- It's now more transparent what's happening in '@swiftci benchmark', because now all the logic is in run_smoke_bench rather than in the not visible script on the CI bot.

I also removed the branch-arguments from ReportFormatter in compare_perf_tests.py. They were not used anyway.

For a smooth rollout in CI, I created a new script rather than changing the existing one. Once everything is set up in CI, I'll delete the old run_smoke_test.py and bench_code_size.py.
2018-11-01 16:41:39 -07:00
Pavol Vaskovic
897b9ef82e [benchmark] Gardening: Fix linter nitpicks 2018-10-27 06:15:23 +02:00
Pavol Vaskovic
397c44747b [benchmark] Exclude outliers from sample
Use the box-plot inspired technique for filtering out outlier measurements. Values that are higher than the top inner fence (TIF = Q3 + IQR * 1.5) are excluded from the sample.
2018-10-11 19:48:20 +02:00
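The top inner fence rule from 397c44747b can be sketched as follows. This is a minimal illustration, not the code from the commit: the `exclude_top_outliers` name is hypothetical, and the quartiles come from Python's `statistics.quantiles` with the inclusive method, which is an assumption and may not match the quantile estimation used by the benchmark scripts.

```python
import statistics


def exclude_top_outliers(samples):
    """Drop values above the top inner fence, TIF = Q3 + 1.5 * IQR."""
    # Assumption: quartiles estimated with the 'inclusive' method.
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    tif = q3 + 1.5 * (q3 - q1)
    # Only the high side is filtered; low values are kept as they are.
    return [s for s in samples if s <= tif]
```

For the sample `[10, 11, 11, 12, 12, 13, 13, 14, 100]` the quartiles are 11 and 13, so the fence sits at 16 and only the value 100 is removed.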
Pavol Vaskovic
0d318b6464 [benchmark] Discard oversampled quantile values
When num_samples is less than quantile + 1, some of the measurements are repeated in the report summary. Parsed samples should strive to be a true reflection of the measured distribution, so we’ll correct this by discarding the repeated artifacts from quantile estimation.

This avoids introducing a bias from this oversampling into the empirical distribution obtained from merging independent samples.

See also:
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
2018-10-11 18:56:27 +02:00
Pavol Vaskovic
61a092a695 [benchmark] LogParser delta quantiles support
Support for reading delta-encoded quantiles format.
2018-10-11 18:56:27 +02:00
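Delta encoding in general stores the first value as is and every later value as the difference from its predecessor. A decoder for that generic scheme is a running sum, as in the sketch below. The `decode_deltas` helper is hypothetical, and the exact field layout emitted by `Benchmark_O` is not shown in this log, so this only illustrates the idea.

```python
from itertools import accumulate


def decode_deltas(deltas):
    """Turn [first, d1, d2, ...] back into absolute values via a running sum."""
    # Assumption: every field is already an integer; real logs may need parsing.
    return list(accumulate(deltas))
```

For example, `[100, 2, 0, 5]` decodes to `[100, 102, 102, 107]`.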
Pavol Vaskovic
012e07cdd2 [benchmark] LogParser support for quantile format
Gather all samples published in the benchmark summary from the `Benchmark_O --quantile` output format.
2018-10-09 15:52:28 +02:00
Pavol Vaskovic
9ba571f641 [benchmark] Parse yield timings from verbose log 2018-10-09 09:52:14 +02:00
Pavol Vaskovic
0f25849f8c [benchmark] Parse setup time from verbose log 2018-10-09 09:52:06 +02:00
Pavol Vaskovic
78159e1fe3 [benchmark] Fix drop mean and sd on merge 2018-10-09 09:50:45 +02:00
Pavol Vaskovic
f729b8e623 [benchmark] Fix merging max_rss when None 2018-10-06 11:43:56 +02:00
Pavol Vaskovic
a9f0ce4338 [benchmark] Fix quantile estimation type
The correct quantile estimation type for printing all measurements in the summary report while `quantile == num-samples - 1` is R-1, SAS-3. It's the inverse of the empirical distribution function.

References:
* https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
* discussion in https://github.com/apple/swift/pull/19097#issuecomment-421238197
2018-09-20 09:19:07 +02:00
Ben Langmuir
541c48f9e4 Merge pull request #19328 from palimondo/test-twice-commit-once
[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening (take 2)
2018-09-17 11:54:08 -07:00
Pavol Vaskovic
f0e7b8737a [benchmark] Round quantile idx to nearest or even
Explicitly use round-half-to-even rounding algorithm to match the behavior of numpy's quantile(interpolation='nearest') and quantile estimate type R-3, SAS-2. See:
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
2018-09-14 23:40:43 +02:00
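The rounding rule mentioned in f0e7b8737a can be illustrated in Python, whose built-in `round` already rounds halves to the nearest even integer. The `nearest_quantile` helper and the index formula `(n - 1) * q` are assumptions chosen for this sketch; the commit's own index computation is not reproduced here.

```python
def nearest_quantile(sorted_samples, q):
    """Pick the sample nearest to quantile q, breaking ties toward even indices."""
    # Assumption: the fractional index is (n - 1) * q.
    position = (len(sorted_samples) - 1) * q
    # Python 3 round() uses round-half-to-even: 1.5 -> 2, 2.5 -> 2.
    return sorted_samples[round(position)]
```

With four samples the median position is 1.5, which rounds up to index 2; with six samples it is 2.5, which rounds down to index 2.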
Pavol Vaskovic
e48b5fdb34 [benchmark] Fix index computation for quantiles
Turns out that both the old code in `DriverUtils` that computed median, as well as newer quartiles in `PerformanceTestSamples` had an off-by-1 error.

It truly is the 3rd of the 2 hard things in computer science!
2018-09-14 23:40:43 +02:00
Ben Langmuir
423e145b0c Revert "[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening" 2018-09-14 13:24:01 -07:00
Pavol Vaskovic
2ad8bf732a [benchmarks] Rename column label SPEEDUP to RATIO
Since the results comparisons are now used to also compare code sizes in addition to runtimes, it makes sense to rename the column label to the more neutral term “ratio” instead of old “speedup”.
2018-09-13 22:00:52 +02:00
Pavol Vaskovic
a56c55c8e4 [benchmark] Round quantile idx to nearest or even
Explicitly use round-half-to-even rounding algorithm to match the behavior of numpy's quantile(interpolation='nearest') and quantile estimate type R-3, SAS-2. See:
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
2018-09-10 10:45:00 +02:00
Pavol Vaskovic
0db20feda2 [benchmark] Fix index computation for quantiles
Turns out that both the old code in `DriverUtils` that computed median, as well as newer quartiles in `PerformanceTestSamples` had an off-by-1 error.

It truly is the 3rd of the 2 hard things in computer science!
2018-08-31 07:32:10 +02:00
Erik Eckstein
1f32935fc4 benchmarks: fix regexp for parsing code size results
Accept a '.' in the benchmark name which is used for .o and .dylib files
2018-08-27 17:07:40 -07:00
Pavol Vaskovic
049ffb34b0 [benchmark] Fix parsing formatted text
The test number column in the space-justified column format emitted by the Benchmark_Driver to stdout while logging to file is right aligned, so the parser must handle leading whitespace.
2018-08-23 12:31:00 +02:00
Pavol Vaskovic
0d64386b53 [benchmark] Documentation improvements
Improving compliance with
PEP 257 -- Docstring Conventions
https://www.python.org/dev/peps/pep-0257/
2018-08-23 11:45:43 +02:00
Pavol Vaskovic
076415f969 [benchmark] Strangler run_benchmarks
Replaced the guts of the `run_benchmarks` function with the implementation from `BenchmarkDriver`. There was only a single client, which called it with `verbose=True`, so this parameter could be safely removed.

Function `instrument_test` is replaced by running `Benchmark_O` with the `--memory` option, which implements the MAX_RSS measurement while also excluding the overhead from the benchmarking infrastructure. The incorrect computation of standard deviation was simply dropped for measurements of more than one independent sample. Bogus aggregated `Totals` statistics were removed, now reporting only the total number of executed benchmarks.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
e80165f316 [benchmark] Exclude only outliers from the top
Option to exclude the outliers only from top of the range, leaving in the outliers on the min side.
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
27cc77c590 [benchmark] Exclude outliers from samples
Introduce an algorithm for excluding outliers after collecting all samples, using the Interquartile Range rule.

The `exclude_outliers` method uses the 1st and 3rd Quartile to compute the Interquartile Range, then uses inner fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR to remove samples outside these fences.

In experiments that collected hundreds and thousands of samples (`num_samples`) per test with a low iteration count (`num_iters`) and ~1s runtime, this rule proved very effective at improving the quality of the sample population. It removes short environmental fluctuations that were previously averaged into the overall result (by the adaptively determined `num_iters` to run for ~1s), which inflated the reported result with these measurement errors. For some benchmarks, this technique yields more stable results faster than before.

This outlier filtering is employed when parsing `--verbose` test results.
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
91077e3289 [benchmark] Introduced PerformanceTestSamples
* Moved the functionality to compute median, standard deviation and related statistics from `PerformanceTestResult` into `PerformanceTestSamples`.
* Fixed wrong unit in comments
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
bea35cb7c1 [benchmark] LogParser measure environment
Measure more of environment during test

In addition to measuring maximum resident set size, also extract number of voluntary and involuntary context switches from the verbose mode.
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
c60e223a3b [benchmark] LogParser: tab & space delimited logs
Added support for tab delimited and formatted log output (space aligned columns as output to console by Benchmark_Driver).
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
d0cdaee798 [benchmark] LogParser support for --verbose mode
LogParser doesn’t use `csv.reader` anymore.
Parsing is handled by a Finite State Machine. Each line is matched against a set of (mutually exclusive) regular expressions that represent known states. When a match is found, corresponding parsing action is taken.
2018-08-17 00:32:04 +02:00
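The line-by-line approach described in d0cdaee798 can be sketched as a table of regular expressions paired with actions. Everything concrete below is hypothetical: the `TinyLogParser` class, the `Running ...` and `Sample i,value` line shapes, and the result layout are invented for illustration and are not the real `--verbose` log format.

```python
import re


class TinyLogParser:
    """Match each line against known patterns and run the paired action."""

    def __init__(self):
        self.results = {}
        self._current = None
        # Patterns are meant to be mutually exclusive, so the first hit wins.
        self._states = [
            (re.compile(r"^Running (\S+)"), self._start_test),
            (re.compile(r"^\s+Sample (\d+),(\d+)"), self._add_sample),
        ]

    def _start_test(self, match):
        # A new test header switches the parser to collecting its samples.
        self._current = match.group(1)
        self.results[self._current] = []

    def _add_sample(self, match):
        # Samples seen before any test header are ignored.
        if self._current is not None:
            self.results[self._current].append(int(match.group(2)))

    def parse(self, lines):
        for line in lines:
            for pattern, action in self._states:
                match = pattern.match(line)
                if match:
                    action(match)
                    break  # unmatched lines fall through and are skipped
        return self.results
```

Supporting a new kind of log line then amounts to appending one more pattern and action pair, which is the main appeal over a `csv.reader` based parser.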
Pavol Vaskovic
9852e9a32a [benchmark] Extracted LogParser class 2018-08-17 00:32:04 +02:00
Pavol Vaskovic
d079607488 [benchmark] Documentation improvements 2018-08-17 00:32:04 +02:00
Pavol Vaskovic
179b12103f [benchmark] Refactor formatting responsibilities
Moved result formatting methods from `PerformanceTestResult` and `ResultComparison` to `ReportFormatter`, in order to free PTR to take more computational responsibilities in the future.
2018-08-16 17:44:59 +02:00
Erik Eckstein
edc7a0f96c benchmarks: add an option to the compare_perf_tests script to output improvements and regressions in a single table.
Instead of separate tables. Only affects git and markdown output.
2018-08-14 13:38:14 -07:00
Erik Eckstein
45a2ae48ce benchmarks: replace the Ounchecked build with an Osize build
We don't measure Ounchecked anymore. On the other hand we want to benchmark the Osize build.
2017-10-06 14:09:43 -07:00
Pavol Vaskovic
e7b243cad7 Fixed false statement in documentation. 2017-06-04 18:40:20 +02:00
Pavol Vaskovic
dea7d8fe77 Consistent --output; Improved coverage: main()
Coverage at 99% according to coverage.py

* `compare_perf_tests.py` now always outputs the same format to stdout as is written to `--output` file
* Added integration test for the main() function
* Added tests for console output (and suppressed it leaking during testing)
* Fixed file name in test’s file header
2017-06-04 18:31:06 +02:00
Pavol Vaskovic
9265a71ac6 Improved coverage: ReportFormatter
Coverage at 87% according to coverage.py

Also fixed spelling errors in documentation.
2017-06-02 02:28:44 +02:00
Pavol Vaskovic
d178b6e0cd Improved coverage with more tests: parse_args
Coverage at 66% according to coverage.py
2017-06-01 22:19:33 +02:00
Pavol Vaskovic
49ddd96c83 Added documentation and test coverage.
compare_perf_tests.py is now covered with unit tests and public methods are documented in the implementation.

Minor refactoring to better conform to Python conventions:
* classes declared in new style
* proper private method prefix of single underscore
* replacing map with list comprehension where it was clearer

Unit tests are executed as part of validation-test.

.gitignore was modified to ignore .coverage and htmlcov artifacts generated by the coverage.py package
2017-06-01 20:05:40 +02:00
practicalswift
659b415462 [gardening] Fix typo. 2017-05-11 16:04:36 +02:00
practicalswift
49ed8579c4 [gardening] Use American English. 2017-05-09 20:44:30 +02:00
practicalswift
cc6a160d91 [gardening] Remove unused Python property cv 2017-05-04 15:21:45 +02:00
Pavol Vaskovic
c719818024 Fix SR-4601 Report Added and Removed Benchmarks in Performance Comparison (#8991)
* Refactor compare_perf_tests.py

* Fix SR-4601 Report Added and Removed Benchmarks in Performance Comparison

Improved HTML styling.

* Added back support for reading concatenated Benchmark_Driver output

PerformanceTestResults can be merged, computing new MIN, MAX, and running MEAN and SD.

* Handle output from Benchmark_O again

Treat MAX_RSS as optional column
2017-04-26 12:09:18 -07:00
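Keeping MIN, MAX, a running MEAN and SD while results are merged one value at a time can be done with Welford's online algorithm, sketched below. This names a standard technique for illustration; the log does not say which formula c719818024 uses, and the `RunningStats` class is hypothetical.

```python
import math


class RunningStats:
    """Track min, max, mean and sample standard deviation incrementally."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        # Welford's update: no need to keep the individual values around.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def sd(self):
        # Sample standard deviation; defined as 0 for fewer than two values.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
```

Feeding in the values 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and a squared-deviation sum of 32, so the sample standard deviation is the square root of 32/7.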
swift-ci
a67c6d00f5 Merge pull request #8923 from moiseev/reverse-improvements 2017-04-24 17:06:07 -07:00
Max Moiseev
9aa1e61851 Sorting performance by delta instead of ratio
...to avoid problems with rounding.
2017-04-24 14:34:38 -07:00
Andrew Trick
f5410be16b compare_perf_tests.py: fix column header formatting.
Column names must not contain spaces for tools that auto-format the table.
The extra "(%)" was completely redundant since every value in the column
reads as a percentage.
2017-04-22 10:16:17 -07:00
Max Moiseev
c4dc74b9b6 Fixing the crash in compare_perf_tests
Comparisons should only be performed on the intersection of test lists,
otherwise it would crash should a new benchmark be introduced.
2017-04-21 14:40:38 -07:00
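The fix in c4dc74b9b6 boils down to comparing only names present in both runs. A minimal sketch, assuming each run is a plain dict from benchmark name to a timing (the `compare_common` helper and this data layout are hypothetical):

```python
def compare_common(old, new):
    """Return new/old ratios for tests in both runs, plus added and removed names."""
    common = old.keys() & new.keys()
    # Only shared names are looked up, so a missing key can never raise.
    ratios = {name: new[name] / old[name] for name in sorted(common)}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    return ratios, added, removed
```

Returning the added and removed names separately keeps them visible in a report instead of silently dropping them.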
Max Moiseev
d878b45e29 Reverse the order of improvements in the output of compare_perf_tests
Both regressions and improvements are sorted by the delta, which in case
of improvements produces the reversed order due to negative values of
delta.
This change makes the improvements ordered 'naturally':
most-improved-first.
2017-04-21 13:59:02 -07:00
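The ordering described in d878b45e29 can be sketched as below. The definition of delta as `(new - old) / old`, with lower timings being better, is an assumption for this example, as are the `split_and_sort` helper and the dict-based inputs.

```python
def split_and_sort(old, new):
    """Return (regressions, improvements), each ordered by magnitude of change."""
    # Assumption: delta is the relative change in time, negative when faster.
    deltas = {name: (new[name] - old[name]) / old[name] for name in old.keys() & new.keys()}
    # Largest positive delta first: the worst regression leads the list.
    regressions = sorted((n for n, d in deltas.items() if d > 0),
                         key=lambda n: deltas[n], reverse=True)
    # Most negative delta first: the biggest improvement leads the list.
    improvements = sorted((n for n, d in deltas.items() if d < 0),
                          key=lambda n: deltas[n])
    return regressions, improvements
```

Sorting both groups in the same direction would put the smallest improvement first, which is the reversed order the commit message describes.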
Hugh Bellamy
4f23d61da0 Import print_function wherever we use print() in python code 2017-02-20 11:11:27 +07:00
practicalswift
6d1ae2a39c [gardening] 2016 → 2017 2017-01-06 16:41:22 +01:00