Commit Graph

253 Commits

Author SHA1 Message Date
Pavol Vaskovic
84bf15836d [benchmark] Doctor explicitly measures memory
Small fix following the last refactoring of MAX_RSS: the `--memory` option is required to measure memory in `--verbose` mode. Added an integration test for the `check` command of `Benchmark_Driver` that depended on it.
2018-09-06 18:21:50 +02:00
Pavol Vaskovic
be39c02001 [benchmark] Refactor numIters computation
The spaghetti if-else code was untangled into a nested function that computes `iterationsPerSampleTime` and a single constant `numIters` expression that takes care of the overflow capping as well as the choice between a fixed and a computed `numIters` value.

The `numIters` is now computed and logged only once per benchmark measurement instead of on every sample.

The sampling loop is now just a single line. Hurrah!

Modified the test to verify that the `LogParser` maintains `num-iters` derived from the `Measuring with scale` message across samples.
2018-08-31 17:17:48 +02:00
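
The refactoring above lives in the Swift harness; purely as an illustration of the shape it describes (a nested helper plus a single capped expression), here is a rough Python sketch with hypothetical names and constants:

````python
# Rough sketch only (the real code is Swift in the harness): num_iters is
# computed once per benchmark, either fixed or derived from a calibration run,
# and capped so it cannot overflow the harness's iteration counter.
def compute_num_iters(fixed_num_iters, sample_time, elapsed, calibration_iters,
                      max_iters=2 ** 31 - 1):
    def iterations_per_sample_time():
        # Scale the calibration run so one sample takes ~sample_time seconds.
        time_per_iter = elapsed / float(calibration_iters)
        return max(int(sample_time / time_per_iter), 1)

    # Single expression: a fixed value wins when given; the result is capped.
    return min(fixed_num_iters or iterations_per_sample_time(), max_iters)

print(compute_num_iters(0, sample_time=1.0, elapsed=0.02, calibration_iters=1))
````

The point is the one the commit makes: the decision happens once per benchmark, so the sampling loop itself can stay a single line.
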
Pavol Vaskovic
0db20feda2 [benchmark] Fix index computation for quantiles
It turns out that both the old code in `DriverUtils` that computed the median and the newer quartiles in `PerformanceTestSamples` had an off-by-one error.

It truly is the 3rd of the 2 hard things in computer science!
2018-08-31 07:32:10 +02:00
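
As an aside on the off-by-one: an index-based quantile over sorted samples must scale by `len(samples) - 1`, not `len(samples)`. A hedged Python sketch of the idea (not the actual `DriverUtils` or `PerformanceTestSamples` code):

````python
# Illustrative only: nearest-index quantile over sorted samples. The classic
# off-by-one is scaling q by len(samples) instead of len(samples) - 1.
def quantile(sorted_samples, q):
    index = int(round(q * (len(sorted_samples) - 1)))
    return sorted_samples[index]

samples = sorted([10, 12, 11, 13, 50])
q1, median, q3 = (quantile(samples, q) for q in (0.25, 0.5, 0.75))
print(q1, median, q3)  # 11 12 13
````
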
Erik Eckstein
1f32935fc4 benchmarks: fix regexp for parsing code size results
Accept a '.' in the benchmark name, which is used for .o and .dylib files.
2018-08-27 17:07:40 -07:00
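
The actual pattern lives in the code-size script; as a purely hypothetical illustration of the fix (column layout and names assumed, not taken from the script):

````python
import re

# Hypothetical example: the name group accepts '.', so entries like "Foo.o"
# or "libswiftCore.dylib" parse; with r'\w+' alone the match stops at the dot.
SIZE_LINE = re.compile(r'^\s*(?P<name>[\w.]+)\s+(?P<size>\d+)\s*$')

m = SIZE_LINE.match("  libswiftCore.dylib  123456")
print(m.group('name'), int(m.group('size')))  # libswiftCore.dylib 123456
````
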
Pavol Vaskovic
13c499339b [benchmark] Class descriptions in module docstring 2018-08-23 19:59:15 +02:00
Pavol Vaskovic
49e8e692fb [benchmark] Strangle run and run_benchmarks
Moved all `run` command related functionality to `BenchmarkDriver`.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
6bddcbe9e4 [benchmark] Refactor run_benchmarks log format 2018-08-23 18:01:46 +02:00
Pavol Vaskovic
7ae5d7754c [benchmark] Report totals as a sentence
Clean up after removing the bogus aggregate statistics from the last line of the log. It makes more sense to report the total number of executed benchmarks as a sentence than to try to fit it into the format of the preceding table.

Added a test assertion that `run_benchmarks` returns a CSV-formatted log, as it is used to write the log to a file in `log_results`.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
ef1461ca46 [benchmark] Strangle log_results
Moved `log_results` to BenchmarkDriver.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
a10b6070dd [benchmark] Refactor log_results
Added tests for `log_results` and the *space-justified-columns* format emitted to stdout while logging to a file.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
1d3fa87fdd [benchmark] Strangle log_results -> log_file
Moved the `log_file` path construction to the `BenchmarkDriver`.
Retired `get_*_git_*` functions.
2018-08-23 18:01:41 +02:00
Pavol Vaskovic
049ffb34b0 [benchmark] Fix parsing formatted text
The test number column in the space-justified column format emitted by the Benchmark_Driver to stdout while logging to a file is right-aligned, so the parser must handle leading whitespace.
2018-08-23 12:31:00 +02:00
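
A tiny sketch of the concern above (illustrative data, not the driver's real column layout):

````python
# The first column is right-aligned, so the parser must tolerate leading
# whitespace; str.split() does, while matching the line against r'^\d+' does not.
line = "  34 AngryPhonebook     12045"
test_number = int(line.split()[0])
print(test_number)  # 34
````
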
Pavol Vaskovic
0d64386b53 [benchmark] Documentation improvements
Improving compliance with
PEP 257 -- Docstring Conventions
https://www.python.org/dev/peps/pep-0257/
2018-08-23 11:45:43 +02:00
Pavol Vaskovic
f38e6df914 [benchmark] Doctor verifies constant memory use
This needs to be finished with a function that approximates the normal range based on the memory used.
2018-08-20 16:52:07 +02:00
Pavol Vaskovic
06061976da [benchmark] BenchmarkDoctor checks setup overhead
Detect setup overhead in a benchmark and report it if it exceeds 5%.
2018-08-17 08:50:04 +02:00
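
The commit does not spell out the detection method here; purely as a hypothetical sketch, one way to estimate setup overhead from samples taken at different iteration counts:

````python
# Hypothetical sketch, not necessarily BenchmarkDoctor's actual check:
# with setup cost s and per-iteration cost t, a 1-iteration sample takes
# roughly s + t and a 2-iteration sample s + 2t, so s ~ 2*t1 - t2.
def setup_overhead_ratio(time_1_iter, time_2_iters):
    setup = max(2 * time_1_iter - time_2_iters, 0)
    return float(setup) / time_1_iter if time_1_iter else 0.0

ratio = setup_overhead_ratio(1200, 2300)  # microseconds
if ratio > 0.05:
    print("setup overhead is {0:.0%} of the sample time".format(ratio))
````
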
Pavol Vaskovic
7725c0096e [benchmark] Measure and analyze benchmark runtimes
`BenchmarkDoctor` measures benchmark execution (using `BenchmarkDriver`) and verifies that the runtimes stay under 2500 microseconds.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
ab16999e20 [benchmark] Created BenchmarkDoctor (naming)
`BenchmarkDoctor` analyzes performance tests and reports their conformance to a set of desired criteria. The first two rules verify the naming convention.

`BenchmarkDoctor` is invoked from `Benchmark_Driver` with the `check` argument.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
076415f969 [benchmark] Strangler run_benchmarks
Replaced the guts of the `run_benchmarks` function with the implementation from `BenchmarkDriver`. There was only a single client, which called it with `verbose=True`, so this parameter could be safely removed.

The `instrument_test` function is replaced by running `Benchmark_O` with the `--memory` option, which implements the MAX_RSS measurement while also excluding the overhead of the benchmarking infrastructure. The incorrect computation of standard deviation for measurements of more than one independent sample was simply dropped. The bogus aggregated `Totals` statistics were removed; only the total number of executed benchmarks is reported now.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
a84db83062 [benchmark] BenchmarkDriver can run tests
The `run` method on `BenchmarkDriver` invokes the test harness with a specified number of iterations and samples. It supports measuring memory use, and in verbose mode it also collects individual samples and monitors the system load by counting the number of voluntary and involuntary context switches.

Output is parsed using `LogParser` from `compare_perf_tests.py`. This makes that file a required dependency for the driver, so it is also copied to the bin directory during the build.
2018-08-17 08:39:50 +02:00
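
A sketch of the kind of invocation the commit describes; the flag names (`--num-samples`, `--num-iters`, `--memory`, `--verbose`) are taken from this log, but treat the exact interface as an assumption:

````python
import subprocess

# Illustrative sketch of driving the harness binary and handing the raw output
# to the log parser; error handling and parsing are omitted.
def run_benchmark(test, num_samples=1, num_iters=1, measure_memory=False,
                  verbose=False, binary='./Benchmark_O'):
    cmd = [binary, test,
           '--num-samples={0}'.format(num_samples),
           '--num-iters={0}'.format(num_iters)]
    if measure_memory:
        cmd.append('--memory')
    if verbose:
        cmd.append('--verbose')
    return subprocess.check_output(cmd)  # raw text for LogParser
````
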
Pavol Vaskovic
e80165f316 [benchmark] Exclude only outliers from the top
Option to exclude outliers only from the top of the range, leaving in the outliers on the min side.
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
27cc77c590 [benchmark] Exclude outliers from samples
Introduce an algorithm for excluding outliers, using the Interquartile Range rule after all samples have been collected.

The `exclude_outliers` method uses the 1st and 3rd quartiles to compute the interquartile range, then uses inner fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR to remove samples outside this range.

Based on experiments with collecting hundreds and thousands of samples (`num_samples`) per test with a low iteration count (`num_iters`) and ~1 s runtime, this rule is very effective at improving the quality of the sample population: it removes short environmental fluctuations that were previously averaged into the overall result (via the adaptively determined `num_iters` targeting ~1 s runs), inflating the reported result with these measurement errors. This technique can be used for some benchmarks to get more stable results faster than before.

This outlier filtering is employed when parsing `--verbose` test results.
2018-08-17 08:39:50 +02:00
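
A minimal, self-contained sketch of the inner-fence rule described above; the `top_only` flag mirrors the follow-up “Exclude only outliers from the top” commit (the real `exclude_outliers` may differ in details such as quartile interpolation):

````python
# Interquartile Range rule: drop samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR];
# with top_only=True only the upper fence is applied, keeping min-side outliers.
def exclude_outliers(samples, top_only=False):
    s = sorted(samples)

    def quantile(q):  # simple nearest-index quantile, illustrative only
        return s[int(round(q * (len(s) - 1)))]

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in s if x <= hi and (top_only or x >= lo)]

print(exclude_outliers([10, 11, 10, 12, 11, 10, 95]))  # the 95 spike is dropped
````
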
Pavol Vaskovic
91077e3289 [benchmark] Introduced PerformanceTestSamples
* Moved the functionality to compute the median, standard deviation and related statistics from `PerformanceTestResult` into `PerformanceTestSamples`.
* Fixed a wrong unit in comments.
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
bea35cb7c1 [benchmark] LogParser measure environment
Measure more of the environment during the test

In addition to measuring the maximum resident set size, also extract the number of voluntary and involuntary context switches from the verbose mode output.
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
c60e223a3b [benchmark] LogParser: tab & space delimited logs
Added support for tab-delimited and formatted log output (space-aligned columns, as output to the console by Benchmark_Driver).
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
d0cdaee798 [benchmark] LogParser support for --verbose mode
LogParser doesn’t use `csv.reader` anymore.
Parsing is handled by a finite state machine: each line is matched against a set of (mutually exclusive) regular expressions that represent known states, and when a match is found, the corresponding parsing action is taken.
2018-08-17 00:32:04 +02:00
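
The description above amounts to a table of patterns and actions; a toy sketch of the dispatch idea (column layout and patterns are made up, not the real LogParser grammar):

````python
import re

# Toy regex-dispatch parser: each known line shape gets a pattern, the first
# match decides the action, and unknown lines are ignored.
class TinyLogParser(object):
    def __init__(self):
        self.results = []
        self.grammar = [
            (re.compile(r'^\s*(\d+),([\w.]+),(\d+),(\d+)'), self._result),
            (re.compile(r'^Totals'), self._totals),
        ]

    def _result(self, match):
        num, name, samples, min_us = match.groups()
        self.results.append((int(num), name, int(samples), int(min_us)))

    def _totals(self, match):
        pass  # summary line carries nothing the parser needs

    def parse(self, text):
        for line in text.splitlines():
            for pattern, action in self.grammar:
                match = pattern.match(line)
                if match:
                    action(match)
                    break
        return self.results

print(TinyLogParser().parse("2,AngryPhonebook,20,10664\nTotals,1\n"))
````
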
Pavol Vaskovic
9852e9a32a [benchmark] Extracted LogParser class 2018-08-17 00:32:04 +02:00
Pavol Vaskovic
d079607488 [benchmark] Documentation improvements 2018-08-17 00:32:04 +02:00
Pavol Vaskovic
ce39b12929 [benchmark] Strangler: BenchmarkDriver get_tests
See https://www.martinfowler.com/bliki/StranglerApplication.html for more info on the used pattern for refactoring legacy applications.

Introduced the `BenchmarkDriver` class as the beginning of a strangler application that will gradually replace the old functions. Used it instead of the `get_tests()` function in Benchmark_Driver.

The interaction with Benchmark_O is simulated through mocking. `SubprocessMock` class records the invocations of command line processes and responds with canned replies in the format of Benchmark_O output.

Removed 3 redundant lit tests that are now covered by the unit test `test_gets_list_of_all_benchmarks_when_benchmarks_args_exist`. This saves 3 seconds of test execution. Keeping only a single integration test that verifies that the plumbing is connected correctly.
2018-08-17 00:32:04 +02:00
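
An illustrative stand-in for the `SubprocessMock` idea (the real class in the test suite may differ): record each command-line invocation and answer with a canned reply shaped like `Benchmark_O` output.

````python
# Records every invocation and returns canned output keyed by the command tuple,
# so tests can both drive the code under test and assert on the calls it made.
class SubprocessMock(object):
    def __init__(self):
        self.calls = []
        self.responses = {}

    def expect(self, args, response):
        self.responses[tuple(args)] = response

    def check_output(self, args, **kwargs):
        self.calls.append(tuple(args))
        return self.responses.get(tuple(args), '')

mock = SubprocessMock()
mock.expect(['./Benchmark_O', '--list'],
            '#,Test,[Tags]\n2,AngryPhonebook,[String, api, validation]\n')
# Inject mock.check_output where the driver would call subprocess.check_output,
# exercise the code, then assert on mock.calls.
````
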
Pavol Vaskovic
69d5d5e732 [benchmark] Adding tests for BenchmarkDriver
The imports are a bit sketchy because the script doesn’t have a `.py` extension, so they had to be hacked manually. :-/

Extracted `parse_args` from `main` and added test coverage for argument parsing.
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
0b990a82a5 [benchmark] Extracted test_utils.py
Moving the `captured_output` function to its own file.

Adding homegrown unit testing helper classes `Stub` and `Mock`.

The issue is that the unittest.mock was added in Python 3.3 and we need to run on Python 2.7. `Stub` and `Mock` were organically developed as minimal implementations to support the common testing patterns used on the original branch, but since I’m rewriting the commit history to provide an easily digestible narrative, it makes sense to introduce them here in one step as a custom unit testing library.
2018-08-16 20:08:34 +02:00
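
In the same spirit, a minimal sketch of what such homegrown helpers can look like (illustrative only, not the contents of `test_utils.py`):

````python
# A Stub just returns canned attributes; a Mock additionally records calls so
# tests can assert on them. Both run fine on Python 2.7, unlike unittest.mock.
class Stub(object):
    def __init__(self, **canned):
        for name, value in canned.items():
            setattr(self, name, value)

class Mock(object):
    def __init__(self, responses=None):
        self.calls = []
        self.responses = responses or {}

    def record_and_respond(self, *args):
        self.calls.append(args)
        return self.responses.get(args)

    def assert_called(self, *args):
        assert args in self.calls, '{0} was not called'.format(args)
````
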
Pavol Vaskovic
343f284227 [benchmark] Removed legacy submit command
Also removed unused imports.
2018-08-16 20:08:34 +02:00
Pavol Vaskovic
179b12103f [benchmark] Refactor formatting responsibilities
Moved the result formatting methods from `PerformanceTestResult` and `ResultComparison` to `ReportFormatter`, in order to free `PerformanceTestResult` to take on more computational responsibilities in the future.
2018-08-16 17:44:59 +02:00
Erik Eckstein
edc7a0f96c benchmarks: add an option to the compare_perf_tests script to output improvements and regressions in a single table.
Instead of in separate tables. This only affects the git and markdown output.
2018-08-14 13:38:14 -07:00
Erik Eckstein
b4a61d7155 benchmarks: add an option to the bench_code_size script to separately report code size for the swift libraries 2018-08-14 13:38:14 -07:00
Erik Eckstein
c1ff9bcedc benchmarks: add a script to report code size of benchmark files 2018-08-07 13:16:24 -07:00
Erik Eckstein
f82da4113b benchmarks: minor comment fix 2018-08-07 13:15:23 -07:00
Erik Eckstein
89d07e5998 benchmarks: add a script to do a very fast check of which benchmarks changed.
Takes ~1 min to compare two benchmark builds.
2018-08-03 10:17:53 -07:00
Pavol Vaskovic
7f894268b2 [benchmark] Restore running benchmarks by numbers
Reintroduced feature lost during `BenchmarkInfo` modernization: All registered benchmarks are ordered alphabetically and assigned an index. This number can be used as a shortcut to invoke the test instead of its full name. (Adding and removing tests from the suite will naturally reassign the indices, but they are stable for a given build.)

The `--list` parameter now prints the test *number*, *name* and *tags*, separated by a delimiter.

The `--list` output format is modified from:
````
Enabled Tests,Tags
AngryPhonebook,[String, api, validation]
...
````
to this:
````
\#,Test,[Tags]
2,AngryPhonebook,[String, api, validation]
…
````
(There isn’t a backslash before the #; git was eating the whole line without it.)
Note: Test number 1 is Ackermann, which is marked as “skip”, so it’s not listed with the default `skip-tags` value.

Fixes the issue where running tests via `Benchmark_Driver` always reported each test as number 1. Each test is run independently, so every invocation was “first”. Restoring test numbers resolves this issue back to the original state: the number reported in the first column when executing a test is its ordinal number in the Swift Benchmark Suite.
2018-07-11 23:17:02 +02:00
Pavol Vaskovic
d82c996669 [benchmark] Fixed Benchmark_Driver running tests
Fixed a failure in `get_tests`, which depended on the removed `Benchmark_O --run-all` option for listing all tests (not just the pre-commit set).

Fix: Restored the ability to run tests by ordinal number from `Benchmark_Driver` after the support for this was removed from `Benchmark_O`.

Added tests that verify the output format of `Benchmark_O --list` and the support for the `--skip-tags=` option, which effectively replaced the old `--run-all` option. Other tools, like `Benchmark_Driver`, depend on it.
Added integration tests for the dependency between `Benchmark_Driver` and `Benchmark_O`.

Running the pre-commit test set isn’t tested explicitly here. It would take too long, and it is run fairly frequently by the CI bots, so if that breaks, we’ll know soon enough.
2018-07-11 23:17:02 +02:00
Michael Gottesman
36e0e6949c [benchmark] Add cmake support for compiling the benchmarks standalone on Linux.
To use this, one needs to first build an installable root for Swift (i.e. like
the smoke test bot does). Then use the tool ./benchmark/scripts/build_linux.py
with the appropriate locations of the build directory and the installable snapshot,
and it will build the benchmarks. (There are more arguments; just use --help.)

rdar://40541972
2018-05-29 14:25:16 -07:00
Michael Gottesman
d7fc0e1170 [benchmark] Enable users of Benchmark_QuickCheck to select a specific opt level to run from the command line. 2018-03-25 10:46:13 -07:00
Karoy Lorentey
11515c1676 [benchmark] Driver: stabilize Dictionary/Set benchmarks
Disable the random hash seed while benchmarking. By its nature, it makes the number of hash collisions fluctuate between runs, adding unnecessary noise to benchmark results.

I expect we'll be able to re-enable random seeding here once we have made hash collisions cheaper -- they are currently always resolved by calling the Key's Equatable implementation, which can be expensive.
2018-03-19 17:38:29 +00:00
Michael Gottesman
30dd85d386 [benchmark] Add a new utility script called Benchmark_QuickCheck.
This benchmark script is similar to the guard malloc/runtime runner, but it only
runs the tests. The intention is that one can use it to quickly verify, in a
multithreaded way, that all benchmarks run successfully. In contrast, the normal
driver runs only single-threaded, since it is meant to test performance, and so
cannot take advantage of all the cores on a system.

I wrote this quickly to verify that some benchmark tests still worked. No point in
not sharing it with everyone else.
2018-03-17 11:39:52 -07:00
Michael Gottesman
70251a3ff8 Add a 10 minute timeout to individual tests.
On the bots, we have a 60-minute timeout without output for the entire test run.
This should ensure that we are able to kill misbehaving tests and give a good
error instead of just getting a Jenkins timeout error.

For those confused, this is for the guard malloc/leaks test.

rdar://36874229
2018-01-29 09:33:11 -08:00
Arnold Schwaighofer
a921bfa044 Benchmark_RuntimeLeaksRunner: don't use the non-existent --run-all flag
rdar://34149935
2017-10-13 07:44:46 -07:00
Erik Eckstein
45a2ae48ce benchmarks: replace the Ounchecked build with an Osize build
We don't measure Ounchecked anymore. On the other hand, we want to benchmark the Osize build.
2017-10-06 14:09:43 -07:00
Michael Gottesman
96bc70d6ad [benchmark] Update perf_test_driver for benchmark driver updates.
This ensures that Benchmark_GuardMalloc, Benchmark_RuntimeLeaksRunner, etc. all
support the new way benchmark --list outputs benchmark names.
2017-10-02 22:06:47 -07:00
Michael Gottesman
dfc780a744 [benchmark][driver] Teach the Benchmark_Driver how to parse ./Benchmark_O{,none} --list now that tags are output as well. 2017-09-27 19:14:52 -07:00
Pavol Vaskovic
9c51a48917 Fix: Run benchmarks just once 2017-06-27 18:47:35 +02:00
Luke Larson
6944c20e13 Benchmark_Driver: Support custom baseline branches
Support specifying a baseline branch to compare the current results
against. Previously, the master branch was hardcoded.

Fixes: rdar://problem/32751587
2017-06-14 15:28:52 -07:00