Commit Graph

78 Commits

Author SHA1 Message Date
Pavol Vaskovic
84e7d4dfb8 [benchmark] Adjust Driver’s console output format
…to handle longer benchmark names, assuming a maximum length of 40 characters.
2019-02-19 23:28:51 +01:00
Pavol Vaskovic
2c271493d5 [benchmark] Limit of Accuracy in Setup Overhead
Clarified limit of accuracy in setup overhead detection.
2019-01-09 18:01:06 +01:00
Pavol Vaskovic
2096151ee9 [benchmark] BenchmarkDoctor: Lower runtime limit
Warn about runtimes under 20 μs and flag 0 μs runtimes as errors.
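Below is a minimal Python sketch of these two limits. The helper name and its return labels are hypothetical. Only the 20 μs and 0 μs thresholds come from this commit.

```python
def classify_runtime(runtime_us):
    """Classify a benchmark runtime (in μs) against the lower limits.

    Hypothetical helper: only the thresholds come from the commit message.
    """
    if runtime_us == 0:
        return "error"    # a 0 μs runtime is flagged as an error
    if runtime_us < 20:
        return "warning"  # runtimes under 20 μs only produce a warning
    return "ok"
```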
2019-01-08 19:16:40 +01:00
Pavol Vaskovic
8a8a3ad6df [benchmark] Limit setup overhead detection (>20)
For very small runtimes (< 20 μs), this method of setup overhead detection doesn’t work: even a 1 μs change in a 20 μs runtime is 5%. Just return no overhead.
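The commit doesn’t describe the detection method itself. The sketch below assumes the overhead is estimated from the minimum per-iteration runtimes measured with `--num-iters=1` (t1) and `--num-iters=2` (t2). A one-time setup cost s adds s to t1 but only s/2 to t2, so s = 2 * (t1 - t2). The function name is hypothetical.

```python
def setup_overhead(t1_us, t2_us):
    """Estimate one-time setup overhead in μs and its share of t1.

    t1_us: minimum per-iteration runtime measured with --num-iters=1
    t2_us: minimum per-iteration runtime measured with --num-iters=2
    Hypothetical sketch; the estimation model is an assumption.
    """
    # Below ~20 μs a 1 μs change is already 5%, so report no overhead.
    if t2_us <= 20:
        return 0, 0.0
    # Setup s adds s to the single-iteration time but only s/2 per
    # iteration when two iterations share it, hence s = 2 * (t1 - t2).
    setup = int(round(2.0 * (t1_us - t2_us)))
    ratio = setup / float(t1_us) if t1_us > 0 else 0.0
    return setup, ratio
```

For example, `setup_overhead(110, 105)` estimates 10 μs of setup, about 9% of t1. The 5% threshold introduced in 06061976da would flag that.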
2019-01-08 19:15:29 +01:00
Pavol Vaskovic
4a716445df [benchmark] BenchmarkDriver run in batch mode
Finished support for running all active tests in one batch. Returns a dictionary of PerformanceTestResults.

Known test names are passed to the harness in a compressed form as test numbers.
2019-01-07 20:59:39 +01:00
Pavol Vaskovic
df3389259b [benchmark] BenchmarkDriver: store test_numbers 2019-01-07 20:57:47 +01:00
Pavol Vaskovic
1f58ad6662 [Gardening] Better names: _tests_by_name_or_number 2019-01-07 20:57:47 +01:00
Pavol Vaskovic
3023ab5545 [benchmark] BenchmarkDriver sample_time support
Added support for Benchmark_X’s `--sample-time` parameter.
2019-01-07 20:57:42 +01:00
Pavol Vaskovic
b831f93dd4 [benchmark] BenchmarkDoctor: Optional markdown arg
Don't require the presence of `markdown` argument for initialization.
(It doesn't exist when BenchmarkDoctor is used from `run_smoke_bench`.)
2018-12-21 21:26:24 +01:00
Pavol Vaskovic
46f94d7709 [benchmark] BenchmarkDriver check --markdown
Added `--markdown` flag for the `check` command to output the `BenchmarkDoctor`’s report in the Markdown format (as used by swift-ci on GitHub).
2018-12-21 01:22:38 +01:00
Andrew Trick
5154886491 Merge pull request #20334 from palimondo/within-cells-interlinked
[benchmark] Naming Convention
2018-12-13 08:20:23 -08:00
Pavol Vaskovic
9d6f7ad160 [benchmark] Driver & Doctor: Lower the sample cap
Lowered the default sample cap from 2k to 200. (This doesn’t affect a manually specified `--num-samples` argument in the driver.)

Swift benchmarks have a fairly constant performance profile over time. It’s more beneficial to get multiple independent measurements faster than to get more samples from the same run.
2018-12-07 15:06:43 +01:00
Pavol Vaskovic
92cf40dcd3 [benchmark] MarkdownReportHandler
`logging.Handler` that creates a nicely formatted report from `BenchmarkDoctor`’s `check` as a Markdown table for display on GitHub.
2018-11-27 22:55:02 +01:00
Pavol Vaskovic
9a04207735 [benchmark] Doctor: emit mem_page details info
Promoted the previously DEBUG-level message to INFO.
2018-11-27 22:49:07 +01:00
Pavol Vaskovic
bc0064d285 [benchmark] Simpler naming convention regex 2018-11-19 10:03:33 +01:00
Pavol Vaskovic
b4f901bae4 [benchmark] Naming Convention
New benchmark naming convention for better readability, and an improved naming system that accounts for the growth of performance coverage going forward.
2018-11-05 22:44:50 +01:00
Pavol Vaskovic
a7f832fb57 [benchmark] Legacy factor
This adds an optional `legacyFactor` to `BenchmarkInfo`. It allows for linear modification of constants that unnecessarily inflate the base workload of benchmarks, while maintaining the continuity of long-term benchmark tracking.

For example, if a benchmark uses `for _ in 1...N*10_000` in its run function, we could lower this to `for _ in 1...N*1_000` and add `legacyFactor: 10` to its `BenchmarkInfo`.

Note that this doesn’t affect the real measurements gathered from the `--verbose` output. `BenchmarkDoctor` has been slightly adjusted to work with these real samples. Therefore `Benchmark_Driver check` will not flag these benchmarks for the slow runtime reported in the summary if their real runtimes fall within the recommended range.
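The sketch below shows how the summary can stay comparable with historical results. It assumes the reported runtime is simply the measured runtime of the reduced workload multiplied by `legacyFactor`. The helper is hypothetical.

```python
def scale_for_legacy(samples_us, legacy_factor=1):
    """Scale runtimes of a reduced workload back to the legacy baseline.

    Assumption: the workload was divided by `legacy_factor`, so multiplying
    the measured runtimes by the same factor keeps reported numbers
    comparable with historical results. Hypothetical helper.
    """
    return [s * legacy_factor for s in samples_us]
```

With the example above, a measured 120 μs for the `N*1_000` workload would be reported as 1200 μs. The doctor would still check the unscaled 120 μs.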
2018-11-01 06:24:27 +01:00
Pavol Vaskovic
a24d0ff7a5 [benchmark] BenchmarkDoctor checks setup time
Add a check against unreasonably long setup times for benchmarks that do their initialization work in the `setUpFunction`. Typical benchmark measurements last about 1 second, so it’s reasonable to expect the setup to add at most 20% on top of that: 200 ms.

The `DictionaryKeysContains*` benchmarks are an instance of this mistake. On my machine, the setup of `DictionaryKeysContainsNative` takes 3 seconds to prepare a dictionary for the run function, whose typical runtime is 90 μs. The setup of the Cocoa version takes 8 seconds! It is trivial to rewrite these with much smaller dictionaries that demonstrate the point of these benchmarks perfectly well, without waiting ages for their setup.
2018-10-15 09:06:38 +02:00
Pavol Vaskovic
638f4f8e5e [benchmark] Recommended runtime should be < 1ms
* Lowered the threshold for healthy benchmark runtime to be under 1000 μs.
* Offer a suitable divisor that is a power of 10, in addition to the one that’s a power of 2.
* Expanded the motivation in the docstring.
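The sketch below computes both divisors. It assumes that "suitable" means the smallest power of the base that brings the runtime under 1000 μs. The function name is hypothetical.

```python
def suitable_divisor(runtime_us, base, threshold_us=1000):
    """Return the smallest power of `base` that brings runtime under threshold.

    Repeated integer multiplication avoids floating-point rounding from
    logarithms at exact powers. Hypothetical helper.
    """
    divisor = 1
    while runtime_us >= threshold_us * divisor:
        divisor *= base
    return divisor
```

For a 2500 μs benchmark, this suggests dividing the workload by 10 (power of 10) or by 4 (power of 2).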
2018-10-13 22:09:25 +02:00
Pavol Vaskovic
d9a89ffea2 [benchmark] Use header in CSV log
The meaning of some columns changed but their number stayed the same. So let’s include the header in the CSV log to make clear that we now report MIN, Q1, MEDIAN, Q3, MAX, MAX_RSS instead of the old MIN, MAX, MEAN, SD, MEDIAN, MAX_RSS format.
2018-10-12 10:03:33 +02:00
Pavol Vaskovic
a04edd1d47 [benchmark] Quantiles in Benchmark_Driver
Switched the measurement technique from gathering `i` independent samples characterized by their mean values to a finer-grained characterization of these measurements using quantiles.

The distribution of benchmark measurements is non-normal. It has outliers that significantly inflate the mean and standard deviation, caused by an uncontrolled variable: the system load. Therefore MEAN and SD were the wrong statistics for characterizing the benchmark measurements.

Benchmark_Driver now gathers more individual measurements from Benchmark_O. It is executed with `--num-iters=1` because we don’t want to average the runtimes; we want raw data. This collects a variable number of measurements in about 1 second. Using `--quantile=20`, we get up to 20 measured values from each independent run that properly characterize the benchmark’s empirical distribution. The measurements from `i` independent executions are combined to form the final empirical distribution, which is reported as a five-number summary (MIN, Q1, MEDIAN, Q3, MAX).
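The sketch below combines samples from `i` independent runs into the five-number summary. It uses the nearest-rank quantile definition, which is an assumption; the driver’s exact quantile computation may differ.

```python
import math


def five_number_summary(runs):
    """Combine samples from independent runs into (MIN, Q1, MEDIAN, Q3, MAX).

    `runs` holds one list of samples per independent benchmark execution.
    Quantiles use the nearest-rank method (an assumption).
    """
    samples = sorted(s for run in runs for s in run)
    if not samples:
        raise ValueError("no samples")
    n = len(samples)

    def nearest_rank(p):
        # Smallest sample with at least p * n samples at or below it.
        return samples[max(0, math.ceil(p * n) - 1)]

    return (samples[0], nearest_rank(0.25), nearest_rank(0.5),
            nearest_rank(0.75), samples[-1])
```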
2018-10-11 18:56:27 +02:00
Pavol Vaskovic
0438c45e2d [benchmark] B_D iterations => independent-samples
Renamed Benchmark_Driver’s `iterations` argument to `independent-samples` to clarify its true meaning and disambiguate it from the concept of `num-iters` used in Benchmark_O. The short form of the argument, `-i`, remains unchanged.
2018-10-11 18:56:27 +02:00
Pavol Vaskovic
9bd599a914 [benchmark] Doctor explicitly measures memory
Small fix following the last refactoring of MAX_RSS: the `--memory` option is required to measure memory in `--verbose` mode. Added an integration test for the `check` command of Benchmark_Driver, which depended on it.
2018-09-14 23:40:43 +02:00
Ben Langmuir
423e145b0c Revert "[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening" 2018-09-14 13:24:01 -07:00
Pavol Vaskovic
84bf15836d [benchmark] Doctor explicitly measures memory
Small fix following the last refactoring of MAX_RSS: the `--memory` option is required to measure memory in `--verbose` mode. Added an integration test for the `check` command of Benchmark_Driver, which depended on it.
2018-09-06 18:21:50 +02:00
Pavol Vaskovic
13c499339b [benchmark] Class descriptions in module docstring 2018-08-23 19:59:15 +02:00
Pavol Vaskovic
49e8e692fb [benchmark] Strangle run and run_benchmarks
Moved all `run` command related functionality to `BenchmarkDriver`.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
6bddcbe9e4 [benchmark] Refactor run_benchmarks log format 2018-08-23 18:01:46 +02:00
Pavol Vaskovic
7ae5d7754c [benchmark] Report totals as a sentence
Clean up after removing bogus aggregate statistics from the last line of the log. It makes more sense to report the total number of executed benchmarks as a sentence than to try to fit it into the format of the preceding table.

Added a test assertion that `run_benchmarks` returns a CSV-formatted log, because `log_results` uses it to write the log to a file.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
ef1461ca46 [benchmark] Strangle log_results
Moved `log_results` to BenchmarkDriver.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
a10b6070dd [benchmark] Refactor log_results
Added tests for `log_results` and the *space-justified-columns* format emitted to stdout while logging to a file.
2018-08-23 18:01:46 +02:00
Pavol Vaskovic
1d3fa87fdd [benchmark] Strangle log_results -> log_file
Moved the `log_file` path construction to the `BenchmarkDriver`.
Retired `get_*_git_*` functions.
2018-08-23 18:01:41 +02:00
Pavol Vaskovic
0d64386b53 [benchmark] Documentation improvements
Improving compliance with
PEP 257 -- Docstring Conventions
https://www.python.org/dev/peps/pep-0257/
2018-08-23 11:45:43 +02:00
Pavol Vaskovic
f38e6df914 [benchmark] Doctor verifies constant memory use
This still needs a function that approximates the normal range based on the memory used.
2018-08-20 16:52:07 +02:00
Pavol Vaskovic
06061976da [benchmark] BenchmarkDoctor checks setup overhead
Detect setup overhead in benchmark and report if it exceeds 5%.
2018-08-17 08:50:04 +02:00
Pavol Vaskovic
7725c0096e [benchmark] Measure and analyze benchmark runtimes
`BenchmarkDoctor` measures benchmark execution (using `BenchmarkDriver`) and verifies that their runtime stays under 2500 microseconds.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
ab16999e20 [benchmark] Created BenchmarkDoctor (naming)
`BenchmarkDoctor` analyzes performance tests and reports their conformance to the set of desired criteria. First two rules verify the naming convention.

`BenchmarkDoctor` is invoked from `Benchmark_Driver` with the `check` argument.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
076415f969 [benchmark] Strangler run_benchmarks
Replaced the guts of the `run_benchmarks` function with the implementation from `BenchmarkDriver`. There was only a single client, which called it with `verbose=True`, so this parameter could be safely removed.

The function `instrument_test` is replaced by running `Benchmark_O` with the `--memory` option. That option implements the MAX_RSS measurement while also excluding the overhead of the benchmarking infrastructure. The incorrect computation of standard deviation was simply dropped for measurements of more than one independent sample. Bogus aggregated `Totals` statistics were removed; only the total number of executed benchmarks is reported now.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
a84db83062 [benchmark] BenchmarkDriver can run tests
The `run` method on `BenchmarkDriver` invokes the test harness with the specified number of iterations and samples. It supports measuring memory use. In verbose mode it also collects individual samples and monitors the system load by counting voluntary and involuntary context switches.

Output is parsed using `LogParser` from `compare_perf_tests.py`. This makes that file a required dependency for the driver, therefore it is also copied to the bin directory during the build.
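The sketch below assembles such a harness invocation. The flag spellings `--num-iters`, `--num-samples`, `--memory` and `--verbose` are the Benchmark_O options named elsewhere in this log. The function and its argument order are hypothetical and not `BenchmarkDriver`’s actual method.

```python
def build_command(binary, test, num_iters=None, num_samples=None,
                  measure_memory=False, verbose=False):
    """Assemble the argument list for running one test in the harness.

    Hypothetical sketch; only the flag names come from the commit log.
    """
    cmd = [binary, test]
    if num_iters is not None:
        cmd.append("--num-iters={0}".format(num_iters))
    if num_samples is not None:
        cmd.append("--num-samples={0}".format(num_samples))
    if measure_memory:
        cmd.append("--memory")   # report MAX_RSS
    if verbose:
        cmd.append("--verbose")  # individual samples and context switches
    return cmd
```

The resulting list could be passed to `subprocess.check_output`, and the output would then go to `LogParser`.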
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
ce39b12929 [benchmark] Strangler: BenchmarkDriver get_tests
See https://www.martinfowler.com/bliki/StranglerApplication.html for more info on the used pattern for refactoring legacy applications.

Introduced class `BenchmarkDriver` as a beginning of strangler application that will gradually replace old functions. Used it instead of `get_tests()` function in Benchmark_Driver.

The interaction with Benchmark_O is simulated through mocking. `SubprocessMock` class records the invocations of command line processes and responds with canned replies in the format of Benchmark_O output.
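The following is a simplified illustration of the mocking idea described above. Its `expect` and `check_output` methods are hypothetical and need not match the real test helper.

```python
class SubprocessMock(object):
    """Record command invocations and answer with canned replies.

    Simplified sketch; method names are hypothetical.
    """

    def __init__(self):
        self.calls = []      # every argument list that was "executed"
        self.responses = {}  # argument tuple -> canned output

    def expect(self, args, response):
        """Register the canned output for a given argument list."""
        self.responses[tuple(args)] = response

    def check_output(self, args):
        """Stand-in for subprocess.check_output: record, then reply."""
        key = tuple(args)
        self.calls.append(key)
        if key not in self.responses:
            raise AssertionError("unexpected invocation: {0}".format(args))
        return self.responses[key]
```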

Removed 3 redundant lit tests that are now covered by the unit test `test_gets_list_of_all_benchmarks_when_benchmarks_args_exist`. This saves 3 seconds of test execution. Kept only a single integration test that verifies the plumbing is connected correctly.
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
69d5d5e732 [benchmark] Adding tests for BenchmarkDriver
The imports are a bit sketchy because it doesn’t have `.py` extension and they had to be hacked manually. :-/

Extracted `parse_args` from `main` and added test coverage for argument parsing.
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
343f284227 [benchmark] Removed legacy submit command
Also removed unused imports.
2018-08-16 20:08:34 +02:00
Pavol Vaskovic
7f894268b2 [benchmark] Restore running benchmarks by numbers
Reintroduced feature lost during `BenchmarkInfo` modernization: All registered benchmarks are ordered alphabetically and assigned an index. This number can be used as a shortcut to invoke the test instead of its full name. (Adding and removing tests from the suite will naturally reassign the indices, but they are stable for a given build.)

The `--list` parameter now prints the test *number*, *name* and *tags* separated by delimiter.

The `--list` output format is modified from:
````
Enabled Tests,Tags
AngryPhonebook,[String, api, validation]
...
````
to this:
````
\#,Test,[Tags]
2,AngryPhonebook,[String, api, validation]
…
````
(The actual output has no backslash before the #; git was eating the whole line without it.)
Note: Test number 1 is Ackermann, which is marked as “skip”, so it’s not listed with the default `skip-tags` value.

Fixes the issue where running tests via `Benchmark_Driver` always reported each test as number 1: each test is run independently, so every invocation was “first”. Restoring test numbers brings back the original behavior. The number reported in the first column when executing the tests is the test’s ordinal number in the Swift Benchmark Suite.
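The sketch below shows the numbering scheme and how a driver might accept either a name or a number. It assumes plain string ordering for "alphabetically"; the harness’s exact collation may differ. Both function names are hypothetical.

```python
def number_tests(names):
    """Assign 1-based ordinal numbers to benchmarks in alphabetical order."""
    return {name: i for i, name in enumerate(sorted(names), start=1)}


def resolve(arg, numbers):
    """Map a test name or its ordinal number (as a string) to a test name.

    Returns None for unknown names or numbers.
    """
    if arg.isdigit():
        by_number = {n: name for name, n in numbers.items()}
        return by_number.get(int(arg))
    return arg if arg in numbers else None
```

Because the numbers are derived from the sorted list, adding or removing a test shifts them, but they stay stable within one build.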
2018-07-11 23:17:02 +02:00
Pavol Vaskovic
d82c996669 [benchmark] Fixed Benchmark_Driver running tests
Fixed a failure in `get_tests`, which depended on the removed `Benchmark_O --run-all` option for listing all tests (not just the pre-commit set).

Fix: Restored the ability to run tests by ordinal number from `Benchmark_Driver` after the support for this was removed from `Benchmark_O`.

Added tests that verify the output format of `Benchmark_O --list` and the support for the `--skip-tags= ` option, which effectively replaced the old `--run-all` option. Other tools, like `Benchmark_Driver`, depend on it.
Added integration tests for the dependency between `Benchmark_Driver` and `Benchmark_O`.

Running pre-commit test set isn’t tested explicitly here. It would take too long and it is run fairly frequently by CI bots, so if that breaks, we’ll know soon enough.
2018-07-11 23:17:02 +02:00
Karoy Lorentey
11515c1676 [benchmark] Driver: stabilize Dictionary/Set benchmarks
Disable the random hash seed while benchmarking. By its nature, it makes the number of hash collisions fluctuate between runs, adding unnecessary noise to benchmark results.

I expect we'll be able to re-enable random seeding here once we have made hash collisions cheaper -- they are currently always resolved by calling the Key's Equatable implementation, which can be expensive.
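The sketch below prepares the environment for benchmark subprocesses. `SWIFT_DETERMINISTIC_HASHING` is the Swift runtime’s environment variable for disabling per-process hash seeding. Whether the driver sets it exactly this way is an assumption, and the helper name is hypothetical.

```python
import os


def deterministic_env(base=None):
    """Copy an environment and disable Swift's random hash seeding.

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["SWIFT_DETERMINISTIC_HASHING"] = "1"
    return env

# Usage (binary path assumed):
# subprocess.check_output(["./Benchmark_O", "DictionaryKeysContainsNative"],
#                         env=deterministic_env())
```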
2018-03-19 17:38:29 +00:00
Erik Eckstein
45a2ae48ce benchmarks: replace the Ounchecked build with an Osize build
We don't measure Ounchecked anymore. On the other hand we want to benchmark the Osize build.
2017-10-06 14:09:43 -07:00
Michael Gottesman
dfc780a744 [benchmark][driver] Teach the Benchmark_Driver how to parse ./Benchmark_O{,none} --list now that tags are output as well. 2017-09-27 19:14:52 -07:00
Pavol Vaskovic
9c51a48917 Fix: Run benchmarks just once 2017-06-27 18:47:35 +02:00
Luke Larson
6944c20e13 Benchmark_Driver: Support custom baseline branches
Support specifying a baseline branch to compare the current results
against. Previously, the master branch was hardcoded.

Fixes: rdar://problem/32751587
2017-06-14 15:28:52 -07:00
Pavol Vaskovic
97d6f8dc5e Support for running benchmarks by ordinal number
Add support for running benchmarks by referring to them by their ordinal number in `Benchmark_Driver`, as is supported by `Benchmark_O` (`Onone`, `Ounchecked`).

Updated documentation to reflect this.
2017-06-12 20:50:00 +02:00