The Swift benchmarking harness now has two distinct output formats:
* Default: formatted text intended for human consumption.
  Right now, this is just the minimum value, but we can augment that.
* `--json`: each output line is a JSON-encoded object containing raw data.
  This information is intended for use by Python scripts that aggregate
  or compare multiple independent tests.
Previously, we tried to use the same output for both purposes. This required
the Python scripts to do complex parsing of textual layouts, and it also meant
that they had only summary data to work with instead of full raw sample
information. That in turn made it almost impossible to derive meaningful
comparisons between runs or to aggregate multiple runs.
Typical output in the new JSON format looks like this:
```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```
This format is easy to parse in Python: just iterate over the
lines and decode each one separately. Optional fields
(such as `"max_rss"` above) are also trivial to handle:
```
import json

for line in lines:
    record = json.loads(line)
    # Default to 0 if the optional field is not present
    max_rss = record.get("max_rss", 0)
```
Note the `"samples"` array includes the runtime for each individual run.
Because optional fields are so much easier to handle in this form, I reworked
the Python logic to translate the old formats into this JSON format for more
uniformity. Hopefully, we can simplify the code in a year or so by stripping
out the old log formats entirely, along with some of the redundant statistical
calculations. In particular, the Python logic still makes an effort to preserve
mean, median, max, min, stdev, and other statistical data whenever the full set
of samples is not present. Once we've gotten to a point where we're always
keeping full samples, we can compute any such information on the fly as needed,
eliminating the need to record it.
This is a pretty big rearchitecture of the core benchmarking logic. In order to
try to keep things a bit more manageable, I have not taken this opportunity to
replace any of the actual statistics used in the higher level code or to change
how the actual samples are measured. (But I expect this rearchitecture will make
such changes simpler.) In particular, this should not actually change any
benchmark results.
For the future, please keep this general principle in mind: statistical
summaries (averages, medians, etc.) should, as a rule, be computed for
immediate output and rarely if ever stored or used as input for other
processing. Instead, aim to store and transfer raw data from which statistics
can be recomputed as necessary.
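As a minimal illustration of this principle, any summary can be derived on demand from the stored raw samples (the sample values here are made up):

```
import statistics

samples = [1.23, 2.35, 1.41, 1.19]  # raw data, as stored in the JSON log
summary = {
    "min": min(samples),
    "median": statistics.median(samples),
    "mean": statistics.mean(samples),
    "stdev": statistics.stdev(samples),
    "max": max(samples),
}
```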
The `__future__` we relied on is now the present: all four of the features we
imported are included [since Python 3.0](https://docs.python.org/3/library/__future__.html):
* absolute_import
* print_function
* unicode_literals
* division
These import statements are no-ops under Python 3 and are no longer necessary.
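Concretely, import lines like this one (combined onto a single line for illustration) can simply be deleted:

```
from __future__ import absolute_import, division, print_function, unicode_literals
```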
For really small runtimes (< 20 μs), this method of setup overhead detection doesn't work: even a 1 μs change in a 20 μs runtime is 5%. In that case, just report no overhead.
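A minimal sketch of that guard (the names are hypothetical, not the harness's actual API):

```
MIN_RUNTIME_US = 20  # below this, noise swamps any overhead estimate

def setup_overhead(runtime_us, estimated_overhead_us):
    if runtime_us < MIN_RUNTIME_US:
        return 0  # a 1 μs swing is already 5% here; report no overhead
    return estimated_overhead_us
```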
Finished support for running all active tests in one batch; this returns a dictionary of `PerformanceTestResults`.
Known test names are passed to the harness in a compressed form, as test numbers.
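A sketch of the idea (the exact invocation is an assumption): referencing tests by number keeps the command line short, even with hundreds of benchmarks.

```
# Hypothetical mapping of test numbers to names, as in the JSON example above.
tests = {89: "PerfTest", 91: "OtherTest"}

# Pass the compressed form (numbers) instead of the full names.
cmd = ["Benchmark_O"] + [str(number) for number in sorted(tests)]
# cmd == ["Benchmark_O", "89", "91"]
```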
Lowered the default sample cap from 2k to 200. (This doesn't affect a manually specified `--num-samples` argument in the driver.)
Swift benchmarks have a fairly constant performance profile over time, so it's more beneficial to get multiple independent measurements quickly than to collect more samples from the same run.
This adds an optional `legacyFactor` to `BenchmarkInfo`, which allows for linear modification of constants that unnecessarily inflate the base workload of benchmarks, while maintaining the continuity of long-term benchmark tracking.
For example, if a benchmark uses `for _ in N*10_000` in its run function, we could lower this to `for _ in N*1_000` and add `legacyFactor: 10` to its `BenchmarkInfo`.
Note that this doesn't affect the real measurements gathered from the `--verbose` output. `BenchmarkDoctor` has been slightly adjusted to work with these real samples, so `Benchmark_Driver check` will not flag these benchmarks for the slow runtime reported in the summary if their real runtimes fall into the recommended range.
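One way the continuity could be maintained on the reporting side (an assumption for illustration, not a description of the actual implementation):

```
# Scale samples from the reduced workload back up by legacyFactor so that
# long-term tracking charts keep their historical scale (assumed behavior).
def legacy_adjusted(samples_us, legacy_factor):
    return [sample * legacy_factor for sample in samples_us]

legacy_adjusted([92.0, 95.5], legacy_factor=10)  # [920.0, 955.0]
```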
Add a check against unreasonably long setup times for benchmarks that do their initialization work in the `setUpFunction`. Given that a typical benchmark measurement lasts about 1 second, it's reasonable to expect the setup to take at most 20% on top of that: 200 ms.
The `DictionaryKeysContains*` benchmarks are an instance of this mistake. The setup of `DictionaryKeysContainsNative` takes 3 seconds on my machine, to prepare a dictionary for a run function whose typical runtime is 90 μs. The setup of the Cocoa version takes 8 seconds! It is trivial to rewrite these with much smaller dictionaries that demonstrate the point of these benchmarks perfectly well, without the need to wait ages for them to set up.
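The rule itself reduces to a simple threshold check; a sketch with hypothetical names:

```
MAX_SETUP_MS = 200  # 20% on top of the ~1 s measurement budget

def check_setup_time(benchmark_name, setup_ms):
    if setup_ms > MAX_SETUP_MS:
        print(f"warning: '{benchmark_name}' setup took {setup_ms} ms "
              f"(limit: {MAX_SETUP_MS} ms)")
```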
* Lowered the threshold for a healthy benchmark runtime to under 1000 μs.
* Offer a suitable divisor that is a power of 10, in addition to the one that is a power of 2 (see the sketch below).
* Expanded the motivation in the docstring.
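A sketch of the divisor suggestion from the second point (the function name is illustrative): find the smallest powers of 2 and 10 that bring the runtime under the 1000 μs threshold.

```
import math

def suggested_divisors(runtime_us, threshold_us=1000):
    ratio = runtime_us / threshold_us
    power_of_2 = 2 ** max(0, math.ceil(math.log2(ratio)))
    power_of_10 = 10 ** max(0, math.ceil(math.log10(ratio)))
    return power_of_2, power_of_10

suggested_divisors(140_000)  # (256, 1000); 140 ms / 256 ≈ 547 μs
```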
Since the meaning of some columns has changed while their overall number remained the same, let's include a header in the CSV log to make it clear that we now report MIN, Q1, MEDIAN, Q3, MAX, MAX_RSS instead of the old MIN, MAX, MEAN, SD, MEDIAN, MAX_RSS format.
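An illustrative header row (the leading `#` and `TEST` columns are an assumption; the statistics columns are the ones listed above):

```
#,TEST,MIN,Q1,MEDIAN,Q3,MAX,MAX_RSS
```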
Switched the measurement technique from gathering `i` independent samples characterized by their mean values to a finer-grained characterization of these measurements using quantiles.
The distribution of benchmark measurements is non-normal, with outliers that significantly inflate the mean and standard deviation due to the uncontrolled variable of system load. MEAN and SD were therefore the wrong statistics for characterizing benchmark measurements.
Benchmark_Driver now gathers more individual measurements from Benchmark_O. It is executed with `--num-iters=1` because we don't want averaged runtimes, we want raw data. This collects a variable number of measurements in about 1 second. Using `--quantile=20`, we get up to 20 measured values that properly characterize the empirical distribution of the benchmark from each independent run. The measurements from `i` independent executions are combined to form the final empirical distribution, which is reported as a five-number summary (MIN, Q1, MEDIAN, Q3, MAX).
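Combining runs and computing the five-number summary is straightforward once raw samples are kept; a sketch (`statistics.quantiles` requires Python 3.8+):

```
import statistics

def five_number_summary(runs):
    # Merge the raw samples from all independent runs into a single
    # empirical distribution.
    combined = sorted(sample for run in runs for sample in run)
    q1, median, q3 = statistics.quantiles(combined, n=4)
    return combined[0], q1, median, q3, combined[-1]

five_number_summary([[14.8, 19.7, 15.2], [15.0, 14.9, 21.3]])
```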
Renamed Benchmark_Driver's `iterations` argument to `independent-samples` to clarify its true meaning and disambiguate it from the concept of `num-iters` used in Benchmark_O. The short form of the argument, `-i`, remains unchanged.
Small fix following the last refactoring of MAX_RSS: the `--memory` option is now required to measure memory in `--verbose` mode. Added an integration test for the `check` command of Benchmark_Driver that depended on it.
Clean up after removing bogus aggregate statistics from the last line of the log. It makes more sense to report the total number of executed benchmarks as a sentence than to try to fit it into the format of the preceding table.
Added a test assertion that `run_benchmarks` returns a CSV-formatted log, as it is used to write the log to a file in `log_results`.
`BenchmarkDoctor` analyzes performance tests and reports their conformance to a set of desired criteria. The first two rules verify the naming convention.
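A sketch of what such naming checks might look like (the exact criteria here are an assumption):

```
import re

UPPER_CAMEL_CASE = re.compile(r"^[A-Z][a-zA-Z0-9]*$")

def naming_issues(name, max_length=40):
    issues = []
    if not UPPER_CAMEL_CASE.match(name):
        issues.append(f"'{name}' is not UpperCamelCase")
    if len(name) > max_length:
        issues.append(f"'{name}' is longer than {max_length} characters")
    return issues
```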
`BenchmarkDoctor` is invoked from `Benchmark_Driver` with the `check` argument.