Commit Graph

100 Commits

Author SHA1 Message Date
Alexander Cyon
9d04bfd848 [benchmark] Fix typos 2024-07-06 13:17:13 +02:00
Tim Kientzle
2a3e68a1f8 Match new benchmark driver default output 2022-11-04 16:16:37 -07:00
Tim Kientzle
998475bf80 Pylint cleanup, more comments 2022-11-04 14:02:03 -07:00
Tim Kientzle
b4fa3833d8 Comment some TODO items 2022-11-04 14:02:03 -07:00
Tim Kientzle
520fd79efd Fix some test failures
The new code stores test numbers as numbers (not strings), which
requires a few adjustments. I also apparently missed a few test updates.
2022-11-04 14:02:03 -07:00
Tim Kientzle
971a5d8547 Overhaul Benchmarking pipeline to use complete sample data, not summaries
The Swift benchmarking harness now has two distinct output formats:

* Default: Formatted text that's intended for human consumption.
  Right now, this is just the minimum value, but we can augment that.

* `--json`: each output line is a JSON-encoded object that contains raw data.
  This information is intended for use by Python scripts that aggregate
  or compare multiple independent tests.

Previously, we tried to use the same output for both purposes.  This required
the Python scripts to do more complex parsing of textual layouts, and also meant
that the Python scripts had only summary data to work with instead of full raw
sample information.  This in turn made it almost impossible to derive meaningful
comparisons between runs or to aggregate multiple runs.

Typical output in the new JSON format looks like this:
```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```

This format is easy to parse in Python.  Just iterate over
lines and decode each one separately. Also note that the
optional fields (`"max_rss"` above) are trivial to handle:
```
import json

for line in lines:
    j = json.loads(line)
    # Default to 0 if "max_rss" is not present
    max_rss = j.get("max_rss", 0)
```
Note the `"samples"` array includes the runtime for each individual run.

Because optional fields are so much easier to handle in this form, I reworked
the Python logic to translate old formats into this JSON format for more
uniformity.  Hopefully, we can simplify the code in a year or so by stripping
out the old log formats entirely, along with some of the redundant statistical
calculations.  In particular, the Python logic still makes an effort to preserve
mean, median, max, min, stdev, and other statistical data whenever the full set
of samples is not present.  Once we've gotten to a point where we're always
keeping full samples, we can compute any such information on the fly as needed,
eliminating the need to record it.
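
For illustration, here is a minimal sketch of recomputing summaries on demand from raw samples (the sample values are made up; the exact set of statistics the scripts report may differ):
```
import statistics

samples = [1.23, 2.35, 1.71, 1.48]  # hypothetical raw runtimes
summary = {
    "min": min(samples),
    "max": max(samples),
    "mean": statistics.mean(samples),
    "median": statistics.median(samples),
    # stdev is undefined for a single sample
    "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
}
```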

This is a pretty big rearchitecture of the core benchmarking logic. In order to
try to keep things a bit more manageable, I have not taken this opportunity to
replace any of the actual statistics used in the higher level code or to change
how the actual samples are measured. (But I expect this rearchitecture will make
such changes simpler.) In particular, this should not actually change any
benchmark results.

For the future, please keep this general principle in mind: Statistical
summaries (averages, medians, etc) should as a rule be computed for immediate
output and rarely if ever stored or used as input for other processing. Instead,
aim to store and transfer raw data from which statistics can be recomputed as
necessary.
2022-11-04 14:02:03 -07:00
Andrew Trick
f09cc8cc8b Fix compare_perf_tests.py for running locally.
The script defaulted to a mode that no one uses, without checking
whether the input was compatible with that mode.

This is the script used for run-to-run comparison of benchmark
results. The in-tree benchmarks happened to work with the script only
because of a fragile string comparison buried deep within the
script. Other out-of-tree benchmark scripts that generate results were
silently broken when using this script for comparison.
2022-05-12 16:50:32 -07:00
Josh Soref
fa3ff899a9 Spelling benchmark (#42457)
* spelling: approximate

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: available

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: benchmarks

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: between

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: calculation

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: characterization

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: coefficient

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: computation

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: deterministic

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: divisor

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: encounter

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: expected

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: fibonacci

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: fulfill

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: implements

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: into

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: intrinsic

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: markdown

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: measure

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: occurrences

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: omitted

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: partition

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: performance

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: practice

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: preemptive

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: repeated

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: requirements

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: requires

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: response

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: supports

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: unknown

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: utilities

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: verbose

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

Co-authored-by: Josh Soref <jsoref@users.noreply.github.com>
2022-04-25 09:02:06 -07:00
Daniel Duan
3dfc40898c [NFC] Remove Python 2 imports from __future__ (#42086)
The `__future__` features we relied on are all included
[since Python 3.0](https://docs.python.org/3/library/__future__.html):

* absolute_import
* print_function
* unicode_literals
* division

These import statements are no-ops and are no longer necessary.
2022-04-13 14:01:30 -07:00
swift-ci
32a967f1ea Merge pull request #39171 from eltociear/patch-22 2022-01-13 07:01:02 -08:00
Evan Wilde
6956b7c5c9 Replace /usr/bin/python with /usr/bin/env python
/usr/bin/python doesn't exist on Ubuntu 20.04, causing tests to fail.
I've updated the shebangs everywhere to use `/usr/bin/env python`
instead.
2021-09-28 10:05:05 -07:00
Ikko Ashimine
c48f6e09bb [benchmark] Fix typo in compare_perf_tests.py
formating -> formatting
2021-09-04 09:10:34 +09:00
tbkka
3181dd1e4c Fix a bunch of python lint errors (#32951)
* Fix a bunch of python lint errors

* adjust indentation
2020-07-17 14:30:21 -07:00
Sergej Jaskiewicz
cce9e81f0b Support Python 3 in the benchmark suite 2020-02-28 01:45:35 +03:00
Ross Bayer
b1961745e0 [Python: black] Reformatted the benchmark Python sources using utils/python_format.py. 2020-02-08 15:32:44 -08:00
Pavol Vaskovic
cc0e16ca34 [benchmark] LogParser: measurement metadata 2019-07-23 19:44:41 +02:00
Pavol Vaskovic
007d398f4a [Gardening] ReportFormatter: tying up loose ends 2019-05-24 00:18:44 +02:00
Pavol Vaskovic
b3f7996ea7 [benchmark] ReportFormatter: better inline headers
Improve inline headers in `single_table` mode to also print labels for the numeric columns.

Sections in the `single_table` are visually distinguished by a separator row preceding the inline headers.

Separated header label styles for git and markdown modes with UPPERCASE and **Bold** formatting respectively.

Inlined section template definitions.
2019-05-23 23:24:51 +02:00
Pavol Vaskovic
9750581bf5 [benchmark] ReportFormatter: right-align num cols 2019-05-23 19:32:34 +02:00
Pavol Vaskovic
af7ef03aaf [benchmark] ReportFormatter: refactor header logic
Confine the logic for printing headers to the header function.
2019-05-23 17:28:21 +02:00
Pavol Vaskovic
a998e18e18 [benchmark] ReportFormatter: faster templating
It is slightly faster to simply concatenate strings that don’t require special formatting.
2019-05-23 12:29:19 +02:00
Pavol Vaskovic
49d25bfc51 [benchmark] ReportFormatter: de-tuple
Remove unnecessary list-to-tuple conversions.
2019-05-23 12:20:19 +02:00
Pavol Vaskovic
691007b029 [benchmark] LogParser: Accept -?! in bench. names
Extend the parser to accept benchmark names that include `-?!`, fully supporting the new Naming Convention from PR #20334.
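
A sketch of the idea (the exact pattern in LogParser isn't shown here, so this regex and the benchmark names are illustrative):
```
import re

# Widen the name character class to also accept '-', '?' and '!'.
BENCHMARK_NAME = re.compile(r"[\w.\-?!]+")

assert BENCHMARK_NAME.fullmatch("Breadcrumbs.MutatedUTF16ToIdx-ASCII")
assert BENCHMARK_NAME.fullmatch("Dictionary.removeValue!")
```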
2019-02-19 23:31:58 +01:00
Erik Eckstein
040aa06fec benchmarks: combine everything which is needed into run_smoke_bench
Now, run_smoke_bench runs the benchmarks, compares performance and code size and reports the results - on stdout and as a markdown file.
No need to run bench_code_size.py and compare_perf_tests.py separately.

This has two benefits:
- It's much easier to run it locally
- It's now more transparent what's happening in '@swiftci benchmark', because all the logic is now in run_smoke_bench rather than in a script on the CI bot that isn't visible.

I also removed the branch arguments from ReportFormatter in compare_perf_tests.py. They were not used anyway.

For a smooth rollout in CI, I created a new script rather than changing the existing one. Once everything is set up in CI, I'll delete the old run_smoke_test.py and bench_code_size.py.
2018-11-01 16:41:39 -07:00
Pavol Vaskovic
897b9ef82e [benchmark] Gardening: Fix linter nitpicks 2018-10-27 06:15:23 +02:00
Pavol Vaskovic
397c44747b [benchmark] Exclude outliers from sample
Use the box-plot-inspired technique for filtering out outlier measurements. Values higher than the top inner fence (TIF = Q3 + 1.5 * IQR) are excluded from the sample.
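
A minimal sketch of the rule, assuming Python 3.8's `statistics.quantiles` (the script's own quartile estimator may differ):
```
import statistics

def exclude_top_outliers(samples):
    # Top inner fence: Q3 + 1.5 * IQR
    q1, _, q3 = statistics.quantiles(samples, n=4)
    tif = q3 + 1.5 * (q3 - q1)
    return [s for s in samples if s <= tif]
```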
2018-10-11 19:48:20 +02:00
Pavol Vaskovic
0d318b6464 [benchmark] Discard oversampled quantile values
When num_samples is less than quantile + 1, some of the measurements are repeated in the report summary. Parsed samples should strive to be a true reflection of the measured distribution, so we’ll correct this by discarding the repeated artifacts from quantile estimation.

This avoids introducing a bias from this oversampling into the empirical distribution obtained from merging independent samples.
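
A sketch of the idea (illustrative only, not the script's exact rule):
```
def discard_oversampled(quantiles, num_samples):
    # With fewer samples than reported quantiles, adjacent entries
    # that repeat the same sample are oversampling artifacts.
    if num_samples >= len(quantiles):
        return quantiles
    result = [quantiles[0]]
    for q in quantiles[1:]:
        if q != result[-1]:  # collapse consecutive repeats
            result.append(q)
    return result
```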

See also:
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
2018-10-11 18:56:27 +02:00
Pavol Vaskovic
61a092a695 [benchmark] LogParser delta quantiles support
Support for reading the delta-encoded quantiles format.
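
Delta encoding stores each quantile as the increase over the previous one, so decoding is a running sum. A sketch (assuming empty fields encode a zero delta; the field values are made up):
```
from itertools import accumulate

fields = ["101", "2", "", "5"]            # hypothetical log fields
deltas = [int(f) if f else 0 for f in fields]
values = list(accumulate(deltas))         # [101, 103, 103, 108]
```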
2018-10-11 18:56:27 +02:00
Pavol Vaskovic
012e07cdd2 [benchmark] LogParser support for quantile format
Gather all samples published in the benchmark summary from the `Benchmark_O --quantile` output format.
2018-10-09 15:52:28 +02:00
Pavol Vaskovic
9ba571f641 [benchmark] Parse yield timings from verbose log 2018-10-09 09:52:14 +02:00
Pavol Vaskovic
0f25849f8c [benchmark] Parse setup time from verbose log 2018-10-09 09:52:06 +02:00
Pavol Vaskovic
78159e1fe3 [benchmark] Fix drop mean and sd on merge 2018-10-09 09:50:45 +02:00
Pavol Vaskovic
f729b8e623 [benchmark] Fix merging max_rss when None 2018-10-06 11:43:56 +02:00
Pavol Vaskovic
a9f0ce4338 [benchmark] Fix quantile estimation type
The correct quantile estimation type for printing all measurements in the summary report when `quantile == num-samples - 1` is R-1, SAS-3. It's the inverse of the empirical distribution function.

References:
* https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
* discussion in https://github.com/apple/swift/pull/19097#issuecomment-421238197
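
For reference, a sketch of the R-1 estimator (inverse empirical CDF) on a sorted sample:
```
import math

def quantile_r1(sorted_samples, p):
    # Inverse empirical CDF: 1-based rank ceil(n * p)
    if p <= 0:
        return sorted_samples[0]
    n = len(sorted_samples)
    return sorted_samples[math.ceil(n * p) - 1]
```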
2018-09-20 09:19:07 +02:00
Ben Langmuir
541c48f9e4 Merge pull request #19328 from palimondo/test-twice-commit-once
[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening (take 2)
2018-09-17 11:54:08 -07:00
Pavol Vaskovic
f0e7b8737a [benchmark] Round quantile idx to nearest or even
Explicitly use the round-half-to-even rounding algorithm to match the behavior of numpy's quantile(interpolation='nearest') and quantile estimation type R-3, SAS-2. See:
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
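
A sketch of that estimator; Python's built-in round() already rounds half to even, so it can be used directly:
```
def quantile_r3(sorted_samples, p):
    n = len(sorted_samples)
    h = n * p - 0.5                   # 1-based fractional rank
    rank = min(max(round(h), 1), n)   # round half to even, then clamp
    return sorted_samples[rank - 1]

assert round(1.5) == round(2.5) == 2  # half-to-even tie-breaking
```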
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
e48b5fdb34 [benchmark] Fix index computation for quantiles
It turns out that both the old code in `DriverUtils` that computed the median and the newer quartiles in `PerformanceTestSamples` had an off-by-one error.

It truly is the 3rd of the 2 hard things in computer science!
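
The fix itself isn't shown here, but for reference, correct 0-based indexing for the median looks like this sketch:
```
def median(sorted_samples):
    n = len(sorted_samples)
    mid = n // 2
    if n % 2:  # odd count: single middle element
        return sorted_samples[mid]
    return (sorted_samples[mid - 1] + sorted_samples[mid]) / 2
```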
2018-09-14 23:40:43 +02:00
Ben Langmuir
423e145b0c Revert "[benchmark] Report Quantiles from Benchmark_O and a TON of Gardening" 2018-09-14 13:24:01 -07:00
Pavol Vaskovic
2ad8bf732a [benchmarks] Rename column label SPEEDUP to RATIO
Since the results comparisons are now used to also compare code sizes in addition to runtimes, it makes sense to rename the column label to the more neutral term “ratio” instead of old “speedup”.
2018-09-13 22:00:52 +02:00
Pavol Vaskovic
a56c55c8e4 [benchmark] Round quantile idx to nearest or even
Explicitly use the round-half-to-even rounding algorithm to match the behavior of numpy's quantile(interpolation='nearest') and quantile estimation type R-3, SAS-2. See:
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
2018-09-10 10:45:00 +02:00
Pavol Vaskovic
0db20feda2 [benchmark] Fix index computation for quantiles
It turns out that both the old code in `DriverUtils` that computed the median and the newer quartiles in `PerformanceTestSamples` had an off-by-one error.

It truly is the 3rd of the 2 hard things in computer science!
2018-08-31 07:32:10 +02:00
Erik Eckstein
1f32935fc4 benchmarks: fix regexp for parsing code size results
Accept a '.' in the benchmark name, which is used for .o and .dylib files.
2018-08-27 17:07:40 -07:00
Pavol Vaskovic
049ffb34b0 [benchmark] Fix parsing formatted text
The test number column in the space-justified column format emitted by the Benchmark_Driver to stdout while logging to a file is right-aligned, so the parser must handle leading whitespace.
2018-08-23 12:31:00 +02:00
Pavol Vaskovic
0d64386b53 [benchmark] Documentation improvements
Improving compliance with
PEP 257 -- Docstring Conventions
https://www.python.org/dev/peps/pep-0257/
2018-08-23 11:45:43 +02:00
Pavol Vaskovic
076415f969 [benchmark] Strangler run_benchmarks
Replaced the guts of the `run_benchmarks` function with the implementation from `BenchmarkDriver`. There was only a single client that called it with `verbose=True`, so this parameter could be safely removed.

Function `instrument_test` is replaced by running `Benchmark_O` with the `--memory` option, which implements the MAX_RSS measurement while also excluding the overhead from the benchmarking infrastructure. The incorrect computation of standard deviation was simply dropped for measurements of more than one independent sample. Bogus aggregated `Totals` statistics were removed; we now report only the total number of executed benchmarks.
2018-08-17 08:40:39 +02:00
Pavol Vaskovic
e80165f316 [benchmark] Exclude only outliers from the top
Option to exclude outliers only from the top of the range, leaving in the outliers on the min side.
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
27cc77c590 [benchmark] Exclude outliers from samples
Introduce an algorithm for excluding outliers after collecting all samples, using the Interquartile Range rule.

The `exclude_outliers` method uses the 1st and 3rd Quartiles to compute the Interquartile Range, then uses inner fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR to remove samples outside these fences.

Based on experiments with collecting hundreds and thousands of samples (`num_samples`) per test with a low iteration count (`num_iters`) and ~1s runtime, this rule is very effective in providing a much better quality sample population, effectively removing short environmental fluctuations that were previously averaged into the overall result (by the adaptively determined `num_iters` to run for ~1s), enlarging the reported result with these measurement errors. This technique can be used for some benchmarks to get more stable results faster than before.

This outlier filtering is employed when parsing `--verbose` test results.
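
A sketch of the rule with both inner fences; the `top_only` flag mirrors the top-only option from the commit above (assuming Python 3.8's `statistics.quantiles`; the script's own quartile estimator may differ):
```
import statistics

def exclude_outliers(samples, top_only=False):
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    lo = float("-inf") if top_only else q1 - 1.5 * iqr
    hi = q3 + 1.5 * iqr
    return [s for s in samples if lo <= s <= hi]
```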
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
91077e3289 [benchmark] Introduced PerformanceTestSamples
* Moved the functionality to compute median, standard deviation and related statistics from `PerformanceTestResult` into `PerformanceTestSamples`.
* Fixed wrong unit in comments
2018-08-17 08:39:50 +02:00
Pavol Vaskovic
bea35cb7c1 [benchmark] LogParser measure environment
Measure more of the environment during the test.

In addition to measuring maximum resident set size, also extract the number of voluntary and involuntary context switches from the verbose mode.
2018-08-17 00:32:04 +02:00
Pavol Vaskovic
c60e223a3b [benchmark] LogParser: tab & space delimited logs
Added support for tab-delimited and formatted log output (space-aligned columns as output to the console by Benchmark_Driver).
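
An illustrative sketch of one splitter covering both layouts (the benchmark name and values here are just examples):
```
def split_columns(line):
    # Tab-delimited rows split on "\t"; space-aligned console rows
    # split on runs of whitespace, which also absorbs the leading
    # padding of right-aligned test numbers.
    return line.split("\t") if "\t" in line else line.split()

assert split_columns("12\tAngryPhonebook\t1337") == ["12", "AngryPhonebook", "1337"]
assert split_columns("  12  AngryPhonebook  1337") == ["12", "AngryPhonebook", "1337"]
```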
2018-08-17 00:32:04 +02:00