Commit Graph

287 Commits

Author SHA1 Message Date
Egor Zhdan
f931de8948 [benchmark] Do not abort with TypeError if no memory measurements were taken 2023-01-23 13:34:17 +00:00
Tim Kientzle
961a38b636 Test the non-JSON output
We have to continue using the non-JSON forms
until the JSON-supporting code is universally
available.
2022-11-07 14:46:13 -08:00
Tim Kientzle
dfe8284462 pylint fixes 2022-11-07 14:45:59 -08:00
Tim Kientzle
168130741b Pylint fixes 2022-11-07 14:45:37 -08:00
Tim Kientzle
b0ce365b53 A better way to adapt to -num-samples 2022-11-05 16:05:26 -07:00
Tim Kientzle
c3a727486f Make --num-samples actually work 2022-11-05 14:18:41 -07:00
Tim Kientzle
5c14017bba For size comparisons, build the result objects directly with sample data 2022-11-05 14:17:34 -07:00
Tim Kientzle
e1ab70a4b0 Use results consistently 2022-11-05 13:29:52 -07:00
Tim Kientzle
a63adc9114 Use non-json format for now until we have switched over completely 2022-11-04 16:17:57 -07:00
Tim Kientzle
2a3e68a1f8 Match new benchmark driver default output 2022-11-04 16:16:37 -07:00
Tim Kientzle
40eaaac0b1 Do not use --json for listing tests (yet) 2022-11-04 14:02:03 -07:00
Tim Kientzle
998475bf80 Pylint cleanup, more comments 2022-11-04 14:02:03 -07:00
Tim Kientzle
b4fa3833d8 Comment some TODO items 2022-11-04 14:02:03 -07:00
Tim Kientzle
071e9f1c7e Python style fixes 2022-11-04 14:02:03 -07:00
Tim Kientzle
520fd79efd Fix some test failures
The new code stores test numbers as numbers (not strings), which
requires a few adjustments. I also apparently missed a few test updates.
2022-11-04 14:02:03 -07:00
Tim Kientzle
971a5d8547 Overhaul Benchmarking pipeline to use complete sample data, not summaries
The Swift benchmarking harness now has two distinct output formats:

* Default: Formatted text that's intended for human consumption.
  Right now, this is just the minimum value, but we can augment that.

* `--json`: Each output line is a JSON-encoded object that contains raw data.
  This information is intended for use by python scripts that aggregate
  or compare multiple independent tests.

Previously, we tried to use the same output for both purposes.  This required
the python scripts to do more complex parsing of textual layouts, and also meant
that the python scripts had only summary data to work with instead of full raw
sample information.  This in turn made it almost impossible to derive meaningful
comparisons between runs or to aggregate multiple runs.

Typical output in the new JSON format looks like this:
```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```

This format is easy to parse in Python.  Just iterate over
lines and decode each one separately. Also note that the
optional fields (`"max_rss"` above) are trivial to handle:
```
import json

for line in lines:
    record = json.loads(line)
    # Default to 0 if "max_rss" is not present
    max_rss = record.get("max_rss", 0)
```
Note the `"samples"` array includes the runtime for each individual run.

Because optional fields are so much easier to handle in this form, I reworked
the Python logic to translate old formats into this JSON format for more
uniformity.  Hopefully, we can simplify the code in a year or so by stripping
out the old log formats entirely, along with some of the redundant statistical
calculations.  In particular, the python logic still makes an effort to preserve
mean, median, max, min, stdev, and other statistical data whenever the full set
of samples is not present.  Once we've gotten to a point where we're always
keeping full samples, we can compute any such information on the fly as needed,
eliminating the need to record it.
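
As a rough sketch (not the driver's actual code), computing summary statistics on
the fly from the raw `samples` array could look like the following, assuming one
JSON record per line in a hypothetical `results.jsonl` file:
```
import json
import statistics

# Sketch only: the file name is hypothetical; the record layout follows the
# JSON example above (one object per line, with a "samples" array).
with open("results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        samples = record["samples"]
        print(record["name"],
              "min:", min(samples),
              "median:", statistics.median(samples),
              "mean:", statistics.mean(samples))
```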

This is a pretty big rearchitecture of the core benchmarking logic. In order to
try to keep things a bit more manageable, I have not taken this opportunity to
replace any of the actual statistics used in the higher level code or to change
how the actual samples are measured. (But I expect this rearchitecture will make
such changes simpler.) In particular, this should not actually change any
benchmark results.

For the future, please keep this general principle in mind: Statistical
summaries (averages, medians, etc) should as a rule be computed for immediate
output and rarely if ever stored or used as input for other processing. Instead,
aim to store and transfer raw data from which statistics can be recomputed as
necessary.
2022-11-04 14:02:03 -07:00
Alex Lorenz
cd634444c5 [benchmark] markdown report handler - write encoded message to byte buffer 2022-11-03 11:55:27 -07:00
Boris Bügling
f71eb8e2cb Remove use of deprecated option 2022-10-13 22:47:19 -07:00
YOCKOW
c1e154a9cb [Gardening] Remove trailing whitespaces in Python scripts. (W291)
That has been marked as 'FIXME' for three years.
This commit fixes it.
2022-08-25 16:08:36 +09:00
Andrew Trick
f09cc8cc8b Fix compare_perf_tests.py for running locally.
The script defaulted to a mode that no one uses, without checking
whether the input was compatible with that mode.

This is the script used for run-to-run comparison of benchmark
results. The in-tree benchmarks happened to work with the script only
because of a fragile string comparison buried deep within the
script. Other out-of-tree benchmark scripts that generate results were
silently broken when using this script for comparison.
2022-05-12 16:50:32 -07:00
Josh Soref
fa3ff899a9 Spelling benchmark (#42457)
* spelling: approximate
* spelling: available
* spelling: benchmarks
* spelling: between
* spelling: calculation
* spelling: characterization
* spelling: coefficient
* spelling: computation
* spelling: deterministic
* spelling: divisor
* spelling: encounter
* spelling: expected
* spelling: fibonacci
* spelling: fulfill
* spelling: implements
* spelling: into
* spelling: intrinsic
* spelling: markdown
* spelling: measure
* spelling: occurrences
* spelling: omitted
* spelling: partition
* spelling: performance
* spelling: practice
* spelling: preemptive
* spelling: repeated
* spelling: requirements
* spelling: requires
* spelling: response
* spelling: supports
* spelling: unknown
* spelling: utilities
* spelling: verbose

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
Co-authored-by: Josh Soref <jsoref@users.noreply.github.com>
2022-04-25 09:02:06 -07:00
Erik Eckstein
fb65284995 benchmarks: fix run_smoke_bench after upgrading to python3
Need to decode result of `subprocess.check_output`
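
A minimal sketch of the kind of fix described (the command shown is illustrative,
not the script's actual invocation):
```
import subprocess

# On Python 3, check_output returns bytes, so decode before using it as text.
# The command here is illustrative only.
output = subprocess.check_output(["swift", "--version"]).decode("utf-8")
print(output.splitlines()[0])
```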
2022-04-19 13:59:55 +02:00
Daniel Duan
3dfc40898c [NFC] Remove Python 2 imports from __future__ (#42086)
The `__future__` features we relied on are all included
[since Python 3.0](https://docs.python.org/3/library/__future__.html):

* absolute_import
* print_function
* unicode_literals
* division

These import statements are no-ops and are no longer necessary.
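
For reference, the removed boilerplate was of this form (a representative line,
not the exact diff):
```
# No-op on Python 3: all of these behaviors are already the default.
from __future__ import absolute_import, division, print_function, unicode_literals
```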
2022-04-13 14:01:30 -07:00
Daniel Duan
06a04624a6 [benchmark] Remove Python 2 logic (#42048)
Found a few pieces of Python 2 code. Remove them since we are on Python
3 entirely.
2022-03-27 15:02:58 -07:00
swift-ci
32a967f1ea Merge pull request #39171 from eltociear/patch-22 2022-01-13 07:01:02 -08:00
Evan Wilde
6956b7c5c9 Replace /usr/bin/python with /usr/bin/env python
/usr/bin/python doesn't exist on Ubuntu 20.04, causing tests to fail.
I've updated the shebangs everywhere to use `/usr/bin/env python`
instead.
2021-09-28 10:05:05 -07:00
Karoy Lorentey
8304e6c0bf Merge pull request #39336 from lorentey/decapitate-benchmarks
[benchmark][NFC] Use Swift naming conventions
2021-09-20 17:16:35 -07:00
Karoy Lorentey
2fbf391b57 [benchmark] Benchmark_Driver: Correctly set SWIFT_DETERMINISTIC_HASHING 2021-09-16 16:57:35 -07:00
Karoy Lorentey
8910b75cfe [benchmark] Stop capitalizing function and variable names 2021-09-15 22:08:07 -07:00
Ikko Ashimine
c48f6e09bb [benchmark] Fix typo in compare_perf_tests.py
formating -> formatting
2021-09-04 09:10:34 +09:00
Guillaume Lessard
715b3fa7d0 [gardening] update copyright year in the benchmark template 2021-07-22 16:39:15 -06:00
Erik Eckstein
b5f1e265e0 benchmarks: disable the flaky test_log_file BenchmarkDriver test
I'm not sure if it makes sense to keep this test around at all.
For now I just disabled it.

rdar://79701124
2021-06-28 13:13:02 +02:00
Mishal Shah
ddabee30e2 Add arch info to benchmark report 2021-06-01 09:59:20 -07:00
Erik Eckstein
abcae7bfa1 benchmarks: fix smoke test run by setting the dynamic library path
This is a workaround for rdar://78584073
2021-05-31 15:10:08 +02:00
Mishal Shah
40024718ac Update doc and links to support new main branch 2020-09-22 23:53:29 -07:00
Michael Gottesman
0591fa0d6b [leaks-checker] Add verbose flag to dump out raw output from runtime to help debug failures on bots.
Just a quick hack to ease debugging on the bots.
2020-09-21 09:59:32 -05:00
Xiaodi Wu
514dce144f Update copyright year on benchmark and its template 2020-09-04 15:02:38 -04:00
tbkka
ab861d5890 Pass architecture into Benchmark_Driver to fix build-script -B (#33100)
* Pass architecture into Benchmark_Driver to fix `build-script -B`

* "Benchmark_Driver compare" does not need the architecture
2020-07-25 11:15:49 -07:00
tbkka
3181dd1e4c Fix a bunch of python lint errors (#32951)
* Fix a bunch of python lint errors

* adjust indentation
2020-07-17 14:30:21 -07:00
Erik Eckstein
2387732ab5 benchmarks: support new executable file names in perf_test_driver
rdar://problem/65508278
2020-07-16 15:43:37 +02:00
Erik Eckstein
a46cda8c51 benchmarks: fix run_smoke_bench to support new benchmark executable naming scheme
Find the right benchmark executable with a glob pattern.
Also, add an option "-arch" to select between executables for different architectures.
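
A rough sketch of the approach, with the directory layout and the executable
naming pattern as assumptions rather than the script's exact code:
```
import glob
import os

# Sketch only: pick up architecture-specific benchmark executables via a glob;
# the "Benchmark_O-*" naming pattern and bin_dir layout are illustrative.
def find_benchmark_binaries(bin_dir, arch=None):
    pattern = "Benchmark_O-{}*".format(arch) if arch else "Benchmark_O*"
    return sorted(glob.glob(os.path.join(bin_dir, pattern)))
```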
2020-07-07 11:01:49 +02:00
Meghana Gupta
911ac8e45e Fix code size reporting when input directory is missing a trailing '/'
The run_smoke_bench script fails to report code size changes if you have a
trailing '/' in <old_build_dir> but not in <new_build_dir>.

This change appends the separator if it is missing.
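
A minimal sketch of the described fix (names are illustrative):
```
import os

# Sketch: normalize a build-directory argument so code-size paths are built
# consistently regardless of whether the user passed a trailing separator.
def ensure_trailing_sep(path):
    return path if path.endswith(os.sep) else path + os.sep
```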
2020-05-14 13:49:35 -07:00
Sergej Jaskiewicz
cce9e81f0b Support Python 3 in the benchmark suite 2020-02-28 01:45:35 +03:00
Ross Bayer
b1961745e0 [Python: black] Reformatted the benchmark Python sources using utils/python_format.py. 2020-02-08 15:32:44 -08:00
Michael Gottesman
2840a7609d When gathering counters, check for instability and FAIL otherwise.
The way we already gather numbers for this test is that we run two runs of
`Benchmark_O $TEST` with num-samples=2, iters={2,3}. Under the assumption that
the only difference in counter numbers can be caused by that extra iteration,
subtracting the group of counts for 2,3 gives us the number of counts in that
iteration.

In certain cases, I have found that a small subset of the benchmarks produces
weird output, and I haven't had the time to look into why. That being said, I do
know what these weird results look like, so in this commit we do some extra
validation work to decide whether a test needs to fail due to instability.

The specific validation (sketched in the code below) is:

1. We perform another run with num-samples=2, iter=5 and subtract the iter=3
counts from it. Under the assumption that overall work in our benchmarks should
increase linearly with iteration count, we check whether those counts are
actually 2x the difference between the iter=3 and iter=2 counts.

2. We fail if either `result[iter=3] - result[iter=2]` or `result[iter=5] -
result[iter=3]` is negative. None of the counters we gather should ever decrease
with iteration count.
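
A minimal sketch of that stability check; the counter grouping, names, and
tolerance are illustrative assumptions, not the script's exact values:
```
# Sketch only: counts_iterN maps counter names to values measured with N iters.
def is_stable(counts_iter2, counts_iter3, counts_iter5, tolerance=0.1):
    for key in counts_iter2:
        delta_one = counts_iter3[key] - counts_iter2[key]  # one extra iteration
        delta_two = counts_iter5[key] - counts_iter3[key]  # two extra iterations
        if delta_one < 0 or delta_two < 0:
            return False  # counters should never decrease with iteration count
        if delta_one and abs(delta_two / delta_one - 2.0) > tolerance:
            return False  # work should grow roughly linearly (about 2x here)
    return True
```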
2020-01-15 14:41:21 -08:00
Michael Gottesman
461f17e5b7 Change -csv flag to be --emit-csv. 2020-01-15 14:41:21 -08:00
Michael Gottesman
35aa0405d1 Pattern match test names, not numbers, to capture test names from Benchmark_O --list
This makes the output of the test more readable.
2020-01-15 14:41:21 -08:00
Michael Gottesman
676411f0b0 Have dtrace aggregate rr opts and start tracking {retain,release}_n.
Otherwise, one can get results that seem to imply more rr traffic when, in
reality, the difference comes from previously untracked {retain,release}_n calls
that, as a result of better optimization, became simple retain/release calls.
2020-01-15 14:39:55 -08:00
Michael Gottesman
6fff30c122 [benchmark-dtrace] Enabling multiprocessing option to speed up gathering data. 2020-01-08 16:06:56 -08:00
Michael Gottesman
c7c2e6e17b [benchmark-dtrace] Fix the number of samples taken alongside the number of iters.
Otherwise, the output is not stable.
2020-01-08 16:06:56 -08:00