The Swift benchmarking harness now has two distinct output formats:
* Default: formatted text intended for human consumption. Right now, this is
  just the minimum value, but we can augment that.
* `--json`: each output line is a JSON-encoded object that contains raw data.
  This information is intended for use by Python scripts that aggregate or
  compare multiple independent tests.
Previously, we tried to use the same output for both purposes. This required
the Python scripts to do complex parsing of textual layouts, and it also meant
that they had only summary data to work with instead of the full raw sample
information. This in turn made it almost impossible to derive meaningful
comparisons between runs or to aggregate multiple runs.
Typical output in the new JSON format looks like this:
```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```
This format is easy to parse in Python. Just iterate over
lines and decode each one separately. Also note that the
optional fields (`"max_rss"` above) are trivial to handle:
```
import json

for l in lines:  # `lines`: the JSON output lines from the harness
    j = json.loads(l)
    # Default to 0 if not present
    max_rss = j.get("max_rss", 0)
```
Note the `"samples"` array includes the runtime for each individual run.
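For example (a minimal sketch, not part of the harness), aggregating several
independent runs reduces to merging these arrays keyed by test name, assuming
each run wrote its `--json` output to a separate log file:
```
import json
from collections import defaultdict

def aggregate(log_paths):
    # Merge raw samples from several benchmark runs, keyed by test name.
    samples_by_test = defaultdict(list)
    for path in log_paths:
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                samples_by_test[record["name"]].extend(record["samples"])
    return samples_by_test
```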
Because optional fields are so much easier to handle in this form, I reworked
the Python logic to translate old formats into this JSON format for more
uniformity. Hopefully, we can simplify the code in a year or so by stripping
out the old log formats entirely, along with some of the redundant statistical
calculations. In particular, the Python logic still makes an effort to preserve
mean, median, max, min, stdev, and other statistical data whenever the full set
of samples is not present. Once we've gotten to a point where we're always
keeping full samples, we can compute any such information on the fly as needed,
eliminating the need to record it.
This is a pretty big rearchitecture of the core benchmarking logic. In order to
try to keep things a bit more manageable, I have not taken this opportunity to
replace any of the actual statistics used in the higher level code or to change
how the actual samples are measured. (But I expect this rearchitecture will make
such changes simpler.) In particular, this should not actually change any
benchmark results.
For the future, please keep this general principle in mind: Statistical
summaries (averages, medians, etc.) should, as a rule, be computed for immediate
output and rarely if ever stored or used as input for other processing. Instead,
aim to store and transfer raw data from which statistics can be recomputed as
necessary.
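As an illustration of that principle (a sketch only; the harness and scripts
may organize this differently), any of the summary statistics mentioned above
can be recomputed at output time from the raw `samples` array:
```
import statistics

def summarize(samples):
    # Derive summary statistics on demand from raw samples,
    # rather than storing precomputed values in the log.
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }
```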
The script defaulted to a mode that no one uses, without checking
whether the input was compatible with that mode.
This is the script used for run-to-run comparison of benchmark
results. The in-tree benchmarks happened to work with the script only
because of a fragile string comparison buried deep within the
script. Other out-of-tree benchmark scripts that generate results were
silently broken when using this script for comparison.
The `__future__` features we relied on are all the default behavior
[since Python 3.0](https://docs.python.org/3/library/__future__.html):
* absolute_import
* print_function
* unicode_literals
* division
These import statements are no-ops and are no longer necessary.
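Concretely, a header like the following can now simply be deleted (an
illustrative example; the exact combination of imports varies per script):
```
from __future__ import absolute_import, division, print_function, unicode_literals
```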
The run_smoke_bench script fails to report code size changes if you have a
trailing '/' in <old_build_dir> but not in <new_build_dir>.
This change appends a separator if it is missing.
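A minimal sketch of the idea (the helper name here is hypothetical, not the
actual code in run_smoke_bench):
```
import os

def ensure_trailing_sep(path):
    # Append a path separator if the caller did not supply one, so that
    # <old_build_dir> and <new_build_dir> are treated consistently.
    return path if path.endswith(os.sep) else path + os.sep
```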
The way we already gather numbers for this test is to perform two runs of
`Benchmark_O $TEST` with num-samples=2 and iters={2,3}. Under the assumption
that the only difference in counter numbers is caused by that extra iteration,
subtracting the iters=2 counts from the iters=3 counts gives us the number of
counts in a single iteration.
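In other words (a sketch only; the data structures here are made up for
illustration):
```
def counts_per_iteration(counts_iters2, counts_iters3):
    # Each argument maps a performance-counter name to its value for the
    # whole run; the difference between the iters=3 and iters=2 runs is
    # attributed to the single extra iteration.
    return {name: counts_iters3[name] - counts_iters2[name]
            for name in counts_iters3}
```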
In certain cases, I have found that a small subset of the benchmarks produces
weird output, and I haven't had the time to look into why. That being
said, I do know what these weird results look like, so in this commit we do some
extra validation work to see if we need to fail a test due to instability.
The specific validation is as follows:
1. We perform another run with num-samples=2, iter=5 and subtract the iter=3
counts from that. Under the assumption that overall work should increase
linearly with iteration count in our benchmarks, we check whether the counts
are actually 2x.
2. We check whether either `result[iter=3] - result[iter=2]` or
`result[iter=5] - result[iter=3]` is negative; none of the counters we gather
should ever decrease with iteration count.
Otherwise, one can get results that seem to imply more retain/release traffic
when, in reality, one was simply not tracking {retain,release}_n calls that, as
a result of better optimization, became plain retain/release calls.
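A minimal sketch of both checks, assuming the per-run counter results are
available as dictionaries (names and the tolerance are illustrative, not the
actual implementation):
```
def is_unstable(result_iter2, result_iter3, result_iter5, tolerance=0.05):
    # Each argument maps a counter name to its value for that run.
    for name in result_iter3:
        one_extra = result_iter3[name] - result_iter2[name]  # 1 extra iteration
        two_extra = result_iter5[name] - result_iter3[name]  # 2 extra iterations
        # Counters should never decrease as the iteration count grows.
        if one_extra < 0 or two_extra < 0:
            return True
        # Work should grow linearly, so two extra iterations should cost
        # roughly twice as much as one extra iteration.
        if abs(two_extra - 2 * one_extra) > tolerance * max(two_extra, 1):
            return True
    return False
```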