Commit Graph

174 Commits

Author SHA1 Message Date
Slava Pestov
48eddac961 Benchmarks: Add support for async benchmarks 2025-08-27 10:37:10 -04:00
Slava Pestov
2ec19ecb46 Benchmarks: Skip long benchmarks in -Onone build 2025-08-27 10:37:10 -04:00
Max Desiatov
21a2b78801 stdlib/benchmark: add canImport(Musl) where needed (#67120)
This allows compiling stdlib and benchmarks when targeting musl instead of Glibc.
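For illustration, the usual conditional-import pattern this kind of change applies (a sketch; the exact import sites vary per file):
```
#if canImport(Glibc)
import Glibc
#elseif canImport(Musl)
import Musl
#else
import Darwin
#endif
```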
2023-07-05 19:55:08 +01:00
Tim Kientzle
b8e023ad53 Make the default output a little more like the old version (for now) 2022-11-04 18:07:12 -07:00
Tim Kientzle
08604eab40 Fix colliding fields; match old format more closely 2022-11-04 16:16:13 -07:00
Tim Kientzle
30b3763211 Fix underflow in the padding calculation 2022-11-04 14:02:03 -07:00
Tim Kientzle
971a5d8547 Overhaul Benchmarking pipeline to use complete sample data, not summaries
The Swift benchmarking harness now has two distinct output formats:

* Default: Formatted text that's intended for human consumption.
  Right now, this is just the minimum value, but we can augment that.

* `--json`: Each output line is a JSON-encoded object that contains raw data.
  This information is intended for use by Python scripts that aggregate
  or compare multiple independent tests.

Previously, we tried to use the same output for both purposes.  This required
the Python scripts to do more complex parsing of textual layouts, and also meant
that those scripts had only summary data to work with instead of full raw
sample information.  This in turn made it almost impossible to derive meaningful
comparisons between runs or to aggregate multiple runs.

Typical output in the new JSON format looks like this:
```
{"number":89, "name":"PerfTest", "samples":[1.23, 2.35], "max_rss":16384}
{"number":91, "name":"OtherTest", "samples":[14.8, 19.7]}
```

This format is easy to parse in Python.  Just iterate over
lines and decode each one separately. Also note that the
optional fields (`"max_rss"` above) are trivial to handle:
```
import json
import sys

for line in sys.stdin:
    j = json.loads(line)
    # Optional fields default to 0 when absent.
    max_rss = j.get("max_rss", 0)
```
Note the `"samples"` array includes the runtime for each individual run.

Because optional fields are so much easier to handle in this form, I reworked
the Python logic to translate old formats into this JSON format for more
uniformity.  Hopefully, we can simplify the code in a year or so by stripping
out the old log formats entirely, along with some of the redundant statistical
calculations.  In particular, the Python logic still makes an effort to preserve
mean, median, max, min, stdev, and other statistical data whenever the full set
of samples is not present.  Once we've gotten to a point where we're always
keeping full samples, we can compute any such information on the fly as needed,
eliminating the need to record it.

This is a pretty big rearchitecture of the core benchmarking logic. In order to
try to keep things a bit more manageable, I have not taken this opportunity to
replace any of the actual statistics used in the higher level code or to change
how the actual samples are measured. (But I expect this rearchitecture will make
such changes simpler.) In particular, this should not actually change any
benchmark results.

For the future, please keep this general principle in mind: Statistical
summaries (averages, medians, etc) should as a rule be computed for immediate
output and rarely if ever stored or used as input for other processing. Instead,
aim to store and transfer raw data from which statistics can be recomputed as
necessary.
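
As a hedged sketch of that principle (names hypothetical), summary statistics are cheap to recompute from the raw samples whenever they are needed:
```
// Recompute summaries on demand; persist only the raw `samples`.
func summary(of samples: [Double]) -> (min: Double, median: Double, mean: Double) {
    precondition(!samples.isEmpty)
    let sorted = samples.sorted()
    let mean = samples.reduce(0, +) / Double(samples.count)
    return (sorted.first!, sorted[sorted.count / 2], mean)
}
```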
2022-11-04 14:02:03 -07:00
Tim Kientzle
48c1931c78 Unbreak delta reporting in benchmarks (#61236)
The logic here was apparently intended to omit literal zeros from deltas
to save a few bytes, but it instead drops all zeros from all columns.
Remove the condition that drops zeros in order to avoid confusing
the many scripts that consume this data.

Alternatives Considered

I'm probably going to entirely drop the delta form in an upcoming
PR, so I didn't think it was worthwhile to do something more complex,
such as:

* Fixing this logic to only omit zeros from actual delta columns

* Rewriting all the client scripts to treat any empty column as zero
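
For illustration only, the shape of the bug (a hypothetical reconstruction, not the literal source): a cell formatter that blanks every zero value instead of only zero deltas.
```
// Buggy: applied to every column, so literal zeros vanish everywhere.
func cell(_ value: Int) -> String { value == 0 ? "" : String(value) }

// Fixed: always print the value; consuming scripts see no empty columns.
func fixedCell(_ value: Int) -> String { String(value) }
```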
2022-09-22 10:22:16 -07:00
Andrew Trick
f09cc8cc8b Fix compare_perf_tests.py for running locally.
The script defaulted to a mode that no one uses, without checking
whether the input was compatible with that mode.

This is the script used for run-to-run comparison of benchmark
results. The in-tree benchmarks happened to work with the script only
because of a fragile string comparison buried deep within the
script. Other out-of-tree benchmark scripts that generate results were
silently broken when using this script for comparison.
2022-05-12 16:50:32 -07:00
Karoy Lorentey
8304e6c0bf Merge pull request #39336 from lorentey/decapitate-benchmarks
[benchmark][NFC] Use Swift naming conventions
2021-09-20 17:16:35 -07:00
Karoy Lorentey
758c52bc2a [benchmark] Don't create array instance in modules with solitary benchmarks
It just produces unnecessary code-signing churn.
2021-09-16 18:54:14 -07:00
Karoy Lorentey
6cf798cd6d [benchmark] Trap if deterministic hashing isn't enabled 2021-09-16 16:57:06 -07:00
Karoy Lorentey
8944591e71 [benchmark] Simplify benchmark registration 2021-09-15 22:08:08 -07:00
Karoy Lorentey
8910b75cfe [benchmark] Stop capitalizing function and variable names 2021-09-15 22:08:07 -07:00
Ikko Ashimine
473e4af90a [benchmark] Fix typo in DriverUtils.swift
reseting -> resetting
2021-01-14 01:50:21 +09:00
Mao ZiJun
d1259cec50 eliminated "dangling pointer" warnings 2019-12-09 17:41:20 +09:00
Pavol Vaskovic
5571b83353 [benchmark] Driver: log measurement metadata
Added --meta option to log measurement metadata:

* PAGES – number of memory pages used
* ICS – number of involuntary context switches
* YIELD – number of voluntary yields

(Pages and ICS were previously available only in --verbose mode.)
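
On POSIX platforms these counters can be read via getrusage(2); a minimal sketch, assuming that is indeed their source here:
```
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Involuntary context switches for the current process, via getrusage(2).
func involuntaryContextSwitches() -> Int {
    var usage = rusage()
    getrusage(RUSAGE_SELF, &usage)
    return Int(usage.ru_nivcsw)
}
```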
2019-07-23 17:40:45 +02:00
Pavol Vaskovic
ec32140aed [benchmark] Run benchmarks using substring filters
Added support for running benchmarks using substring filters. Positional arguments prefixed with a single + or - sign are interpreted as benchmark name filters.

Executes all benchmarks whose names include any of the strings prefixed with a plus sign but none of the strings prefixed with a minus sign.
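
A hedged sketch of those semantics (not the driver's actual code):
```
// Include if the name matches any "+" filter (or none are given),
// then exclude if it matches any "-" filter.
func isSelected(_ name: String, filters: [String]) -> Bool {
    let included = filters.filter { $0.hasPrefix("+") }.map { String($0.dropFirst()) }
    let excluded = filters.filter { $0.hasPrefix("-") }.map { String($0.dropFirst()) }
    let matchesInclude = included.isEmpty || included.contains { name.contains($0) }
    let matchesExclude = excluded.contains { name.contains($0) }
    return matchesInclude && !matchesExclude
}
```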
2019-07-07 11:59:45 +02:00
Pavol Vaskovic
ad24ca4ba6 [benchmark] Add min-sample argument to drivers
Support for gathering a minimum number of samples per benchmark, using the optional `--min-samples` argument, which overrides the number of samples automatically computed from `sample-time` when that computed count is lower.
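
In effect (a sketch with hypothetical names):
```
// The user-requested minimum wins when the auto-computed count is lower.
let numSamples = max(minSamples ?? 1, samplesPerSampleTime)
```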
2019-07-07 10:13:26 +02:00
Pavol Vaskovic
5190db0acd [Gardening][benchmark] Import MSVCRT on Windows
Import functions from the standard C library on Windows.
2019-07-01 16:11:55 +02:00
Pavol Vaskovic
9d6f7ad160 [benchmark] Driver & Doctor: Lower the sample cap
Lowered the default sample cap from 2k to 200. (This doesn’t affect a manually specified `--num-samples` argument in the driver.)

Swift benchmarks have a pretty constant performance profile over time. It’s more beneficial to get multiple independent measurements faster than to take more samples from the same run.
2018-12-07 15:06:43 +01:00
Pavol Vaskovic
0bdd3ef275 [benchmark] Equalize memory usage (w&w/o verbose)
The use of the `--verbose` parameter was affecting the reported memory usage (`--memory`), because it front-loads the initialization of string interpolation and printing.

By always computing the configuration string and always calling print, both modes now front-load this constant overhead before the baseline is taken, so the reported memory usage no longer includes it.
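
A minimal sketch of the idea (the `verbose` flag and the message are stand-ins):
```
// Always interpolate the configuration string and always call print,
// so the one-time setup cost is paid in both modes before measuring.
let config = "NumSamples: \(numSamples), NumIters: \(numIters)"
print(verbose ? config : "", terminator: verbose ? "\n" : "")
```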
2018-11-28 21:34:12 +01:00
Pavol Vaskovic
a7f832fb57 [benchmark] Legacy factor
This adds an optional `legacyFactor` to the `BenchmarkInfo`, which allows for linear modification of constants that unnecessarily inflate the base workload of benchmarks, while maintaining the continuity of long-term benchmark tracking.

For example, if a benchmark uses `for _ in N*10_000` in its run function, we could lower this to `for _ in N*1_000` and add a `legacyFactor: 10` to its `BenchmarkInfo`.

Note that this doesn’t affect the real measurements gathered from the `--verbose` output. The `BenchmarkDoctor` has been slightly adjusted to work with these real samples, so `Benchmark_Driver check` will not flag these benchmarks for the slow run times reported in the summary, if their real runtimes fall into the recommended range.
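
A sketch of that example (benchmark name, run function, and tags hypothetical; `legacyFactor` is the real `BenchmarkInfo` field):
```
public let myBench = BenchmarkInfo(
  name: "MyBench",
  runFunction: runMyBench,
  tags: [.validation],
  legacyFactor: 10)  // keeps reported results comparable with old data

func runMyBench(_ n: Int) {
  // Was `for _ in 0 ..< n * 10_000`; workload lowered 10x.
  for _ in 0 ..< n * 1_000 {
    // ... actual work ...
  }
}
```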
2018-11-01 06:24:27 +01:00
eeckstein
cd920b69f4 Merge pull request #19910 from palimondo/fluctuation-of-the-pupil
[benchmark] More Robust Benchmark_Driver
2018-10-29 15:02:07 -07:00
Michael Gottesman
ba7815b663 [benchmark] Fix swiftpm based benchmark build on Linux. 2018-10-29 12:15:20 -07:00
Pavol Vaskovic
67b489dcb1 [benchmark] Auto-determine number of samples
When measuring with a specified number of iterations (generally, `--num-iters=1` makes sense), automatically determine the number of samples to take, so that the overall measurement duration comes close to `sample-time`.

This is the same technique used to scale `num-iters` before, but for `num-samples`.
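
A minimal sketch of the scaling (names hypothetical):
```
// Take as many samples as fit into the sample-time budget,
// based on the measured duration of one sample.
let oneSampleTime = measureOneSample(numIters: fixedNumIters)
let numSamples = max(Int(sampleTime / oneSampleTime), 1)
```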
2018-10-11 18:56:27 +02:00
Pavol Vaskovic
9d9200e9eb [benchmark] Measure setUpFunction
Measure the duration of the `setUpFunction` and report it in verbose mode.

This will be used by `BenchmarkDoctor` to ensure there isn’t an unreasonably big imbalance between the time it takes to set up a benchmark and the time it takes to run it.
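
A sketch of the measurement itself (helper names hypothetical):
```
// Time the setup separately from the benchmark's run function.
let setUpStart = currentTimeNanos()
test.setUpFunction?()
let setUpTime = currentTimeNanos() - setUpStart
if verbose { print("    Setup \(setUpTime / 1_000) μs") }
```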
2018-10-02 14:34:43 +02:00
Pavol Vaskovic
a9f0ce4338 [benchmark] Fix quantile estimation type
The correct quantile estimation type for printing all measurements in the summary report when `quantile == num-samples - 1` is R-1, SAS-3 (sketched below).  It's the inverse of the empirical distribution function.

References:
* https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
* discussion in https://github.com/apple/swift/pull/19097#issuecomment-421238197
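
A hedged Swift sketch of the R-1 estimator (inverse empirical CDF) over sorted samples:
```
// R-1 / SAS-3: Q(p) = x⌈n·p⌉ over n sorted samples (x₁ for p = 0).
func quantileR1(_ sortedSamples: [Double], _ p: Double) -> Double {
    let n = Double(sortedSamples.count)
    let rank = max(Int((n * p).rounded(.up)), 1)  // 1-based index ⌈n·p⌉
    return sortedSamples[min(rank, sortedSamples.count) - 1]
}
```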
2018-09-20 09:19:07 +02:00
Pavol Vaskovic
f0e7b8737a [benchmark] Round quantile idx to nearest or even
Explicitly use the round-half-to-even rounding algorithm to match the behavior of numpy's quantile(interpolation='nearest') and quantile estimate type R-3, SAS-2. See:
https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
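
Swift has this rounding rule built in; a one-line sketch, where `h` stands for the estimator's real-valued sample position:
```
// Round half to even ("banker's rounding"), as numpy's 'nearest' does.
let index = Int(h.rounded(.toNearestOrEven))
```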
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
8b3b1f695a [benchmark] Option: delta encoded quantiles format
Added a `--delta` argument to print the quantiles in a delta-encoded format that omits 0s.

This results in machine- and human-readable output that highlights modes and is easily digestible, giving you a feel for the underlying probability distribution of the samples in the reported results:

````
$ ./Benchmark_O --num-iters=1 --num-samples=20 --quantile=20 --delta 170 171 184 185 198 199 418 419 432 433 619 620
#,TEST,SAMPLES,MIN(μs),𝚫V1,𝚫V2,𝚫V3,𝚫V4,𝚫V5,𝚫V6,𝚫V7,𝚫V8,𝚫V9,𝚫VA,𝚫VB,𝚫VC,𝚫VD,𝚫VE,𝚫VF,𝚫VG,𝚫VH,𝚫VI,𝚫VJ,𝚫MAX
170,DropFirstArray,20,171,,,,,,,,,,,,,,,,,,,2,29
171,DropFirstArrayLazy,20,168,,,,,,,,,,,,,,,,,,,,8
184,DropLastArray,20,55,,,,,,,,,,,,,,,,,,,,26
185,DropLastArrayLazy,20,65,,,,,,,,,,,,,,,,,,,1,90
198,DropWhileArray,20,214,1,,,,,,,,,,,,,,,,,1,27,2
199,DropWhileArrayLazy,20,464,,,,1,,,,,,,,1,1,1,4,9,1,9,113,2903
418,PrefixArray,20,132,,,,,,,,,,,,,,,,,1,1,32,394
419,PrefixArrayLazy,20,168,,,,,,,,,,,,1,,2,9,1,15,8,88,3338
432,PrefixWhileArray,20,252,1,,,,1,,,,,,,,,,,1,,,,30
433,PrefixWhileArrayLazy,20,168,,,,,,,,,,,,,1,,6,6,14,43,28,10200
619,SuffixArray,20,68,,,,,,,,,,,,,1,,,,22,1,1,4
620,SuffixArrayLazy,20,65,,,,,,,,,,,,,,,,,,1,9,340
````
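
A hedged sketch of the encoding itself: report the first value, then only the differences between successive quantiles, leaving zero deltas as empty cells.
```
// Delta-encode sorted quantile values; blanks stand for zero deltas.
func deltaEncode(_ quantiles: [Int]) -> [String] {
    guard let first = quantiles.first else { return [] }
    let deltas = zip(quantiles.dropFirst(), quantiles).map { $0 - $1 }
    return [String(first)] + deltas.map { $0 == 0 ? "" : String($0) }
}
```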
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
72e960457b [benchmark] Gardening maxRSS as Int? 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
022e1111a9 [benchmark] Report quantiles from samples
The default benchmark result reports statistics of a normal distribution — mean and standard deviation. Unfortunately the samples from our benchmarks are *not normally distributed*. To get a better picture of the underlying probability distribution, this adds support for reporting quantiles.

See https://en.wikipedia.org/wiki/Quantile

This gives a better subsample of the measurements in the summary, without the need to resort to full verbose mode, which might be unnecessarily slow.
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
219a5d9290 [benchmark] Rename SampleRunner -> TestRunner
It is now running all the benchmarks, so it’s a TestRunner.
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
0e751e2717 [benchmark] Gardening: Even nicer microseconds 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
d704557c88 [benchmark] Gardening: Fixed method indentation 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
12c6e39a20 [benchmark] Refactor run runBenchmarks logVerbose
Extracted the nested func logVerbose as an instance method on SampleRunner.

Internalized the free functions `runBench` and `runBenchmarks` into SampleRunner as the methods `run` and `runBenchmarks`.
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
e7d1d482d8 [benchmark] Extract yield & add resetMeasurements 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
331c0bf772 [benchmark] Refactor numIters computation
The spaghetti if-else code was untangled into a nested function that computes `iterationsPerSampleTime` and a single constant `numIters` expression that takes care of the overflow capping as well as the choice between the fixed and computed `numIters` value.

The `numIters` is now computed and logged only once per benchmark measurement instead of on every sample.

The sampling loop is now just a single line. Hurrah!

Modified a test to verify that the `LogParser` maintains the `num-iters` derived from the `Measuring with scale` message across samples.
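
A hedged reconstruction of the result's shape (names follow the message; the capping constant is hypothetical):
```
// One calibration measurement determines how many iterations
// fit into the desired sample time.
func iterationsPerSampleTime() -> Int {
    let oneIterTime = measure(numIters: 1)
    return Int(sampleTime / oneIterTime)
}

// Single constant expression: the fixed value if given, otherwise the
// computed value, capped to guard against overflow in the run function.
let numIters = fixedNumIters ?? min(max(iterationsPerSampleTime(), 1), 10_000)
```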
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
29b2cc7397 [benchmark] Refactor sampling loop with addSample
Extracted sample saving into the inner func `addSample`.
Used it to save the `oneIter` sample from the `numIters` calibration when it comes out as 1, continuing the for loop to the next sample.

This simplified the following code, which can now always measure the sample with `numIters` and save it.
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
b762f80a64 [benchmark] Gardening: Documentation of numIters
Clarified the need for capping `numIters` according to the discussion at https://github.com/apple/swift/pull/17268#issuecomment-404831035

The sampling loop is a hairy piece of code, because it’s trying to reuse the calibration measurement as a regular sample, in case the computed `numIters` turns out to be 1. But this conflicts with the case when `fixedNumIters` is 1, necessitating a separate measurement in the else branch… That was a quick fix back then, but it’s hard to make it clean. More thinking is required…
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
75604a285d [benchmark] Gardening: Sensibly rename variables
To make sense of this spaghetti code, let’s first use reasonable variable names:
* scale -> numIters
* elapsed_time -> time
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
a169606e60 [benchmark] Gardening: DRYer verbose log 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
9ae69908b0 [benchmark] Refactor to currency type Int
Removed the unnecessary use of UInt64 where appropriate, following the advice from the Swift Language Guide:

> Use the `Int` type for all general-purpose integer constants and variables in your code, even if they’re known to be nonnegative. Using the default integer type in everyday situations means that integer constants and variables are immediately interoperable in your code and will match the inferred type for integer literal values.
https://docs.swift.org/swift-book/LanguageGuide/TheBasics.html#ID324
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
28eb79819b [benchmark] Refactor to report samples in μs
Moved the adjustment of `lastSampleTime` to account for the `scale` (`numIters`), and the conversion to microseconds, into SampleRunner’s `measure` method.
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
beabad86f4 [benchmark] Gardening: scale was always Int
Since the `scale` (or `numIters`) is passed to the `test.runFunction` as `Int`, the whole type-casting dance here was just silly!
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
e775b8fc60 [benchmark] Gardening: numSamples UInt vs Int
Type-check the command-line argument to be non-negative, but store the value in the currency type `Int`.
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
79d7730be8 [benchmark] Gardening: afterRunSleep is UInt32 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
7768cb3295 [benchmark] Move stats computation to BenchResults 2018-09-14 23:40:43 +02:00
Pavol Vaskovic
e48b5fdb34 [benchmark] Fix index computation for quantiles
Turns out that both the old code in `DriverUtils` that computed the median and the newer quartiles in `PerformanceTestSamples` had an off-by-one error.

It truly is the 3rd of the 2 hard things in computer science!
2018-09-14 23:40:43 +02:00
Pavol Vaskovic
ab3e6122c0 [benchmark] Refactor min max median computation
We can spare two array passes (for min and max) if we just sort first.
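
A sketch of the idea:
```
// One sort gives min, max, and median without extra passes.
let sorted = samples.sorted()
let (minimum, median, maximum) =
    (sorted.first!, sorted[sorted.count / 2], sorted.last!)
```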
2018-09-14 23:40:43 +02:00