Previously we process every instruction every time the data flow re-iterates.
This is very inefficient.
In addition to moving to genset and killset, we also group function into
OneIterationFunction which we know that the data flow would converge in 1 iteration
and functions that requre the iterative data flow, mostly due to backedges in loops.
we process them differently.
I observed that there are ~93% of the functions that require just a single iteration
to perform the RLE.
But the other 7% accounts for 2321 (out of 6318) of the redundant loads we eliminated.
This change reduces RLE compilation from 4.1% to 2.7% of the entire compilation time
(frontend+OPT+LLVM) on stdlib with -O. This represents 6.9% of the time spent
in SILOptimizations (38.8%).
~2 weeks ago, RLE was taking 1.9% of the entire compilation time. It rose to 4.1%
mostly due to that we are now eliminating many more redundant loads (mostly thank
to Erik's integragtion of escape analysis in alias analysis). i.e. 3945 redundant
loads elimnated before Erik's change to 6318 redundant loads eliminated now.
of which a single post-order would be enough for DSE. In this case, we do not really
need to compute the genset and killset (which is a costly operation).
On stdlib, i see 93% of the functions are "OneIterationFunction".
With this change, i see the compilation time of DSE drops from 2.0% to 1.7% of the entire compilation.
This represents 4.3% of all the time spent in SILOptimizations (39.5%).
do not have over 64 locations which makes SmallBitVector a more suitable choice
than BitVector.
I see a ~10% drop in compilation time in DSE. i.e. 1430ms to 1270ms (2.2% to 2.0% of
overall compilation time).
the more compilation time DSE take.
I see a ~10% drop in DSE compilation time, from 2.4% to 2.2%.
(Last time (~1 week ago) i checked DSE was taking 1.4% of the compilation time. Now its taking 2.2%.
I will look into where the increase come from later).
This commit changes the Swift mangler from a utility that writes tokens into a
stream into a name-builder that has two phases: "building a name", and "ready".
This clear separation is needed for the implementation of the compression layer.
Users of the mangler can continue to build the name using the mangleXXX methods,
but to access the results the users of the mangler need to call the finalize()
method. This method can write the result into a stream, like before, or return
an std::string.
If a variable can not escape the function, we mark the store to it as dead before
the function exits.
It was disabled due to some TBAA and side-effect analysis changes.
We were removing 8 dead stores on the stdlib. With this change, we are now removing
203 dead stores.
I only see noise-level performance difference on PerfTestSuite.
I do not see real increase on compilation time either.
Don't allow this optimization to kick in for "inout" args.
The optimization may expose local writes to any aliases of the argument.
I can't prove that is memory safe.
Erik pointed out this case.
This requires a bit of code motion.
e.g.
1. %Tmp = alloc_stack
2. copy_addr %InArg to [initialization] %Tmp
3. DataAddr = init_enum_data_addr %OutArg
4. copy_addr %Tmp#1 to [initialization] %DataAddr
becomes
1. %Tmp = alloc_stack
4. DataAddr = init_enum_data_addr %OutArg
2. copy_addr %InArg to [initialization] %DataAddr
Fixes at least one regression resulting from '++' removal.
See rdar://23874709 [perf] -Onone Execution Time regression of up-to 19%
-Onone results
|.benchmark............|.bestbase.|.bestopt.|..delta.|.%delta.|speedup.|
|.StringWalk...........|....33570.|...20967.|.-12603.|.-37.5%.|..1.60x.|
|.OpenClose............|......446.|.....376.|....-70.|.-15.7%.|..1.19x.|
|.SmallPT..............|....98959.|...83964.|.-14995.|.-15.2%.|..1.18x.|
|.StrToInt.............|....17550.|...16377.|..-1173.|..-6.7%.|..1.07x.|
|.BenchLangCallingCFunc|......453.|.....428.|....-25.|..-5.5%.|..1.06x.|
|.CaptureProp..........|....50758.|...48156.|..-2602.|..-5.1%.|..1.05x.|
|.ProtocolDispatch.....|.....5276.|....5017.|...-259.|..-4.9%.|..1.05x.|
|.Join.................|.....1433.|....1372.|....-61.|..-4.3%.|..1.04x.|
AFAICT, this does not fix any existing bug, but eliminates unverified
assumptions about well-formed SIL, which could be broken by future
optimization.
Forward: The optimization will replace all in-scope uses of the
destination address with the source. With this change we will be sure
not eliminate writes into a destination address unless the destination
is an AllocStackInst. This hasn't been a problem in practice because
the optimization requires an in-scope deinit of the destination
address, which can't happen on typical address projections.
Backward: The optimization will replace in-scope uses of the source
with the destination. With this change we will be sure not to write
into the destination location prior to the copy unless the destination
is an AllocStackInst. This hasn't been a problem in practice because
the optimization requires the copy to be an initialization of the
address, which can't happen on typical address projections.
This change prevents both optimizations without an obvious guarantee
that any dependency on the destination address will manifest as a
SIL-level dependence on the address producer. For example,
init_enum_data_addr would not qualify because it simply projects an
address within a value that may have other dependencies.
Add back a stand-alone devirtualizer pass, running prior to generic
specialization. As with the stand-alone generic specializer pass, this
may add functions to the pass manager's work list.
This is another step in unbundling these passes from the performance
inliner.
This exposed the first interesting bug found by using TermKind, in DCE we were
not properly handling switch_enum_addr and checked_cast_addr_br.
SR-335
rdar://23980060
After the refactoring, RLE runs in the following phases:
Phase 1. we use an iterative data flow to compute whether there is an
available value at a given point, we do not yet care about what the value
is.
Phase 2. we compute the real forwardable value at a given point.
Phase 3. we setup the SILValues for the redundant load elimination.
Phase 4. we perform the redundant load elimination.
Previously we were computing available bit as well as what the available
value is every iteration of the data flow.
I do not see a compilation time improvement though, but this helps to move
to a genset and killset later as we only need to expand Phase 1 into a few smaller
phases to compute genset & killset first and then iterate until convergence for
the data flow.
I verified that we are performing same # of RLE on stdlib before the change.
Existing test ensure correctness.
i.e. multiple different values from predecessors
Previously, RLE is placing the SILArguments and branch edgevalues itself. This is probably
not as reliable/robust as using the SSAupdater.
RLE uses a single SSAupdater to create the SILArguments, this way previously created SILArguments
can be reused.
One test is created specifically for that so that we do not generate extraneous SILArguments.
RLE is an iterative data flow. Functions with too many locations may take a long time for the
data flow to converge.
Once we move to a genset and killset for RLE. we should be able to lessen the condition a bit more.
I have observed no difference in # of redundant loads eliminated on the stdlib (currently we
eliminate 3862 redundant loads).
i.e. multiple different values from predecessors
Previously, RLE is placing the SILArguments and branch edgevalues itself. This is probably
not as reliable/robust as using the SSAupdater.
RLE uses a single SSAupdater to create the SILArguments, this way previously created SILArguments
can be reused.
One test is created specifically for that so that we do not generate extraneous SILArguments.
1. Add some comments regarding how the pass builds and uses genset and killset
2. Merge some similar functions.
3. Rename DSEComputeKind to DSEKind.
4. Some other small comment changes.
Begin unbundling devirtualization, specialization, and inlining by
recreating the stand-alone generic specializer pass.
I've added a use of the pass to the pipeline, but this is almost
certainly not going to be the final location of where it runs. It's
primarily there to ensure this code gets exercised.
Since this is running prior to inlining, it changes the order that some
functions are specialized in, which means differences in the order of
output of one of the tests (one which similarly changed when
devirtualization, specialization, and inlining were bundled together).