LSValue::reduce reduces a set of LSValues (mapped to a set of LSLocations) to
a single LSValue.
It can then be used as the forwarding value for the location.
Previously, we expand into intermediate nodes and leaf nodes and then go bottom
up, trying to create a single LSValue out of the given LSValues.
Instead, we now use a recursion to go top down. This simplifies the code. And this
is fine as we do not expect to run into type tree that are too deep.
Existing test cases ensure correctness.
For forwarding on allocstacks, we can invalidate the forwable bit when we
hit the deallocate stack.
This helps compilation time as we do not need to propagate these bits down
to subsequent basic blocks.
For a release on a guaranteed function paramater, we know right away
that its not the final release and therefore does not call deinit.
Therefore we know it does not read or write memory other than the reference
count.
This reduces the compilation time of dead store and redundant load elim. As
we need to go over alias analysis to make sure tracked locations do not alias
with it.
After collected enough information in the first iteration of the
data flow. We do not do second iteration (last iteration) for blocks
without loads as we will not forward any load there.
This improves compilation time of redundant load elimination.
we do not need to place bogus value in the unreachable blocks in case a SILArgument needs to be
constrcuted for this block's successors.
This relies on simplifycfg or other passes to clean up the CFG before RLE is ran.
isReachable logic is incorrect. This make RLE too conservative in some cases and incorrect in
others .
This fixed ASAN build break caused by commit 925eb2e0d9
I see more redundant loads elim'ed, but I do not see a performance difference with this change.
This patch also implements some of the missing functions used by RLE and DSE in new projection
that exist in the old projection.
New projection provides better memory usage, eventually we will phase out the old projection code.
New projection is now copyable, i.e. we have a proper constructor for it. This helps make the code
more readable.
We do see a bit increase in compilation time in compiling stdlib -O, this is a result of the way
we now get types of a projection path, but I expect this to go down (away) with further improvement
on how memory locations are constructed and cached with later patches.
=== With the OLD Projection. ===
Total amount of memory allocated.
--------------------------------
Bytes Used Count Symbol Name
13032.01 MB 50.6% 2158819 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
2879.70 MB 11.1% 3076018 (anonymous namespace)::ARCSequenceOpts::run()
2663.68 MB 10.3% 1375465 (anonymous namespace)::RedundantLoadElimination::run()
1534.35 MB 5.9% 5067928 (anonymous namespace)::SimplifyCFGPass::run()
1278.09 MB 4.9% 576714 (anonymous namespace)::SILCombine::run()
1052.68 MB 4.0% 935809 (anonymous namespace)::DeadStoreElimination::run()
771.75 MB 2.9% 1677391 (anonymous namespace)::SILCSE::run()
715.07 MB 2.7% 4198193 (anonymous namespace)::GenericSpecializer::run()
434.87 MB 1.6% 652701 (anonymous namespace)::SILSROA::run()
402.99 MB 1.5% 658563 (anonymous namespace)::SILCodeMotion::run()
341.13 MB 1.3% 962459 (anonymous namespace)::DCE::run()
279.48 MB 1.0% 415031 (anonymous namespace)::StackPromotion::run()
Compilation time breakdown.
--------------------------
Running Time Self (ms) Symbol Name
25716.0ms 35.8% 0.0 swift::runSILOptimizationPasses(swift::SILModule&)
25513.0ms 35.5% 0.0 swift::SILPassManager::runOneIteration()
20666.0ms 28.8% 24.0 swift::SILPassManager::runFunctionPasses(llvm::ArrayRef<swift::SILFunctionTransform*>)
19664.0ms 27.4% 77.0 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
3272.0ms 4.5% 12.0 (anonymous namespace)::SimplifyCFGPass::run()
3266.0ms 4.5% 7.0 (anonymous namespace)::ARCSequenceOpts::run()
2608.0ms 3.6% 5.0 (anonymous namespace)::SILCombine::run()
2089.0ms 2.9% 104.0 (anonymous namespace)::SILCSE::run()
1929.0ms 2.7% 47.0 (anonymous namespace)::RedundantLoadElimination::run()
1280.0ms 1.7% 14.0 (anonymous namespace)::GenericSpecializer::run()
1010.0ms 1.4% 45.0 (anonymous namespace)::DeadStoreElimination::run()
966.0ms 1.3% 191.0 (anonymous namespace)::DCE::run()
496.0ms 0.6% 6.0 (anonymous namespace)::SILCodeMotion::run()
=== With the NEW Projection. ===
Total amount of memory allocated.
--------------------------------
Bytes Used Count Symbol Name
11876.64 MB 48.4% 22112349 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
2887.22 MB 11.8% 3079485 (anonymous namespace)::ARCSequenceOpts::run()
1820.89 MB 7.4% 1877674 (anonymous namespace)::RedundantLoadElimination::run()
1533.16 MB 6.2% 5073310 (anonymous namespace)::SimplifyCFGPass::run()
1282.86 MB 5.2% 577024 (anonymous namespace)::SILCombine::run()
772.21 MB 3.1% 1679154 (anonymous namespace)::SILCSE::run()
721.69 MB 2.9% 936958 (anonymous namespace)::DeadStoreElimination::run()
715.08 MB 2.9% 4196263 (anonymous namespace)::GenericSpecializer::run()
Compilation time breakdown.
--------------------------
Running Time Self (ms) Symbol Name
25137.0ms 37.3% 0.0 swift::runSILOptimizationPasses(swift::SILModule&)
24939.0ms 37.0% 0.0 swift::SILPassManager::runOneIteration()
20226.0ms 30.0% 29.0 swift::SILPassManager::runFunctionPasses(llvm::ArrayRef<swift::SILFunctionTransform*>)
19241.0ms 28.5% 83.0 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
3214.0ms 4.7% 10.0 (anonymous namespace)::SimplifyCFGPass::run()
3005.0ms 4.4% 14.0 (anonymous namespace)::ARCSequenceOpts::run()
2438.0ms 3.6% 7.0 (anonymous namespace)::SILCombine::run()
2217.0ms 3.2% 54.0 (anonymous namespace)::RedundantLoadElimination::run()
2212.0ms 3.2% 131.0 (anonymous namespace)::SILCSE::run()
1195.0ms 1.7% 11.0 (anonymous namespace)::GenericSpecializer::run()
1168.0ms 1.7% 39.0 (anonymous namespace)::DeadStoreElimination::run()
853.0ms 1.2% 150.0 (anonymous namespace)::DCE::run()
499.0ms 0.7% 7.0 (anonymous namespace)::SILCodeMotion::run()
enumerate them lazily.
This leads to compilation time improvement, as some of the LSValues previously
enumerated do not be created in this approach.
i.e. we enumerate LSValues created by loads previously, but the LoadInsts could be
target for RLE. In such case, the enumerated LSValues are not used.
Compilation time improvement: 1775ms to 1721ms (2.7% to 2.6% of the entire
compilation time for stdlib -O).
Existing tests ensure correctness.
Note: we still enumerate locations, as we need to know how many locations there are
in the function to resize the bitvector appropriately before the data flow runs.
1. Update some comments.
2. Rename a few functions, e.g. runIterativeDF -> runIterativeRLE, getLSValueBit -> getValueBit.
3. Remove unused headers.
4. Remove no-longer used function, mergePredecessorStates and mergePredecessorState.
5. A few other small NFCs.
Previously we process every instruction every time the data flow re-iterates.
This is very inefficient.
In addition to moving to genset and killset, we also group function into
OneIterationFunction which we know that the data flow would converge in 1 iteration
and functions that requre the iterative data flow, mostly due to backedges in loops.
we process them differently.
I observed that there are ~93% of the functions that require just a single iteration
to perform the RLE.
But the other 7% accounts for 2321 (out of 6318) of the redundant loads we eliminated.
This change reduces RLE compilation from 4.1% to 2.7% of the entire compilation time
(frontend+OPT+LLVM) on stdlib with -O. This represents 6.9% of the time spent
in SILOptimizations (38.8%).
~2 weeks ago, RLE was taking 1.9% of the entire compilation time. It rose to 4.1%
mostly due to that we are now eliminating many more redundant loads (mostly thank
to Erik's integragtion of escape analysis in alias analysis). i.e. 3945 redundant
loads elimnated before Erik's change to 6318 redundant loads eliminated now.
After the refactoring, RLE runs in the following phases:
Phase 1. we use an iterative data flow to compute whether there is an
available value at a given point, we do not yet care about what the value
is.
Phase 2. we compute the real forwardable value at a given point.
Phase 3. we setup the SILValues for the redundant load elimination.
Phase 4. we perform the redundant load elimination.
Previously we were computing available bit as well as what the available
value is every iteration of the data flow.
I do not see a compilation time improvement though, but this helps to move
to a genset and killset later as we only need to expand Phase 1 into a few smaller
phases to compute genset & killset first and then iterate until convergence for
the data flow.
I verified that we are performing same # of RLE on stdlib before the change.
Existing test ensure correctness.
i.e. multiple different values from predecessors
Previously, RLE is placing the SILArguments and branch edgevalues itself. This is probably
not as reliable/robust as using the SSAupdater.
RLE uses a single SSAupdater to create the SILArguments, this way previously created SILArguments
can be reused.
One test is created specifically for that so that we do not generate extraneous SILArguments.
RLE is an iterative data flow. Functions with too many locations may take a long time for the
data flow to converge.
Once we move to a genset and killset for RLE. we should be able to lessen the condition a bit more.
I have observed no difference in # of redundant loads eliminated on the stdlib (currently we
eliminate 3862 redundant loads).
i.e. multiple different values from predecessors
Previously, RLE is placing the SILArguments and branch edgevalues itself. This is probably
not as reliable/robust as using the SSAupdater.
RLE uses a single SSAupdater to create the SILArguments, this way previously created SILArguments
can be reused.
One test is created specifically for that so that we do not generate extraneous SILArguments.
(libraries now)
It has been generally agreed that we need to do this reorg, and now
seems like the perfect time. Some major pass reorganization is in the
works.
This does not have to be the final word on the matter. The consensus
among those working on the code is that it's much better than what we
had and a better starting point for future bike shedding.
Note that the previous organization was designed to allow separate
analysis and optimization libraries. It turns out this is an
artificial distinction and not an important goal.