LSValue::reduce reduces a set of LSValues (mapped to a set of LSLocations) to
a single LSValue.
It can then be used as the forwarding value for the location.
Previously, we expanded the location into intermediate nodes and leaf nodes and then went
bottom-up, trying to create a single LSValue out of the given LSValues.
Instead, we now use recursion to go top-down. This simplifies the code, and it
is fine because we do not expect to run into type trees that are too deep.
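Below is a minimal sketch of the top-down recursion, using hypothetical
TypeNode/Value/LeafValues stand-ins rather than the real LSValue/SILType API:
each non-leaf node aggregates the reduced values of its fields, and the
reduction fails as soon as some leaf has no available value.

  #include <cstddef>
  #include <map>
  #include <optional>
  #include <string>
  #include <vector>

  struct TypeNode {
    std::string name;
    std::vector<TypeNode> fields;                  // empty => leaf
    bool isLeaf() const { return fields.empty(); }
  };

  using Value = std::string;                       // stand-in for an LSValue
  using LeafValues = std::map<std::string, Value>; // leaf path -> available value

  // Returns the single value covering `node`, or nullopt if some leaf under it
  // has no available value. The recursion goes top-down, so the intermediate
  // nodes of the type tree are never materialized up front.
  std::optional<Value> reduce(const TypeNode &node, const std::string &path,
                              const LeafValues &leaves) {
    if (node.isLeaf()) {
      auto it = leaves.find(path);
      if (it == leaves.end())
        return std::nullopt;                       // no value covering this leaf
      return it->second;
    }
    // Aggregate the reduced field values into a single forwarding value.
    std::string agg = node.name + "(";
    for (std::size_t i = 0; i < node.fields.size(); ++i) {
      auto sub = reduce(node.fields[i], path + "." + node.fields[i].name, leaves);
      if (!sub)
        return std::nullopt;                       // some leaf is not available
      agg += (i ? ", " : "") + *sub;
    }
    return agg + ")";
  }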
Existing test cases ensure correctness.
For forwarding on alloc_stacks, we can invalidate the forwardable bit when we
hit the dealloc_stack.
This helps compilation time, as we do not need to propagate these bits down
to subsequent basic blocks.
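As a hedged sketch (hypothetical instruction and bit types, not the real RLE
data structures), the idea is simply to clear the bit locally when the
dealloc_stack is seen, so it never has to flow into the successor blocks:

  #include <vector>

  enum class Kind { Store, DeallocStack, Other };

  struct MockInst {
    Kind kind;
    int loc;  // index of the tracked location this touches, -1 if none
  };

  // One bit per tracked location; true means a forwardable value is available.
  void processBlock(const std::vector<MockInst> &block,
                    std::vector<bool> &forwardable) {
    for (const MockInst &inst : block) {
      switch (inst.kind) {
      case Kind::Store:
        forwardable[inst.loc] = true;   // the stored value can be forwarded
        break;
      case Kind::DeallocStack:
        forwardable[inst.loc] = false;  // lifetime over: stop tracking here
        break;
      default:
        break;
      }
    }
    // `forwardable` is what flows into the successors; dead alloc_stack
    // locations are already cleared, so successors never see them.
  }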
For a release on a guaranteed function parameter, we know right away
that it is not the final release and therefore does not call the deinit.
Therefore we know it does not read or write memory other than the reference
count.
This reduces the compilation time of dead store and redundant load elimination,
as we would otherwise need to go through alias analysis to make sure tracked
locations do not alias with it.
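A hedged sketch of the check with stand-in types (not the real SIL classes):
if the released value is a function argument with a guaranteed convention, the
release cannot be the final one, so it only touches the reference count and
the alias-analysis queries against tracked locations can be skipped.

  enum class Convention { Owned, Guaranteed, Unowned };

  struct MockValue {
    bool isFunctionArgument = false;
    Convention conv = Convention::Owned;
  };

  struct MockRelease {
    const MockValue *operand;
  };

  // When this returns true, DSE/RLE do not need to query alias analysis for
  // this release against the tracked locations.
  bool onlyTouchesRefCount(const MockRelease &release) {
    const MockValue *v = release.operand;
    return v->isFunctionArgument && v->conv == Convention::Guaranteed;
  }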
This patch also implements, in the new projection, some of the missing functions
used by RLE and DSE that exist in the old projection.
The new projection provides better memory usage; eventually we will phase out the
old projection code.
The new projection is now copyable, i.e. we have a proper constructor for it. This
helps make the code more readable.
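For illustration only, a hypothetical copyable projection-path-like type (not
the actual Projection/ProjectionPath classes); plain value semantics are what
make the call sites easier to read:

  #include <cstdint>
  #include <vector>

  struct MockProjectionPath {
    std::vector<uint32_t> indices;  // field/element indices from base to leaf

    void append(uint32_t idx) { indices.push_back(idx); }

    bool operator==(const MockProjectionPath &rhs) const {
      return indices == rhs.indices;
    }
  };

  // Copies are cheap and explicit; no manual clone helper is required.
  MockProjectionPath extend(MockProjectionPath base, uint32_t idx) {
    base.append(idx);  // operates on the by-value copy
    return base;
  }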
We do see a slight increase in compilation time when compiling the stdlib with -O.
This is a result of the way we now get the types of a projection path, but I expect
it to go down (or away) with further improvements to how memory locations are
constructed and cached in later patches.
=== With the OLD Projection. ===
Total amount of memory allocated.
--------------------------------
Bytes Used Count Symbol Name
13032.01 MB 50.6% 2158819 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
2879.70 MB 11.1% 3076018 (anonymous namespace)::ARCSequenceOpts::run()
2663.68 MB 10.3% 1375465 (anonymous namespace)::RedundantLoadElimination::run()
1534.35 MB 5.9% 5067928 (anonymous namespace)::SimplifyCFGPass::run()
1278.09 MB 4.9% 576714 (anonymous namespace)::SILCombine::run()
1052.68 MB 4.0% 935809 (anonymous namespace)::DeadStoreElimination::run()
771.75 MB 2.9% 1677391 (anonymous namespace)::SILCSE::run()
715.07 MB 2.7% 4198193 (anonymous namespace)::GenericSpecializer::run()
434.87 MB 1.6% 652701 (anonymous namespace)::SILSROA::run()
402.99 MB 1.5% 658563 (anonymous namespace)::SILCodeMotion::run()
341.13 MB 1.3% 962459 (anonymous namespace)::DCE::run()
279.48 MB 1.0% 415031 (anonymous namespace)::StackPromotion::run()
Compilation time breakdown.
--------------------------
Running Time Self (ms) Symbol Name
25716.0ms 35.8% 0.0 swift::runSILOptimizationPasses(swift::SILModule&)
25513.0ms 35.5% 0.0 swift::SILPassManager::runOneIteration()
20666.0ms 28.8% 24.0 swift::SILPassManager::runFunctionPasses(llvm::ArrayRef<swift::SILFunctionTransform*>)
19664.0ms 27.4% 77.0 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
3272.0ms 4.5% 12.0 (anonymous namespace)::SimplifyCFGPass::run()
3266.0ms 4.5% 7.0 (anonymous namespace)::ARCSequenceOpts::run()
2608.0ms 3.6% 5.0 (anonymous namespace)::SILCombine::run()
2089.0ms 2.9% 104.0 (anonymous namespace)::SILCSE::run()
1929.0ms 2.7% 47.0 (anonymous namespace)::RedundantLoadElimination::run()
1280.0ms 1.7% 14.0 (anonymous namespace)::GenericSpecializer::run()
1010.0ms 1.4% 45.0 (anonymous namespace)::DeadStoreElimination::run()
966.0ms 1.3% 191.0 (anonymous namespace)::DCE::run()
496.0ms 0.6% 6.0 (anonymous namespace)::SILCodeMotion::run()
=== With the NEW Projection. ===
Total amount of memory allocated.
--------------------------------
Bytes Used Count Symbol Name
11876.64 MB 48.4% 22112349 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
2887.22 MB 11.8% 3079485 (anonymous namespace)::ARCSequenceOpts::run()
1820.89 MB 7.4% 1877674 (anonymous namespace)::RedundantLoadElimination::run()
1533.16 MB 6.2% 5073310 (anonymous namespace)::SimplifyCFGPass::run()
1282.86 MB 5.2% 577024 (anonymous namespace)::SILCombine::run()
772.21 MB 3.1% 1679154 (anonymous namespace)::SILCSE::run()
721.69 MB 2.9% 936958 (anonymous namespace)::DeadStoreElimination::run()
715.08 MB 2.9% 4196263 (anonymous namespace)::GenericSpecializer::run()
Compilation time breakdown.
--------------------------
Running Time Self (ms) Symbol Name
25137.0ms 37.3% 0.0 swift::runSILOptimizationPasses(swift::SILModule&)
24939.0ms 37.0% 0.0 swift::SILPassManager::runOneIteration()
20226.0ms 30.0% 29.0 swift::SILPassManager::runFunctionPasses(llvm::ArrayRef<swift::SILFunctionTransform*>)
19241.0ms 28.5% 83.0 swift::SILPassManager::runPassesOnFunction(llvm::ArrayRef<swift::SILFunctionTransform*>, swift::SILFunction*)
3214.0ms 4.7% 10.0 (anonymous namespace)::SimplifyCFGPass::run()
3005.0ms 4.4% 14.0 (anonymous namespace)::ARCSequenceOpts::run()
2438.0ms 3.6% 7.0 (anonymous namespace)::SILCombine::run()
2217.0ms 3.2% 54.0 (anonymous namespace)::RedundantLoadElimination::run()
2212.0ms 3.2% 131.0 (anonymous namespace)::SILCSE::run()
1195.0ms 1.7% 11.0 (anonymous namespace)::GenericSpecializer::run()
1168.0ms 1.7% 39.0 (anonymous namespace)::DeadStoreElimination::run()
853.0ms 1.2% 150.0 (anonymous namespace)::DCE::run()
499.0ms 0.7% 7.0 (anonymous namespace)::SILCodeMotion::run()
This change is needed for the next update to ToT LLVM. It can be put
into place now without breaking anything, so I am committing it now.
The churn upstream on ilist_node is necessary to remove undefined
behavior. Rather than updating the different ilist_node patches for the
hacky change required to not use iterators, just use iterators and keep
everything as ilist_nodes. Upstream they want to eventually do this, so
it makes sense for us to just do it now.
Please do not introduce new invocations of
ilist_node::get{Next,Prev}Node() into the tree.
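For illustration, a minimal stand-in (std::list plus a MockInst type, not the
real LLVM ilist API) showing the iterator-based pattern to prefer over chasing
getNextNode()/getPrevNode():

  #include <cstdio>
  #include <list>

  struct MockInst { int id; };
  using MockList = std::list<MockInst>;

  // Walk forward using iterators; erasing the current element is safe because
  // the iterator is advanced before the erase, so there is never a need to
  // ask a node for its next node.
  static void eraseEvenIds(MockList &insts) {
    for (auto it = insts.begin(), e = insts.end(); it != e;) {
      auto cur = it++;          // advance first
      if (cur->id % 2 == 0)
        insts.erase(cur);       // no getNextNode() needed afterwards
    }
  }

  int main() {
    MockList l = {{1}, {2}, {3}, {4}};
    eraseEvenIds(l);
    for (auto &i : l)
      std::printf("%d\n", i.id);  // prints 1 and 3
    return 0;
  }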
Now we have 3 cases (a sketch of the dispatch follows the list):
1. OptimizeNone (for functions with too many basic blocks and too many locations). Simply return.
2. Pessimistic single-iteration data flow (for functions with many basic blocks and many locations).
3. Optimistic multiple-iteration data flow (for functions with some basic blocks and some locations
   that require iterative data flow).
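A hedged sketch of the dispatch; the thresholds and names are hypothetical,
not the actual DSE constants:

  #include <cstddef>

  enum class ProcessKind {
    OptimizeNone,        // too large: simply return
    PessimisticOnePass,  // single-iteration, pessimistic data flow
    OptimisticIterative  // multiple-iteration, optimistic data flow
  };

  ProcessKind classify(std::size_t numBlocks, std::size_t numLocations) {
    const std::size_t kBailOutBudget = 1u << 20;   // hypothetical threshold
    const std::size_t kOnePassBudget = 1u << 16;   // hypothetical threshold
    std::size_t budget = numBlocks * numLocations; // worst-case query count
    if (budget > kBailOutBudget)
      return ProcessKind::OptimizeNone;
    if (budget > kOnePassBudget)
      return ProcessKind::PessimisticOnePass;
    return ProcessKind::OptimisticIterative;
  }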
With this change, stdlib and stdlibunittest have some changes in the number of dead
stores (DS) eliminated.
stdlib: 202 -> 203 DS.
stdlibunittest: 42 -> 39 DS.
Compilation time improvement with this change, on a RELEASE+ASSERT compiler for stdlibunittest:
Running Time Self (ms) Symbol Name
5525.0ms 5.3% 25.0 (anonymous namespace)::ARCSequenceOpts::run()
3500.0ms 3.4% 25.0 (anonymous namespace)::RedundantLoadElimination::run()
3050.0ms 2.9% 25.0 (anonymous namespace)::SILCombine::run()
2700.0ms 2.6% 0.0 (anonymous namespace)::SimplifyCFGPass::run()
2100.0ms 2.0% 75.0 (anonymous namespace)::SILCSE::run()
1450.0ms 1.4% 0.0 (anonymous namespace)::DeadStoreElimination::run()
750.0ms 0.7% 75.0 (anonymous namespace)::DCE::run()
Compilation time improvement with this change, on a DEBUG compiler for stdlibunittest:
Running Time Self (ms) Symbol Name
42300.0ms 4.9% 50.0 (anonymous namespace)::ARCSequenceOpts::run()
35875.0ms 4.1% 0.0 (anonymous namespace)::RedundantLoadElimination::run()
30475.0ms 3.5% 0.0 (anonymous namespace)::SILCombine::run()
19675.0ms 2.3% 0.0 (anonymous namespace)::SILCSE::run()
18150.0ms 2.1% 25.0 (anonymous namespace)::SimplifyCFGPass::run()
12475.0ms 1.4% 0.0 (anonymous namespace)::DeadStoreElimination::run()
5775.0ms 0.6% 0.0 (anonymous namespace)::DCE::run()
I do not see a compilation time change in stdlib.
Existing tests ensure correctness.
functions for dead store elimination to handle. In the worst case, the number of
memory behavior or alias queries we need to do is roughly linear in
the # of BBs times the # of locations.
Put in some heuristics to trade off accuracy for compilation time.
NOTE: we are not disabling DSE for these offending functions; instead we are running
a one-iteration pessimistic data flow as opposed to the multiple-iteration optimistic
data flow we did previously.
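A hedged sketch of the difference between the two initializations, with
hypothetical block and bit types: the optimistic flow starts from the full set
and iterates to a fixed point, while the pessimistic flow starts from the empty
set and makes a single, always-safe pass.

  #include <cstddef>
  #include <vector>

  struct BlockState {
    std::vector<bool> writeSetOut;  // one bit per tracked location
  };

  void initOptimistic(std::vector<BlockState> &blocks, std::size_t numLocs) {
    for (BlockState &b : blocks)
      b.writeSetOut.assign(numLocs, true);   // assume every location is overwritten...
    // ...then iterate the transfer functions until nothing changes.
  }

  void initPessimistic(std::vector<BlockState> &blocks, std::size_t numLocs) {
    for (BlockState &b : blocks)
      b.writeSetOut.assign(numLocs, false);  // assume nothing is overwritten...
    // ...a single pass can then only add facts that are locally provable.
  }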
With this change, I see compilation time on StdlibUnitTest drop significantly:
a 50%+ drop in the time spent in DSE in StdlibUnitTest with a release compiler.
I will post more Instruments data post-commit once I get close to my desktop.
I see a slight drop in the # of dead stores (DS) eliminated in stdlib and StdlibUnitTest.
stdlib: 203 DS -> 202 DS. (RLE is affected slightly as well: 6313 -> 6295 RL.)
stdlibunittest: 43 DS -> 42 DS. (RLE is not affected.)
We are passing all existing dead store tests.
In StdlibUnitTest, there is a function that has too many (2450) LSLocations,
and the data flow in DSE takes too long to converge.
StdlibUnittest.TestSuite.(addForwardRangeReplaceableCollectionTests <A, B where A: Swift.RangeReplaceableCollectionType, B: Swift.RangeReplaceableCollectionType, A.SubSequence: Swift.CollectionType, B.Generator.Element: Swift.Equatable, A.SubSequence == A.SubSequence.SubSequence, A.Generator.Element == A.SubSequence.Generator.Element> (Swift.String, makeCollection : ([A.Generator.Element]) -> A, wrapValue : (StdlibUnittest.OpaqueValue<Swift.Int>) -> A.Generator.Element, extractValue : (A.Generator.Element) -> StdlibUnittest.OpaqueValue<Swift.Int>, makeCollectionOfEquatable : ([B.Generator.Element]) -> B, wrapValueIntoEquatable : (StdlibUnittest.MinimalEquatableValue) -> B.Generator.Element, extractValueFromEquatable : (B.Generator.Element) -> StdlibUnittest.MinimalEquatableValue, checksAdded : StdlibUnittest.Box<Swift.Set<Swift.String>>, resiliencyChecks : StdlibUnittest.CollectionMisuseResiliencyChecks, outOfBoundsIndexOffset : Swift.Int) -> ()).(closure #18)
This function alone takes ~20% of the total amount of time spent in DSE in StdlibUnitTest,
and DSE does not eliminate any dead store in the function either. I added this threshold
to abort on functions that have too many LSLocations.
I see no difference in the # of dead stores eliminated in the stdlib.
If we know a function is not a one-iteration function, i.e. its BBWriteSetIn and
BBWriteSetOut have been computed and converged, and a basic block does not even
have StoreInsts, there is no point in processing every instruction of that block
in the last iteration of the data flow again, as no store will be eliminated.
We can simply skip the basic block and rely on the converged BBWriteSetIn
to process its predecessors.
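A hedged sketch with hypothetical types (not the actual DSE data structures):
in the final round, a block without stores is skipped entirely, and its
predecessors simply consume the already-converged bits.

  #include <vector>

  enum class Kind { Store, Load, Other };
  struct MockInst { Kind kind; };

  struct MockBlock {
    std::vector<MockInst> insts;
    std::vector<bool> writeSetIn;  // converged bits, one per location
  };

  bool hasStores(const MockBlock &b) {
    for (const MockInst &i : b.insts)
      if (i.kind == Kind::Store)
        return true;
    return false;
  }

  void finalPass(std::vector<MockBlock> &blocks) {
    for (MockBlock &b : blocks) {
      if (!hasStores(b))
        continue;  // nothing to eliminate here; predecessors still see the
                   // converged writeSetIn
      // ... otherwise scan the block bottom-up and erase the dead stores ...
    }
  }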
Compilation time improvement: from 1.7% to 1.5% of overall compilation time
on stdlib -O. This represents 4.0% of all the time spent in SILOptimizations (37.2%).
Existing tests ensure correctness.
of which a single post-order would be enough for DSE. In this case, we do not really
need to compute the genset and killset (which is a costly operation).
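A hedged sketch of the one-iteration path with hypothetical types: blocks are
visited once in post-order and scanned directly, so no gen/kill sets are ever
materialized (loads, which would clear bits, are omitted for brevity).

  #include <cstddef>
  #include <vector>

  struct MockBlock {
    std::vector<int> storeLocs;     // location written by each store, in order
    std::vector<MockBlock *> succs;
    std::vector<bool> overwritten;  // at block entry, one bit per location
  };

  // Blocks must already be in post-order; with no loops (the one-iteration
  // assumption) every successor is then processed before the block itself,
  // and one pass suffices.
  void onePassDSE(std::vector<MockBlock *> &postOrderBlocks, std::size_t numLocs) {
    for (MockBlock *bb : postOrderBlocks) {
      // Meet: a location is overwritten at the block exit only if it is
      // overwritten at the entry of every successor (exit blocks: all false).
      std::vector<bool> ow(numLocs, !bb->succs.empty());
      for (MockBlock *succ : bb->succs)
        for (std::size_t i = 0; i < numLocs; ++i)
          ow[i] = ow[i] && succ->overwritten[i];
      // Scan the stores bottom-up; a store into an already-overwritten
      // location is dead.
      for (auto it = bb->storeLocs.rbegin(); it != bb->storeLocs.rend(); ++it) {
        if (ow[*it]) { /* dead store: would be deleted here */ }
        ow[*it] = true;  // this store now covers the location
      }
      bb->overwritten = ow;
    }
  }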
On stdlib, I see 93% of the functions are "OneIterationFunction".
With this change, I see the compilation time of DSE drop from 2.0% to 1.7% of the entire compilation.
This represents 4.3% of all the time spent in SILOptimizations (39.5%).
do not have over 64 locations, which makes SmallBitVector a more suitable choice
than BitVector.
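A small sketch of the data-structure choice; llvm::SmallBitVector is the real
LLVM type, the surrounding struct is hypothetical. When the location count is
small enough to fit in the inline word (roughly the pointer width), the
per-block sets avoid the heap allocation that BitVector would typically incur.

  #include "llvm/ADT/SmallBitVector.h"

  struct BlockBits {
    llvm::SmallBitVector writeSetIn;
    llvm::SmallBitVector writeSetOut;

    explicit BlockBits(unsigned numLocations)
        : writeSetIn(numLocations, false), writeSetOut(numLocations, false) {}
  };

  // Typical transfer-style operations stay allocation-free for small sizes.
  void example() {
    BlockBits bb(32);
    bb.writeSetOut.set(3);           // location 3 is overwritten below this point
    bb.writeSetIn = bb.writeSetOut;  // cheap copy: a single word when small
    bb.writeSetIn &= bb.writeSetOut;
    (void)bb.writeSetIn.test(3);
  }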
I see a ~10% drop in compilation time in DSE, i.e. from 1430ms to 1270ms (2.2% to 2.0% of
overall compilation time).
the more compilation time DSE takes.
I see a ~10% drop in DSE compilation time, from 2.4% to 2.2%.
(Last time (~1 week ago) I checked, DSE was taking 1.4% of the compilation time; now it is taking 2.2%.
I will look into where the increase comes from later.)
If a variable cannot escape the function, we mark the store to it as dead before
the function exits.
It was disabled due to some TBAA and side-effect analysis changes.
We were removing 8 dead stores on the stdlib. With this change, we are now removing
203 dead stores.
I only see noise-level performance differences on the PerfTestSuite.
I do not see a real increase in compilation time either.
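As a plain C++ illustration (not SIL) of the pattern this re-enables: the
local `tmp` cannot escape the function, so the last store into it before the
return is dead and can be removed.

  int sumFirstTwo(const int *data) {
    int tmp = 0;  // stack slot that never escapes
    tmp = data[0] + data[1];
    int result = tmp;
    tmp = 0;      // dead store: tmp is never read again and never escapes
    return result;
  }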
1. Add some comments regarding how the pass builds and uses genset and killset.
2. Merge some similar functions.
3. Rename DSEComputeKind to DSEKind.
4. Some other small comment changes.