Generalize the AccessUseDefChainCloner in MemAccessUtils. It was
always meant to work this way, just needed a client.
Add a new API AccessUseDefChainCloner::canCloneUseDefChain().
Add a bailout for begin_borrow and mark_dependence. Those
projections may appear on an access path, but they can't be
individually cloned without compensating code.
Delete InteriorPointerAddressRebaseUseDefChainCloner.
Add a check in OwnershipRAUWHelper for canCloneUseDefChain.
Add test cases for begin_borrow and mark_dependence.
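For illustration, the bailout has roughly this shape (a sketch only,
assuming each step on the access path is a single-operand projection;
not the actual MemAccessUtils code):

  #include "swift/SIL/SILInstruction.h"

  using namespace swift;

  // Walk up from the memory operation's address to the access base.
  // begin_borrow and mark_dependence may appear on the path, but cloning
  // them individually would require compensating ownership instructions,
  // so refuse to clone the chain.
  static bool canCloneChainSketch(SILValue addr, SILValue base) {
    SILValue v = addr;
    while (v != base) {
      auto *inst = dyn_cast<SingleValueInstruction>(v);
      if (!inst)
        return false; // not a recognizable projection chain
      if (isa<BeginBorrowInst>(inst) || isa<MarkDependenceInst>(inst))
        return false; // bail: cloning would need compensation
      v = inst->getOperand(0); // assumed single-operand projection step
    }
    return true;
  }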
For combined load-store hoisting, split loads that contain the
loop-stored value into a single load from the same address as the
loop-stores, and a set of loads disjoint from the loop-stores. The
single load will be hoisted while sinking the stores to the same
address. The disjoint loads will be hoisted normally in a subsequent
iteration on the same loop.
loop:
  load %outer
  store %inner1
exit:

Will be split into:

loop:
  load %inner1
  load %inner2
  store %inner1
exit:

Then, combined load/store hoisting will produce:

  load %inner1
loop:
  load %inner2
exit:
  store %inner1
The LICM algorithm was not robust with respect to address projections
because it identified a projected address by its SILValue. This should
never be done! Use AccessPath instead.
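The intended pattern looks roughly like this (a sketch, assuming
AccessPath's compute and comparison interface):

  #include "swift/SIL/MemAccessUtils.h"

  using namespace swift;

  // Two distinct SILValues may project the same location. Identify a
  // location by its computed access path, never by the SILValue that
  // happens to name it.
  static bool isSameLocation(SILValue addr1, SILValue addr2) {
    AccessPath path1 = AccessPath::compute(addr1);
    AccessPath path2 = AccessPath::compute(addr2);
    return path1.isValid() && path1 == path2;
  }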
Fixes regressions caused by rdar://66791257 (Print statement provokes
"Can't unsafeBitCast between types of different sizes" when
optimizations enabled)
Add AccessedStorage::compute and computeInScope to mirror AccessPath.
Allow recovering the begin_access for Nested storage.
Adds AccessedStorage.visitRoots().
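Usage, roughly (a sketch; the accessor for recovering the begin_access
is assumed):

  #include "swift/SIL/MemAccessUtils.h"

  using namespace swift;

  static void inspectStorage(SILValue address) {
    // compute() looks through enclosing access scopes to the underlying
    // storage; computeInScope() stops at a begin_access, yielding Nested
    // storage.
    AccessedStorage storage = AccessedStorage::compute(address);
    AccessedStorage inScope = AccessedStorage::computeInScope(address);
    if (inScope.getKind() == AccessedStorage::Nested) {
      // Assumed accessor: a Nested storage's value is its begin_access.
      SILValue beginAccess = inScope.getValue();
      (void)beginAccess;
    }
    (void)storage;
  }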
Things that have come up recently but are somewhat blocked on this:
- Moving AccessMarkerElimination down in the pipeline
- SemanticARCOpts correctness and improvements
- AliasAnalysis improvements
- LICM performance regressions
- RLE/DSE improvements
Begin to formalize the model for valid memory access in SIL. Ignoring
ownership, every access is a def-use chain in three parts:
object root -> formal access base -> memory operation address
AccessPath abstracts over this path and standardizes the identity of a
memory access throughout the optimizer. This abstraction is the basis
for a new AccessPathVerification.
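For illustration, the three parts can be recovered from a memory
operation's address operand roughly like this (a sketch, assuming the
AccessPathWithBase form of the utility):

  #include "swift/SIL/MemAccessUtils.h"

  using namespace swift;

  static void decomposeAccess(SILValue operandAddress) {
    // memory operation address -> formal access base -> object root
    AccessPathWithBase pathAndBase =
        AccessPathWithBase::compute(operandAddress);
    SILValue accessBase = pathAndBase.base;      // formal access base
    AccessedStorage storage =
        pathAndBase.accessPath.getStorage();     // object root / storage
    (void)accessBase;
    (void)storage;
  }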
With that verification, we now have all the properties we need for the
kind of analysis required for exclusivity enforcement, but now
generalized for any memory analysis. This is suitable for an extremely
lightweight analysis with no side data structures. We currently have a
massive amount of ad-hoc memory analysis throughout SIL, which is
incredibly unmaintainable, bug-prone, and not performance-robust. We
can begin taking advantage of this verifiably complete model to solve
that problem.
The properties this gives us are:
Access analysis must be complete over memory operations: every memory
operation needs a recognizable valid access. An access can be
unidentified only to the extent that it is rooted in some non-address
type and we can prove that it is at least *not* part of an access to a
nominal class or global property. Pointer provenance is also required
for future IRGen-level bitfield optimizations.
Access analysis must be complete over address users: for an identified
object root all memory accesses including subobjects must be
discoverable.
Access analysis must be symmetric: use-def and def-use analysis must
be consistent.
AccessPath is merely a wrapper around the existing accessed-storage
utilities and IndexTrieNode. Existing passes already very successfully
use this approach, but in an ad-hoc way. With a general utility we
can:
- update passes to use this approach to identify memory access,
reducing the space and time complexity of those algorithms.
- implement an inexpensive on-the-fly, debug mode address lifetime analysis
- implement a lightweight debug mode alias analysis
- ultimately improve the power, efficiency, and maintainability of
full alias analysis
- make our type-based alias analysis sensitive to the access path
It is legal for the optimizer to consider code after a loop always
reachable, but when a loop has no exits, or when the loop's exits are
dominated by a conditional statement, we should not consider
conditional statements within the loop as dominating all possible
execution paths through the loop. At least not when some path through
the loop contains a "synchronization point", such as a function that
may contain a memory barrier, perform I/O, or exit the program.
Sadly, we still don't model synchronization points in the optimizer,
so we need to conservatively assume all loops have a synchronization
point and avoid hoisting conditional traps that may never be executed.
Fixes rdar://66791257 (Print statement provokes "Can't unsafeBitCast
between types of different sizes" when optimizations enabled)
Originated in 2014.
Specifically:
1. I made methods, variables camelCase.
2. I expanded out variable names (e.g., bb -> block, predBB -> predBlocks, U -> wrappedUse).
3. I changed typedef -> using.
4. I changed a few c style for loops into for each loops using llvm::enumerate.
NOTE: I left the parts needed for syncing to LLVM in the old style since LLVM
needs these to exist for CRTP to work correctly for the SILSSAUpdater.
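Purely illustrative, on standalone C++ rather than the actual
SILSSAUpdater code, this is the flavor of those changes:

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/ADT/SmallVector.h"

  // Before: typedef llvm::SmallVector<unsigned, 8> ValueVector;
  using ValueVector = llvm::SmallVector<unsigned, 8>;

  static unsigned weightedSum(const ValueVector &values) {
    unsigned sum = 0;
    // Before: for (unsigned i = 0, e = values.size(); i != e; ++i)
    for (auto indexedValue : llvm::enumerate(values))
      sum += indexedValue.index() * indexedValue.value();
    return sum;
  }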
For use outside access enforcement passes.
Add isUniquelyIdentifiedAfterEnforcement.
Rename functions for clarity and generality.
Rename isUniquelyIdentifiedOrClass to isFormalAccessBase.
Rename findAccessedStorage to identifyFormalAccess.
Rename findAccessedStorageNonNested to findAccessedStorage.
Part of generalizing the utility for use outside the access
enforcement passes.
Even if a store does not dominate the loop exits, it makes sense to move it out of the loop if the pre-header also has a store to the same memory location.
When this is done, dead-store elimination can then most likely remove the store in the pre-header.
This loop optimization hoists and sinks a group of loads and stores to
the same address.
Consider this SIL...
PRELOOP:
  %stackAddr = alloc_stack $Index
  %outerAddr1 = struct_element_addr %stackAddr : $*Index, #Index.value
  %innerAddr1 = struct_element_addr %outerAddr1 : $*Int, #Int._value
  %outerAddr2 = struct_element_addr %stackAddr : $*Index, #Index.value
  %innerAddr2 = struct_element_addr %outerAddr2 : $*Int, #Int._value
LOOP:
  %_ = load %innerAddr2 : $*Builtin.Int64
  store %_ to %outerAddr2 : $*Int
  %_ = load %innerAddr1 : $*Builtin.Int64
There are two bugs:
1) LICM miscompiles code during combined load/store hoisting and sinking.
When the loop contains an aliasing load from a different projection
of the same value, the optimization sinks the store but never replaces
the load. At runtime, the load reads a stale value.
FIX: isOnlyLoadedAndStored needs to check for other load instructions
before hoisting/sinking a seemingly unrelated set of
loads/stores. Checking side effect instructions is insufficient. The
same bug could happen with stores, which also do not produce side
effects.
Fixes <rdar://61246061> LICM miscompile:
Combined load/store hoisting/sinking with aliases
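The fix has roughly this shape (a simplified sketch: the real helper
takes more pass context, and mayAlias stands in for the alias-analysis
query):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/STLExtras.h"
  #include "swift/SIL/SILInstruction.h"
  #include "swift/SILOptimizer/Analysis/LoopAnalysis.h"

  using namespace swift;

  // Before hoisting/sinking the load/store group for `addr`, verify that
  // every load and store in the loop that may alias `addr` belongs to the
  // group. Scanning only side-effect instructions misses loads and stores.
  static bool isOnlyLoadedAndStoredSketch(
      SILLoop *loop, llvm::ArrayRef<LoadInst *> groupLoads,
      llvm::ArrayRef<StoreInst *> groupStores, SILValue addr,
      llvm::function_ref<bool(SILValue, SILValue)> mayAlias) {
    for (SILBasicBlock *block : loop->getBlocks()) {
      for (SILInstruction &inst : *block) {
        if (auto *load = dyn_cast<LoadInst>(&inst)) {
          if (!llvm::is_contained(groupLoads, load) &&
              mayAlias(load->getOperand(), addr))
            return false; // aliasing load outside the group: bail
        } else if (auto *store = dyn_cast<StoreInst>(&inst)) {
          if (!llvm::is_contained(groupStores, store) &&
              mayAlias(store->getDest(), addr))
            return false; // aliasing store outside the group: bail
        }
      }
    }
    return true;
  }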
2) The LICM algorithm is not robust with respect to address projections
because it identifies a projected address by its SILValue. This
should never be done! It is trivial to represent a projection path
using an IndexTrieNode (there is also an abstraction called
"ProjectionPath", but it should _never_ actually be stored by an
analysis because of the time and space complexity of doing so).
The second bug is not necessary to fix for correctness, so it will be
fixed in a follow-up commit.
Global initializers are executed only once.
Therefore it's possible to hoist such an initializer call to the loop pre-header, provided there are no conflicting side effects in the loop before the call.
Also, the call must post-dominate the loop pre-header; otherwise it would be executed speculatively.
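The legality check then looks roughly like this (a sketch;
hasConflictingSideEffectsBeforeCall is a hypothetical placeholder for
the pass's scan over the loop body):

  #include "swift/SIL/Dominance.h"
  #include "swift/SILOptimizer/Analysis/LoopAnalysis.h"

  using namespace swift;

  // Hypothetical placeholder: does any instruction in the loop that can
  // execute before the call conflict with the global's initialization?
  static bool hasConflictingSideEffectsBeforeCall(ApplyInst *call,
                                                  SILLoop *loop);

  static bool canHoistGlobalInitCall(ApplyInst *initCall, SILLoop *loop,
                                     PostDominanceInfo *postDomTree) {
    SILBasicBlock *preheader = loop->getLoopPreheader();
    if (!preheader)
      return false;
    // The call must post-dominate the pre-header; otherwise hoisting
    // would execute it speculatively.
    if (!postDomTree->dominates(initCall->getParent(), preheader))
      return false;
    // Global initializers run only once, so the remaining requirement is
    // that no conflicting side effects precede the call within the loop.
    return !hasConflictingSideEffectsBeforeCall(initCall, loop);
  }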
and eliminate dead code. This is meant to be a replacement for the utility
recursivelyDeleteTriviallyDeadInstructions. The new utility performs more
aggressive dead-code elimination for ownership SIL.
This patch also migrates most non-force-delete uses of
recursivelyDeleteTriviallyDeadInstructions to the new utility.
and migrates one force-delete use of recursivelyDeleteTriviallyDeadInstructions
(in IRGenPrepare) to use the new utility.
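To illustrate what "more aggressive for ownership SIL" means, a sketch
of the OSSA-aware notion of dead code (the helper name is hypothetical;
the real utility also deletes recursively):

  #include "swift/SIL/SILInstruction.h"

  using namespace swift;

  // Under ownership SIL, an instruction can be dead even though its
  // results have uses, when every use merely ends the value's lifetime.
  // The instruction can then be deleted together with those uses.
  static bool isDeadUnderOwnershipSketch(SILInstruction *inst) {
    if (inst->mayHaveSideEffects())
      return false;
    for (SILValue result : inst->getResults()) {
      for (Operand *use : result->getUses()) {
        SILInstruction *user = use->getUser();
        if (!isa<DestroyValueInst>(user) && !isa<EndBorrowInst>(user))
          return false; // a real use keeps the instruction alive
      }
    }
    return true;
  }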
This is a combination of load hoisting and store sinking, e.g.
preheader:
  br header_block
header_block:
  %x = load %not_aliased_addr
  // use %x and define %y
  store %y to %not_aliased_addr
  ...
exit_block:

is transformed to:

preheader:
  %x = load %not_aliased_addr
  br header_block
header_block:
  // use %x and define %y
  ...
exit_block:
  store %y to %not_aliased_addr
The XXOptUtils.h convention is already established and parallels
the SIL/XXUtils convention.
New:
- InstOptUtils.h
- CFGOptUtils.h
- BasicBlockOptUtils.h
- ValueLifetime.h
Removed:
- Local.h
- Two conflicting CFG.h files
This reorganization is helpful before I introduce more
utilities for block cloning similar to SinkAddressProjections.
Move the control flow utilities out of Local.h, which was an
unreadable, unprincipled mess. Rename it to InstOptUtils.h, and
confine it to small APIs for working with individual instructions.
These are the optimizer's additions to /SIL/InstUtils.h.
Rename CFG.h to CFGOptUtils.h and remove the one in /Analysis. Now
there is only SIL/CFG.h, resolving the naming conflict within the
swift project (this has always been a problem for source tools). Limit
this header to low-level APIs for working with branches and CFG edges.
Add BasicBlockOptUtils.h for block level transforms (it makes me sad
that I can't use BBOptUtils.h, but SIL already has
BasicBlockUtils.h). These are larger APIs for cloning or removing
whole blocks.
Now that we've moved to C++14, we no longer need the llvm::make_unique
implementation from STLExtras.h. This patch is a mechanical replacement
of (hopefully) all the llvm::make_unique instances in the swift repo.
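The shape of the replacement, on an illustrative type:

  #include <memory>

  struct Widget {
    int id;
    explicit Widget(int id) : id(id) {}
  };

  int main() {
    // Before, using the C++11-era polyfill from LLVM's STLExtras.h:
    //   auto w = llvm::make_unique<Widget>(42);
    // After, using the standard C++14 facility:
    auto w = std::make_unique<Widget>(42);
    return w->id == 42 ? 0 : 1;
  }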
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers in our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Consider the attached test cases:
We have a begin_access [dynamic] to a global inside of a loop.
There's a nested conflict on said access due to an apply() instruction between the begin and end accesses.
LICM is currently very conservative: if there are any function calls inside of the loop that conflict with the begin and end access, we do not hoist out of the loop.
However, if all conflicting applies are "sandwiched" between the begin and end access, there's no reason we can't hoist out of the loop.
See rdar://problem/43660965; this improves some internal benchmarks by over 3x.
In some instances, some instructions, like ref_element_addr, can be hoisted out of loops even if they are not guaranteed to be executed.
We currently don't support that and bail; we only do the further analysis for begin_access, because begin_access instructions are extremely heavy.
However, we need to support hoisting ref_element_addr in that case, when it does not have a loop-dependent operand, in order to be able to hoist begin_access instructions in some benchmarks.
Initial local testing shows that this PR, when we enable exclusivity, improves the performance of a certain internal benchmark by over 40%.
See rdar://problem/43623829
We can't hoist everything that is hoistable:
The canHoist method does not do all the required analysis.
Some of the work is done in COW Array Opt.
TODO: Refactor COW Array Opt + canHoist - radar 41601468
Major refactoring + tuning of LICM. Includes:
Support for hoisting more array semantic calls
Remove restrictions for sinking instructions
Add support for hoisting and sinking instruction pairs (begin and end accesses)
Testing with Exclusivity enabled on a couple of benchmarks shows:
ReversedArray 7x improvement
StringWalk 2.6x improvement
Removing this optimization from SIL: It is not worth the extra code complexity and compilation time.
More in-depth explanation for the reasoning behind my decision:
1) What is being done there is obviously not LICM (more below) - even if it is useful it should be its own separate optimization
2) The regression that caused us to add this code is no longer there in most cases - 10% in only one specific corner-case
3) Even if the regression was still there, this is an extremely specific code pattern that we are pattern-matching against. Said pattern would be hard to find in any real code.
There is a small code snippet in rdar://17451529 that caused us to add
this optimization. Looking at it now, we see that the only difference
at the SIL level is in loop 1:
  %295 = tuple_extract %294 : $(Builtin.Int64, Builtin.Int1), 0
  %296 = tuple_extract %294 : $(Builtin.Int64, Builtin.Int1), 1
  cond_fail %296 : $Builtin.Int1
  %298 = struct $Int (%295 : $Builtin.Int64)
  store %298 to %6 : $*Int
  %300 = builtin "cmp_eq_Int64"(%292 : $Builtin.Int64, %16 : $Builtin.Int64) : $Builtin.Int1
  cond_br %300, bb1, bb12
The cond_fail instruction in said loop is moved below the store
instruction, just above the builtin. Looking at the resulting IR and
how LLVM optimizes it, the output is almost the same. If we look at
the assembly code being executed, then before removing this
optimization we have:
LBB0_11:
  testq %rcx, %rcx
  je LBB0_2
  decq %rcx
  incq %rax
  movq %rax, _$S4main4sum1Sivp(%rip)
  jno LBB0_11
After removing it we have:

LBB0_11:
  incq %rax
  testq %rcx, %rcx
  je LBB0_2
  decq %rcx
  movq %rax, %rdx
  incq %rdx
  jno LBB0_11
There is no extra load/movq, which was the issue mentioned in the radar.
Make this a generic analysis so that it can be used to analyze any
kind of function effect.
FunctionSideEffect becomes a trivial specialization of the analysis.
The immediate need for this is to introduce a new
AccessedStorageAnalysis, although I foresee it as a generally very
useful utility. This way, new kinds of function effects can be
computed without adding any complexity or compile time to
FunctionSideEffects. We have the flexibility of computing different
kinds of function effects at different points in the pipeline.
In the case of AccessedStorageAnalysis, it will compute both
FunctionSideEffects and FunctionAccessedStorage in the same pass by
implementing a simple wrapper on top of FunctionEffects.
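As a standalone sketch of the pattern (illustrative C++, not the
compiler's actual class hierarchy), the driver is templated over the
effects summary and side effects become one instantiation:

  #include <map>
  #include <string>

  // The generic driver only needs the summary type to know how to merge
  // a callee's effects into a caller's, as bottom-up propagation does.
  template <typename EffectsTy>
  class GenericFunctionEffectAnalysis {
    std::map<std::string, EffectsTy> summaries;

  public:
    EffectsTy &getEffects(const std::string &function) {
      return summaries[function];
    }
    void propagateCalleeToCaller(const std::string &caller,
                                 const std::string &callee) {
      summaries[caller].mergeFrom(summaries[callee]);
    }
  };

  // Side effects become a trivial specialization of the generic analysis.
  struct FunctionSideEffects {
    bool readsGlobal = false;
    bool writesGlobal = false;
    void mergeFrom(const FunctionSideEffects &other) {
      readsGlobal |= other.readsGlobal;
      writesGlobal |= other.writesGlobal;
    }
  };

  using SideEffectAnalysis =
      GenericFunctionEffectAnalysis<FunctionSideEffects>;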
This cleanup reflects my feeling that nested classes make the code
extremely unreadable unless they are very small and either private or
only used directly via their parent class. It's easier to see how these
classes compose with a flat type system.
In addition to enabling new kinds of function effects analyses, I
think this makes the implementation of side effect analysis easier to
understand by separating concerns.