* re-implement the pass in Swift
* support alloc_stack liveranges which span over multiple basic blocks
* support `load`-`store` pairs, copying from the alloc_stack (in addition to `copy_addr`)
Those improvements help to reduce temporary stack allocations, especially for InlineArrays.
rdar://151606382
Besides cleaning up the source code, the motivation for the translation into Swift is to make it easier to improve the pass for some InlineArray-specific optimizations (though I'm not sure yet if we really need those).
Also, the new implementation doesn't contain the optimize-store-into-temp optimization anymore, because this is covered by redundant load elimination.
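As a sketch (hypothetical source, not from the commit), here is the kind of copy the improved pass can now eliminate: an InlineArray copied through a stack temporary whose liverange spans basic blocks.
```
// Hypothetical illustration: the copy into `tmp` may be lowered as a
// load-store pair through an alloc_stack, and the temporary's liverange
// spans the `if` — a case the pass can now handle.
func pick(_ a: InlineArray<4, Int>, _ flag: Bool) -> Int {
    let tmp = a
    if flag {
        return tmp[0]
    }
    return tmp[3]
}
```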
Propagating array element values is done by load-simplification and redundant-load-elimination.
So ArrayElementPropagation is not needed anymore.
ArrayElementPropagation also replaced `Array.append(contentsOf:)` with individual `Array.append` calls.
This optimization is removed because the benefit is questionable anyway.
In most cases it resulted in a code size increase.
MandatoryPerformanceOptimizations already did most of the vtable specialization work.
So it makes sense to remove the VTableSpecializerPass completely and do everything in MandatoryPerformanceOptimizations.
- VTableSpecializer, a new pass that synthesizes a new vtable for each observed concrete type
- Don't use full type metadata refs in embedded Swift
- Lazily emit specialized class metadata (LazySpecializedClassMetadata) in IRGen
- Don't emit regular class metadata for a class decl if it's generic (only emit the specialized metadata)
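A hypothetical sketch of what this means for a generic class in embedded Swift:
```
// Hypothetical example: Box<Int> and Box<Bool> each get a specialized
// vtable and specialized class metadata; no unspecialized metadata is
// emitted for the generic Box itself.
class Box<T> {
    var value: T
    init(_ value: T) { self.value = value }
    func get() -> T { value }
}

let intBox = Box(42)     // observed concrete type: Box<Int>
let boolBox = Box(true)  // observed concrete type: Box<Bool>
```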
The new implementation has several benefits compared to the old C++ implementation:
* It is significantly simpler. It optimizes each load separately instead of all at once with bit-field based dataflow.
* It uses alias analysis more accurately, which enables more loads to be optimized.
* It avoids inserting additional copies in OSSA.
The algorithm is a data flow analysis which starts at the original load and searches for preceding stores or loads by following the control flow backward.
The preceding stores and loads provide the "available values" with which the original load can be replaced.
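A minimal sketch of the per-load backward walk within a single block, using stand-in types rather than the actual compiler API:
```
// Stand-in model for illustration only; the real pass works on SIL
// instructions and queries alias analysis for the may-write check.
enum Effect {
    case storesToLoadedAddress  // provides an available value
    case mayWriteLoadedAddress  // aliasing write: blocks the walk
    case unrelated
}

// Walks backward from the load and returns the index of the instruction
// providing the available value, or nil if the walk is blocked. The real
// pass also continues into predecessor blocks and merges the available
// values found on each incoming path.
func availableValueIndex(loadAt loadIndex: Int, in block: [Effect]) -> Int? {
    var i = loadIndex - 1
    while i >= 0 {
        switch block[i] {
        case .storesToLoadedAddress: return i
        case .mayWriteLoadedAddress: return nil
        case .unrelated: i -= 1
        }
    }
    return nil
}
```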
The old C++ pass didn't catch a few cases.
Also:
* The new pass is significantly simpler: it doesn't perform dataflow for _all_ memory locations at once using bitfields, but handles each store separately. (In both implementations there is a complexity limit in place to avoid quadratic complexity)
* The new pass works with OSSA
This will turn `partial_apply` instructions into explicit box construction and
extraction code sequences. To begin with, recognize when a private function
is only used in partial applications and directly modify the function to be
usable as a closure invocation function. This simplifies the lowering in IRGen
and avoids generating a "partial application forwarder" thunk.
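For illustration (hypothetical source, and the lowering details are an assumption): the closure below may be lowered as a `partial_apply` of its private closure function with `base` captured in a box; with this change, that function can serve directly as the closure invocation function instead of going through a forwarder thunk.
```
// Hypothetical example of source code whose lowering contains a
// partial_apply capturing `base`.
func makeAdder(_ base: Int) -> (Int) -> Int {
    return { x in base + x }
}
```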
Extract and rewrite the destroy hoisting algorithm originally from
CopyForwarding (in 2014).
This is now a lightweight utility for hoisting destroy_addr
instructions. Shrinking an object's memory lifetime can allow removal
of copy_addr and other optimizations.
This is extremely low-overhead and can run at any optimization level
without dependency on any analysis.
This algorithm is:
- Incremental
- SSA-based
- Canonical
- Free from alias analysis
See file-level comments.
The immediate purpose is to specify and test the constraints
introduced by adding lexical variable lifetimes to SIL semantics. It
can be used as a template for end_borrow hoisting.
Ultimately, this utility can be invoked within any pass that needs to
optimize a particular uniquely identified address. It will be used to
remove much of the complexity from CopyForwarding.
TLDR: The reason I am doing this is that oftentimes people confuse assembly
vision remarks with normal opt remarks. I want to accentuate that this is
actually trying to do something different than a traditional opt remark. To
that end, I renamed things in the compiler and added a true attribute
`@_assemblyVision` to trigger the compiler to emit these remarks, helping
everyone remember what this is in their ontology. I explain the difference
below.
----
Normal opt remarks work by the optimizer telling you whether it succeeded or
failed to perform an optimization. Another way of putting this is that opt
remarks try to give the user feedback from an expert system about why it did
or did not do something. There is inherently an act of interpretation in the
optimizer about whether or not to report an 'action' that it performed to the
user.
Assembly Vision Remarks instead try to be an expert tool that acts like an
x-ray. Instead of telling the user what the optimizer did, it is a simple
visitor that visits the IR and emits source locations for where specific
hazards ended up in the program. In this sense it is just telling the user
where certain instructions ended up, using heuristics to relate this to
information at the IR level. To get a sense of this difference, consider the
following Swift code:
```
public class Klass {
    func doSomething() {}
}

var global: Klass = Klass()

@inline(__always)
func bar() -> Klass { global }

@_assemblyVision
@inline(never)
func foo() {
    bar().doSomething()
}
```
In this case, we will emit the following remarks:
```
test.swift:16:5: remark: begin exclusive access to value of type 'Klass'
bar().doSomething()
^
test.swift:7:5: note: of 'global'
var global: Klass = Klass()
^
test.swift:16:9: remark: end exclusive access to value of type 'Klass'
bar().doSomething()
^
test.swift:7:5: note: of 'global'
var global: Klass = Klass()
^
test.swift:16:11: remark: retain of type 'Klass'
bar().doSomething()
^
test.swift:7:5: note: of 'global'
var global: Klass = Klass()
^
test.swift:16:23: remark: release of type 'Klass'
bar().doSomething()
^
test.swift:7:5: note: of 'global'
var global: Klass = Klass()
^
```
Notice how the begin/end exclusive access remarks are marked as actually
occurring before the retain and release of global. That seems weird, since
exclusive access to memory seems like something that should not escape an
exclusivity scope... but in fact this corresponds directly to what we
eventually see in the SIL:
```
// test.sil
sil hidden [noinline] [_semantics "optremark"] @$ss3fooyyF : $@convention(thin) () -> () {
bb0:
%0 = global_addr @$ss6globals5KlassCvp : $*Klass
%1 = begin_access [read] [dynamic] [no_nested_conflict] %0 : $*Klass
%2 = load %1 : $*Klass
end_access %1 : $*Klass
%4 = class_method %2 : $Klass, #Klass.doSomething : (Klass) -> () -> (), $@convention(method) (@guaranteed Klass) -> ()
strong_retain %2 : $Klass
%6 = apply %4(%2) : $@convention(method) (@guaranteed Klass) -> ()
strong_release %2 : $Klass
%8 = tuple ()
return %8 : $()
} // end sil function '$ss3fooyyF'
```
and assembly,
```
// test.S
_$ss3fooyyF:
pushq %rbp
movq %rsp, %rbp
pushq %r13
pushq %rbx
subq $32, %rsp
leaq _$ss6globals5KlassCvp(%rip), %rdi
leaq -40(%rbp), %rsi
xorl %edx, %edx
xorl %ecx, %ecx
callq _swift_beginAccess
movq _$ss6globals5KlassCvp(%rip), %r13
movq (%r13), %rax
movq 80(%rax), %rbx
movq %r13, %rdi
callq _swift_retain
callq *%rbx
movq %r13, %rdi
callq _swift_release
addq $32, %rsp
popq %rbx
popq %r13
popq %rbp
retq
```
So, as one can see, what we are trying to do is inform the user of hazards in
the code without trying to reason about them, automating a task that users
often have to perform by hand: inspecting assembly to determine where runtime
calls and other hazards ended up.
SemanticARCOpts keeps on growing, with various optimizations attached to a single
"optimization" manager. Move it to its own folder in preparation for splitting it
into multiple different optimizations and utility files.
Optimizes String operations with constant operands.
Specifically:
* Replaces x.append(y) with x = y if x is empty.
* Removes x.append("")
* Replaces x.append(y) with x = x + y if x and y are constant strings.
* Replaces _typeName(T.self) with a constant string if T is statically known.
With this optimization it's possible to constant fold string interpolations, like "the \(Int.self) type" -> "the Int type".
This new pass runs on high-level SIL, where semantic calls are still in place.
rdar://problem/65642843
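A hypothetical example of code the listed folds can reduce to a constant:
```
// Hypothetical example: with the folds above, `message()` can be reduced
// to the constant string "the Int type".
func message() -> String {
    var s = ""
    s.append("")                  // removed
    s.append("the ")              // s is empty: becomes s = "the "
    s.append("\(Int.self) type")  // _typeName(Int.self) folds to "Int";
                                  // the constant append is then folded too
    return s
}
```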
We do allow for limited def-use traversal to do things like find debug_value,
but nothing like a full data flow pass. As a proof of concept I just emit
opt-remarks when we see retains and releases. Eventually I would like to also
print out which lvalue the retain is from, but that is more than I want to do
with this proof of concept.
Optimizes copies from a temporary (an "l-value") to a destination.
  %temp = alloc_stack $Ty
  instructions_which_store_to %temp
  copy_addr [take] %temp to %destination
  dealloc_stack %temp
is optimized to
  destroy_addr %destination
  instructions_which_store_to %destination
The name TempLValueOpt refers to the TempRValueOpt pass, which performs a related transformation, just with the temporary on the "right" side.
The TempLValueOpt is similar to CopyForwarding::backwardPropagateCopy.
It's more restricted (e.g. the copy source must be an alloc_stack), but that restriction enables it to optimize patterns which backwardPropagateCopy cannot handle.
This pass also performs a small peephole optimization which simplifies copy_addr - destroy sequences.
  copy_addr %source to %destination
  destroy_addr %source
is replaced with
  copy_addr [take] %source to %destination
Start with some high-level checks of whether a declaration is formally final, or has no visible overrides
in the domain of the program visible to the SIL module.
We can eventually adopt more of the logic from the Devirtualizer pass to tell whether call sites are
"effectively final", and maybe make the Devirtualizer consume that information applied by this pass.
Constant folds the uniqueness result of begin_cow_mutation instructions, if it can be proved that the buffer argument is uniquely referenced.
For example:
  %buffer = end_cow_mutation %mutable_buffer
  // ...
  // %buffer does not escape here
  // ...
  (%is_unique, %mutable_buffer2) = begin_cow_mutation %buffer
  cond_br %is_unique, ...
is replaced with
  %buffer = end_cow_mutation [keep_unique] %mutable_buffer
  // ...
  (%not_used, %mutable_buffer2) = begin_cow_mutation %buffer
  %true = integer_literal 1
  cond_br %true, ...
Note that the keep_unique flag is set on the end_cow_mutation because the code now relies on the buffer really being uniquely referenced.
The optimization can also handle def-use chains between end_cow_mutation and begin_cow_mutation which involve phi-arguments.
An additional peephole optimization is performed: if the begin_cow_mutation is the only use of the end_cow_mutation, the whole pair of instructions is eliminated.
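A hypothetical Swift-level example of when this can fire: after inlining, the end_cow_mutation finishing the first append feeds the begin_cow_mutation starting the second, so the second uniqueness check folds to "unique".
```
// Hypothetical example; whether the folding actually applies depends on
// inlining and on the buffer not escaping between the two appends.
func twoAppends() -> [Int] {
    var a: [Int] = []
    a.append(1)  // ends with end_cow_mutation of the buffer
    a.append(2)  // its begin_cow_mutation sees a provably unique buffer
    return a
}
```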
If only a single field of a struct phi-argument is used, replace the argument by the field value.
  br bb(%str)
bb(%phi):
  %f = struct_extract %phi, #Field // the only use of %phi
  use %f
is replaced with
  %f = struct_extract %str, #Field
  br bb(%f)
bb(%phi):
  use %phi
This also works if the phi-argument is in a def-use cycle.
The new PhiExpansionPass is in the same file as the RedundantPhiEliminationPass. Therefore I renamed the source file to PhiArgumentOptimizations.cpp
This simplifies the handling of the subdirectories in the SIL and
SILOptimizer paths. Create individual libraries as object libraries,
which allows the analysis of the source changes to be limited in scope.
Because these are object libraries, this has zero overhead compared to the
previous implementation. Additionally, string operations over the filenames
are avoided. The cost for this is that any new sub-library needs to be
added to the list rather than added with the special local function.
Add DifferentiabilityWitnessDevirtualizer: an optimization pass that
devirtualizes `differentiability_witness_function` instructions into
`function_ref` instructions.
Co-authored-by: Dan Zheng <danielzheng@google.com>
RedundantPhiElimination eliminates block phi-arguments which have the same value as other arguments of the same block.
This also works with cycles, like two equivalent loop induction variables. Such patterns are generated e.g. when using stdlib's enumerated() on Array.
preheader:
  br header(%initval, %initval)
header(%phi1, %phi2):
  %next1 = builtin "add" (%phi1, %one)
  %next2 = builtin "add" (%phi2, %one)
  cond_br %loopcond, header(%next1, %next2), exit
exit:
is replaced with
preheader:
  br header(%initval)
header(%phi1):
  %next1 = builtin "add" (%phi1, %one)
  %next2 = builtin "add" (%phi1, %one) // dead: will be cleaned up later
  cond_br %loopcond, header(%next1), exit
exit:
Any remaining dead or "trivially" equivalent instructions will then be cleaned up by DCE and CSE, respectively.
rdar://problem/33438123
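A sketch of the enumerated() pattern mentioned above (hypothetical source): the tuple index and the iterator's internal position are two induction variables with identical values, which shows up as redundant phi-arguments in the loop header.
```
// Hypothetical example: `i` and the iterator's position advance in
// lockstep, producing two equivalent phi-arguments that the pass merges.
func indexWeightedSum(_ a: [Int]) -> Int {
    var sum = 0
    for (i, x) in a.enumerated() {
        sum += i * x
    }
    return sum
}
```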
DestroyHoisting moves destroys of memory locations up the control flow as far as possible.
Besides destroy_addr, a "store [assign]" is also considered a destroy, because it is equivalent to a destroy_addr plus a "store [init]".
The main purpose of this optimization is to minimize copy-on-write operations for arrays, etc. Especially if such COW containers are used as enum payloads and modified in-place. E.g.
  switch e {
  case .A(var arr):
    arr.append(x)
    self = .A(arr)
  ...
In such a case, DestroyHoisting can move the destroy of the self-assignment up before the switch and thus ensure the array buffer is only single-referenced at the time of the append.
When we have ownership SIL throughout the pass pipeline this optimization will replace the current destroy hoisting optimization in CopyForwarding.
For now, this optimization only runs in the mandatory pipeline (but not for -Onone) where we already have ownership SIL.
SR-10605
rdar://problem/50463362
General case:
  begin_access A
  ...
  strong_release / release_value / destroy
  end_access
The release instruction can be sunk below the end_access instruction. This extends the lifetime of the released value, but might allow us to mark the access scope as [no_nested_conflict].
General case:
  begin_access A (may or may not have no_nested_conflict)
  load/store
  end_access
  apply // may have a scoped access that conflicts with A
  begin_access A [no_nested_conflict]
  load/store
  end_access A
The second access scope does not need to be emitted.
NOTE: KeyPath access must be identified at the top-level, non-inlinable stdlib entry point.
As such, the stdlib entry point is annotated with a new @_semantics attribute that is equivalent to @inline(never).
This is a simple "utility" pass that canonicalizes SSA SILValues with
respect to copies and destroys. It is a self-contained, provably
complete pass that eliminates spurious copy_value instructions from
scalar SSA SILValues. It fundamentally depends on ownership SIL, but
otherwise can be run efficiently after any other pass. It separates
the pure problem of handling scalar SSA values from the more important
and complex problems:
- Promoting variables to SSA form (PredictableMemOps and Mem2Reg
partially do this).
- Optimizing copies within "SIL borrow" scopes (another mandatory pass
will be introduced to do this).
- Composing and decomposing aggregates (SROA handles some of this).
- Coalescing phis (A BlockArgumentOptimizer will be introduced as part
of AddressLowering).
- Removing unnecessary retain/release when nothing within its scope
may release the same object (ARC Code Motion does some of this).
Note that removing SSA copies was more obviously necessary before the
migration to +0 argument convention.
All this does is automate the creation of the ${DIRNAME}_SOURCES variables that we already create, and it allows the author to avoid having to prefix each file with the directory name, i.e.:
  set(FOOBAR_SOURCES
    FooBar/Source.cpp
    PARENT_SCOPE)
=>
  silopt_register_sources(
    Source.cpp)
Much easier and cleaner to read. I put the code that implements this in the
CMakeLists.txt file just for the SILOptimizer.
Use AccessedStorageAnalysis to find access markers with no nested conflicts.
This optimization analyzes the scope of each access to determine
whether it contains a potentially conflicting access. If not, then it
can be demoted to an instantaneous check, which still catches
conflicts on any enclosing outer scope.
This removes up to half of the runtime calls associated with
exclusivity checking.
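A hypothetical example of an access that can be demoted:
```
// Hypothetical example: the read-modify-write of the global is dynamically
// enforced; since its access scope contains no potentially conflicting
// access, the begin/end pair can be demoted to an instantaneous check that
// still catches conflicts with any enclosing outer scope.
var counter = 0

func increment() {
    counter += 1
}
```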
I am going to be adding logic here to enable apple/swift#1550 to be completed.
The rename makes sense due to precedent from LLVM's CodeGenPrepare, and also
since I am going to be expanding what the pass is doing beyond just "cleaning
up". It is really a grab-bag pass for performing simple transformations that we
do not want to pollute IRGen's logic with.
https://github.com/apple/swift/pull/15502
rdar://39335800
We run GlobalOpt multiple times in the pass pipeline but in some cases object outlining shouldn't be done too early.
Having it done in a separate pass enables running it independently of GlobalOpt.