Commit Graph

1369 Commits

Author SHA1 Message Date
Konrad `ktoso` Malawski
1044723787 [Concurrency] Initial Task Local Values implementation 2021-02-13 10:39:22 +09:00
Andrew Trick
29b9e51cb6 Remove ReadNone attribute from runtime functions
The Swift compiler incorrectly sets the LLVM "ReadNone" attribute when
declaring swift_getObjCClassFromObject and swift_projectBox. This
means that the LLVM ARC optimizer will hoist ARC "release" operations
above these runtime calls. Since the implementation of the calls reads
the object, this causes a use-after-free crash.

This problem is easy to reproduce with
swift_getObjCClassFromObject. It was only exposed recently because the
SIL optimizer now shrinks object lifetimes, making it easier for LLVM
optimizations to kick in. It is difficult to expose the problem with
swift_projectBox, but the bug/fix is still obvious.

Fixes rdar://73820091: Use-after free application crash.
2021-02-08 14:53:24 -08:00
Nate Chandler
c59b01feee [Runtime] Ptrauth for runAsyncAndBlock.
rdar://72357371
2021-02-04 20:19:26 -08:00
tbkka
a32dacb131 SwiftDtoa v2: Better, Smaller, Faster floating-point formatting (#35299)
* SwiftDtoa v2: Better, Smaller, Faster floating-point formatting

SwiftDtoa is the C/C++ code used in the Swift runtime to produce the textual representations used by the `description` and `debugDescription` properties of the standard Swift floating-point types.
This update includes a number of algorithmic improvements to SwiftDtoa to improve portability, reduce code size, and improve performance but does not change the actual output.

About SwiftDtoa
===============

In early versions of Swift, the `description` properties used the C library `sprintf` functionality with a fixed number of digits.
In 2018, that logic was replaced with the first version of SwiftDtoa which used used a fast, adaptive algorithm to automatically choose the correct number of digits for a particular value.
The resulting decimal output is always:

* Accurate.  Parsing the decimal form will yield exactly the same binary floating-point value again. This guarantee holds for any parser that accurately implements IEEE 754. In particular, the Swift standard library can guarantee that for any Double `d` that is not a NaN, `Double(d.description) == d`.

* Short. Among all accurate forms, this form has the fewest significant digits. (Caution: Surprisingly, this is not the same as minimizing the number of characters. In some cases, minimizing the number of characters requires producing additional significant digits.)

* Close. If there are multiple accurate, short forms, this code chooses the decimal form that is closest to the exact binary value.  If there are two exactly the same distance, the one with an even final digit will be used.

Algorithms that can produce this "optimal" output have been known since at least 1990, when Steele and White published their Dragon4 algorithm.
However, Dragon4 and other algorithms from that period relied on high-precision integer arithmetic, which made them slow.
More recently, a surge of interest in this problem has produced dramatically better algorithms that can produce the same results using only fast fixed-precision arithmetic.

This format is ideal for JSON and other textual interchange: accuracy ensures that the value will be correctly decoded, shortness minimizes network traffic, and the existence of high-performance algorithms allows this form to be generated more quickly than many `printf`-based implementations.

This format is also ideal for logging, debugging, and other general display. In particular, the shortness guarantee avoids the confusion of unnecessary additional digits, so that the result of `1.0 / 10.0` consistently displays as `0.1` instead of `0.100000000000000000001`.

About SwiftDtoa v2
==================

Compared to the original SwiftDtoa code, this update is:

**Better**:
The core logic is implemented using only C99 features with 64-bit and smaller integer arithmetic.
If available, 128-bit integers are used for better performance.
The core routines do not require any floating-point support from the C/C++ standard library and with only minor modifications should be usable on systems with no hardware or software floating-point support at all.
This version also has experimental support for IEEE 754 binary128 format, though this support is obviously not included when compiling for the Swift standard library.

**Smaller**:
Code size reduction compared to the earlier versions was a primary goal for this effort.
In particular, the new binary128 support shares essentially all of its code with the float80 implementation.

**Faster**:
Even with the code size reductions, all formats are noticeably faster.
The primary performance gains come from three major changes:
Text digits are now emitted directly in the core routines in a form that requires only minimal adjustment to produce the final text.
Digit generation produces 2, 4, or even 8 digits at a time, depending on the format.
The double logic optimistically produces 7 digits in the initial scaling with a Ryu-inspired backtracking when fewer digits suffice.

SwiftDtoa's algorithms
======================

SwiftDtoa started out as a variation of Florian Loitsch' Grisu2 that addressed the shortness failures of that algorithm.
Subsequent work has incorporated ideas from Errol3, Ryu, and other sources to yield a production-quality implementation that is performance- and size-competitive with current research code.

Those who wish to understand the details can read the extensive comments included in the code.
Note that float16 actually uses a different algorithm than the other formats, as the extremely limited range can be handled with much simpler techniques.
The float80/binary128 logic sacrifices some performance optimizations in order to minimize the code size for these less-used formats; the goal for SwiftDtoa v2 has been to match the float80 performance of earlier implementations while reducing code size and widening the arithmetic routines sufficiently to support binary128.

SwiftDtoa Testing
=================

A newly-developed test harness generates several large files of test data that include known-correct results computed with high-precision arithmetic routines.
The test files include:
* Critical values generated by the algorithm presented in the Errol paper (about 48 million cases for binary128)
* Values for which the optimal decimal form is exactly midway between two binary floating-point values.
* All exact powers of two representable in this format.
* Floating-point values that are close to exact powers of ten.

In addition, several billion random values for each format were compared to the results from other implementations.
For binary16 and binary32 this provided exhaustive validation of every possible input value.

Code Size and Performance
=========================

The tables below summarize the code size and performance for the SwiftDtoa C library module by itself on several different processor architectures.
When used from Swift, the `.description` and `.debugDescription` implementations incur additional overhead for creating and returning Swift strings that are not captured here.

The code size tables show the total size in bytes of the compiled `.o` object files for a particular version of that code.
The headings indicate the floating-point formats supported by that particular build (e.g., "16,32" for a version that supports binary16 and binary32 but no other formats).

The performance numbers below were obtained from a custom test harness that generates random bit patterns, interprets them as the corresponding floating-point value, and averages the overall time.
For float80, the random bit patterns were generated in a way that avoids generating invalid values.

All code was compiled with the system C/C++ compiler using `-O2` optimization.

A few notes about particular implementations:
* **SwiftDtoa v1** is the original SwiftDtoa implementation as committed to the Swift runtime in April 2018.
* **SwiftDtoa v1a** is the same as SwiftDtoa v1 with added binary16 support.
* **SwiftDtoa v2** can be configured with preprocessor macros to support any subset of the supported formats.  I've provided sizes here for several different build configurations.
* **Ryu** (Ulf Anders) implements binary32 and binary64 as completely independent source files.  The size here is the total size of the two .o object files.
* **Ryu(size)** is Ryu compiled with the `RYU_OPTIMIZE_SIZE` option.
* **Dragonbox** (Junekey Jeon).  The size here is the compiled size of a simple `.cpp` file that instantiates the template for the specified formats, plus the size of the associated text output logic.
* **Dragonbox(size)** is Dragonbox compiled to minimize size by using a compressed power-of-10 table.
* **gdtoa** has a very large feature set.  For this reason, I excluded it from the code size comparison since I didn't consider the numbers to be comparable to the others.

x86_64
----------------

These were built using Apple clang 12.0.5 on a 2019 16" MacBook Pro (2.4GHz 8-core Intel Core i9) running macOS 11.1.

**Code Size**

Bold numbers here indicate the configurations that have shipped as part of the Swift runtime.

|               | 16,32,64,80 | 32,64,80    | 32,64       |
|---------------|------------:|------------:|------------:|
|SwiftDtoa v1   |             |   **15128** |             |
|SwiftDtoa v1a  |   **16888** |             |             |
|SwiftDtoa v2   |   **20220** |     18628   |        8248 |
|Ryu            |             |             |       40408 |
|Ryu(size)      |             |             |       23836 |
|Dragonbox      |             |             |       23176 |
|Dragonbox(size)|             |             |       15132 |

**Performance**

|              | binary16 | binary32 | binary64 | float80 | binary128 |
|--------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1  |          |     25ns |     46ns |    82ns |           |
|SwiftDtoa v1a |     37ns |     26ns |     47ns |    83ns |           |
|SwiftDtoa v2  |     22ns |     19ns |     31ns |    72ns |      90ns |
|Ryu           |          |     19ns |     26ns |         |           |
|Ryu(size)     |          |     17ns |     24ns |         |           |
|Dragonbox     |          |     19ns |     24ns |         |           |
|Dragonbox(size) |        |     19ns |     29ns |         |           |
|gdtoa         |    220ns |    381ns |   1184ns | 16044ns |   22800ns |

ARM64
----------------

These were built using Apple clang 12.0.0 on a 2020 M1 Mac Mini running macOS 11.1.

**Code Size**

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          |  7436 |
|SwiftDtoa v1a  |     9124 |       |
|SwiftDtoa v2   |     9964 |  8228 |
|Ryu            |          | 35764 |
|Ryu(size)      |          | 16708 |
|Dragonbox      |          | 27108 |
|Dragonbox(size)|          | 19172 |

**Performance**

|              | binary16 | binary32 | binary64 | float80 | binary128 |
|--------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1  |          |     21ns |     39ns |         |           |
|SwiftDtoa v1a |     17ns |     21ns |     39ns |         |           |
|SwiftDtoa v2  |     15ns |     17ns |     29ns |    54ns |      71ns |
|Ryu           |          |     15ns |     19ns |         |           |
|Ryu(size)     |          |     29ns |     24ns |         |           |
|Dragonbox     |          |     16ns |     24ns |         |           |
|Dragonbox(size) |        |     15ns |     34ns |         |           |
|gdtoa         |    143ns |    242ns |    858ns | 25129ns |   36195ns |

ARM32
----------------

These were built using clang 8.0.1 on a BeagleBone Black (500MHz ARMv7) running FreeBSD 12.1-RELEASE.

**Code Size**

|               | 16,32,64 | 32,64 |
|---------------|---------:|------:|
|SwiftDtoa v1   |          |  8668 |
|SwiftDtoa v1a  |    10356 |       |
|SwiftDtoa v2   |     9796 |  8340 |
|Ryu            |          | 32292 |
|Ryu(size)      |          | 14592 |
|Dragonbox      |          | 29000 |
|Dragonbox(size)|          | 21980 |

**Performance**

|              | binary16 | binary32 | binary64 | float80 | binary128 |
|--------------|---------:|---------:|---------:|--------:|----------:|
|SwiftDtoa v1  |          |    459ns |   1152ns |         |           |
|SwiftDtoa v1a |    383ns |    451ns |   1148ns |         |           |
|SwiftDtoa v2  |    202ns |    357ns |    715ns |  2720ns |    3379ns |
|Ryu           |          |    345ns |   5450ns |         |           |
|Ryu(size)     |          |    786ns |   5577ns |         |           |
|Dragonbox     |          |    300ns |    904ns |         |           |
|Dragonbox(size) |        |    294ns |   1021ns |         |           |
|gdtoa         |   2180ns |   4749ns |  18742ns |293000ns |  440000ns |

* This is fast enough now even for non-optimized test runs

* Fix float80 Nan/Inf parsing, comment more thoroughly
2021-01-27 14:35:55 -08:00
Arnold Schwaighofer
b0d31730de Merge pull request #35303 from aschwaighofer/introduce_swiftasync_lowering
Add llvm::Attribute::SwiftAsync to the context parameter
2021-01-26 06:03:19 -08:00
Saleem Abdulrasool
8b5869ce62 Merge pull request #35566 from mikeash/whoops-i-did-it-again
[Runtime] #ifdef __clang__ on printf __attribute__ in Portability.h
2021-01-22 20:00:12 -08:00
Mike Ash
c673c13548 [Runtime] #ifdef __clang__ on printf __attribute__ in Portability.h 2021-01-22 19:36:28 -05:00
Konrad `ktoso` Malawski
80ee936a72 Revert "[Concurrency] isCanceled spelling to follow guidance" 2021-01-23 07:27:34 +09:00
Mike Ash
4988127e99 Merge pull request #35525 from mikeash/swift_asprintf-static-checking
[Runtime] Mark swift_asprintf with __attribute__((__format__))
2021-01-22 15:04:39 -05:00
Arnold Schwaighofer
daa72d3cc5 Add llvm::Attribute::SwiftAsync to the context parameter
* Adds support for generating code that uses swiftasync parameter lowering.

* Currently only arm64's llvm lowering supports the swift_async_context_addr intrinsic.

* Add arm64e pointer signing of updated swift_async_context_addr.

This commit needs the PR llvm-project#2291.

* [runtime] unittests should use just-built compiler if the runtime did

This will start to matter with the introduction of usage of swiftasync parameters which only very recent compilers support.

rdar://71499498
2021-01-22 10:01:55 -08:00
Mike Ash
216e555ad6 [Runtime] Mark swift_asprintf with __attribute__((__format__))
This gives us build-time warnings about format string mistakes, like we would get if we called the built-in asprintf directly.

Make TypeLookupError's format string constructor a macro instead so that its callers can get these build-time warnings.

This reveals various mistakes in format strings and arguments in the runtime, which are now fixed.

rdar://73417805
2021-01-22 10:54:45 -05:00
Konrad `ktoso` Malawski
8b37455774 [Concurrency] isCanceled spelling to follow guidance 2021-01-22 12:09:19 +09:00
Evan
4f708fad7e Merge pull request #35215 from etcwilde/ewilde/async-main
Adding async-main support

Resolves: rdar://71828636
2021-01-15 10:57:14 -08:00
Kavon Farvardin
91e2246c19 quick-and-dirty implementation of MainActor in the runtime system
it is "dirty" in the sense that we don't have proper support for
custom executors right now.
2021-01-14 17:43:58 -08:00
Evan Wilde
6b16657922 Explode on uncaught error thrown in main
This patch has two desirable effects for the price of one.
 1. An uncaught error thrown from main will now explode
 2. Move us off of using runAsyncAndBlock

The issue with runAsyncAndBlock is that it blocks the main thread
outright. UI and the main actor need to run on the main thread or bad
things happen, so blocking the main thread results in a bad day for
them.

Instead, we're using CFRunLoopRun to run the core-foundation run loop on
the main thread, or, dispatch_main if CFRunLoopRun isn't available.
The issue with just using dispatch_main is that it doesn't actually
guarantee that it will run the tasks on the main thread either, just
that it clears the main queue. We don't want to require everything that
uses concurrency to have to include CoreFoundation either, but dispatch
is already required, which supplies dispatch_main, which just empties
out the main queue.
2021-01-13 15:49:28 -08:00
Mike Ash
1557dbcb86 [Runtime] Make Concurrent.h ReaderCount loads seq_cst as well since they need to be ordered with the emplacement of new storage pointers. 2021-01-12 16:46:08 -05:00
Mike Ash
29c9f2af52 Merge pull request #35348 from mikeash/fix-concurrentarray-memory-order2
[Runtime] Fix incorrect memory ordering in ConcurrentReadableArray/HashMap
2021-01-12 14:44:43 -05:00
Mike Ash
ef8d20a0e6 [Runtime] Fix incorrect memory ordering in ConcurrentReadableArray/HashMap.
When reallocating storage, the storing the pointer to the new storage had insufficiently strong ordering. This could cause writers to check the reader count before storing the new storage pointer. If a reader then came in between that load and store, it would end up using the old storage pointer, while the writer could end up freeing it.

Also adjust the Concurrent.cpp tests to test with a variety of reader and writer counts. Counterintuitively, when freeing garbage is gated on there being zero readers, having a single reader will shake out problems that having lots of readers will not. We were testing with lots of readers in order to stress the code as much as possible, but this resulted in it being extremely rare for writers to ever see zero active readers.

rdar://69798617
2021-01-12 10:41:56 -05:00
Joe Groff
f2d28cfc12 Merge pull request #35188 from jckarter/checked-continuation
Concurrency: Introduce a `CheckedContinuation` adapter.
2021-01-11 13:39:34 -08:00
Mike Ash
13f84ddd7e [Runtime] Fix premature snapshot destruction in StableAddressConcurrentReadableHashMap::find.
Mark the relevant snapshot methods as ref-qualified (adding a & after the parameter list) so the compiler will enforce this for us and forbid calling them on temporaries.

rdar://72997638
2021-01-11 10:59:08 -05:00
Joe Groff
c151f4b02b Concurrency: Introduce a CheckedContinuation adapter.
To help catch runtime issues adopting `withUnsafeContinuation`, such as callback-based APIs that misleadingly
invoke their callback multiple times and/or not at all, provide a couple of classes that can take ownership of
a fresh `UnsafeContinuation` or `UnsafeThrowingContinuation`, and log attempts to resume the continuation multiple times
or discard the object without ever resuming the continuation.
2021-01-08 13:50:42 -08:00
Mike Ash
4ccfdfb72a [Runtime] Fix StableAddressConcurrentReadableHashMap threading crash.
StableAddressConcurrentReadableHashMap::getOrInsert had a race condition in the first lookup, where the snapshot was destroyed before the pointer was extracted from the returned wrapper. Fix this by creating the snapshot outside the if so that it stays alive.

rdar://problem/71932487
2020-12-21 16:52:56 -05:00
Konrad `ktoso` Malawski
b267778bf1 Rebased to use new global executor 2020-12-17 06:05:13 +09:00
Konrad `ktoso` Malawski
9e1ecc539c [Concurrency] guard offer/poll with a lock for now; cleanups 2020-12-17 06:05:13 +09:00
Konrad `ktoso` Malawski
7b37554096 [Concurrency] Initial TaskGroup implementation working 2020-12-17 06:05:13 +09:00
Konrad `ktoso` Malawski
e294c7cbad [Concurrency] Implement TaskGroup.isEmpty via readyQueue
before reversing order of fragments; future must be last since dynamic
size

offer fixed

before implementing poll
2020-12-17 06:05:13 +09:00
Konrad `ktoso` Malawski
520b513e8a [Concurrency] Task: isCancelled,checkCancelled implementation
move comments to the wired up continuations

remove duplicated continuations; leep the wired up ones

before moving to C++ for queue impl

trying to next wait via channel_poll

submitting works; need to impl next()
2020-12-17 06:05:13 +09:00
Mike Ash
aac942c330 Merge pull request #35061 from mikeash/protocol-conformance-iteration-order-workaround
[Runtime] Add a disabled workaround for protocol conformance checking to check conformances in reverse order.
2020-12-14 16:38:04 -05:00
Mike Ash
9ac3b0ee3b [Runtime] Add a disabled workaround for protocol conformance checking to check conformances in reverse order.
rdar://problem/72049977
2020-12-14 12:59:18 -05:00
John McCall
d874479290 Add builtins to initialize and destroy a default-actor member.
It would be more abstractly correct if this got DI support so
that we destroy the member if the constructor terminates
abnormally, but we can get to that later.
2020-12-10 19:18:53 -05:00
John McCall
1177cde4e3 Use current public Dispatch API to schedule global work.
We expect to iterate on this quite a bit, both publicly
and internally, but this is a fine starting-point.

I've renamed runAsync to runAsyncAndBlock to underline
very clearly what it does and why it's not long for this
world.  I've also had to give it a radically different
implementation in an effort to make it continue to work
given an actor implementation that is no longer just
running all work synchronously.

The major remaining bit of actor-scheduling work is to
make swift_task_enqueue actually do something sensible
based on the executor it's been given; currently it's
expecting a flag that IRGen simply doesn't know to set.
2020-12-10 19:18:53 -05:00
Doug Gregor
cc2e32f2fa Merge pull request #34902 from jckarter/continuation-resume-operations
Concurrency: Implement `resume(returning:)` and `resume(throwing:)` for Unsafe*Continuation.
2020-12-07 20:59:03 -08:00
Joe Groff
185f44e2d3 Concurrency: Implement resume(returning:) and resume(throwing:) for Unsafe*Continuation. 2020-12-07 07:47:47 -08:00
Doug Gregor
4957a5e8b4 Merge pull request #34969 from DougGregor/swift_get_jobflags
[Concurrency] Traffic in underlying C type rather than JobFlags.
2020-12-04 15:39:58 -08:00
Doug Gregor
105a27322d [Concurrency] Traffic in underlying C type rather than JobFlags.
Suppresses a warning about returning a C++ type from a C entrypoint,
fixing rdar://71731181.
2020-12-04 12:29:18 -08:00
John McCall
c346d94655 Fix the build and implementation of the 16-byte atomics for MSVC.
Credit for the cmake fix here goes to Saleem Abdulrasool.

The substantive fix is embarrassing; I didn't pay close attention
to the intrinsic's argument order and just assumed that the first
argument for the replacement value was the low half (the part
you'd find at index 0 if it were an array), but in fact it's the
high half (the part you'd find at index 1).

I also change the code to be much more reinterpret_casty, which
isolates the type-punning mostly "within" the intrinsic, and
which seems to match how other code uses it.
2020-12-04 01:51:29 -05:00
Erik Eckstein
4e296c0c5a IRGen: emit emit_hop_to_executor instructions.
This basically means to create a suspension point and call the swift_task_switch runtime function.

rdar://71099289
2020-12-03 12:42:41 +01:00
John McCall
a53d83f4db Disable the swift::atomic<..., 16> specialization until we can
figure out how to convince clang-cl to accept it.

We need to make progress.
2020-12-02 18:47:13 -05:00
John McCall
853a8657dd Add a basic default-actor implementation with support for
eagerly adopting and abandoning threads.
2020-12-02 18:47:02 -05:00
John McCall
78a317ad75 Add support for declaring thread-local variables in the runtime.
Use native thread-locals when available and simulated
thread-locals when not.  The simulation layer uses
pthread_getspecific.

Using TLS is significantly more annoying this way, but I kindof
like it because it reinforces that TLS accesses aren't as cheap
as they look.
2020-12-02 18:47:02 -05:00
John McCall
ca16548470 Add a swift::atomic<T> which uses a better ABI than MSVC's 128-bit std::atomic. 2020-12-02 18:47:02 -05:00
John McCall
0555a86f82 Extract executor stuff out into a separate header and
start introducing the idea of a swiftasync CC.
2020-12-02 18:47:02 -05:00
Richard Wei
de2dbe57ed [AutoDiff] Bump-pointer allocate pullback structs in loops. (#34886)
In derivatives of loops, no longer allocate boxes for indirect case payloads. Instead, use a custom pullback context in the runtime which contains a bump-pointer allocator.

When a function contains a differentiated loop, the closure context is a `Builtin.NativeObject`, which contains a `swift::AutoDiffLinearMapContext` and a tail-allocated top-level linear map struct (which represents the linear map struct that was previously directly partial-applied into the pullback). In branching trace enums, the payloads of previously indirect cases will be allocated by `swift::AutoDiffLinearMapContext::allocate` and stored as a `Builtin.RawPointer`.
2020-11-30 15:49:38 -08:00
Konrad `ktoso` Malawski
7126f32f22 fix properly, by using Task.JobFlags which does conversions 2020-11-20 00:46:40 -08:00
Konrad `ktoso` Malawski
261f0d2dcd +task Implement Task.currentPriority 2020-11-20 09:00:41 +09:00
Doug Gregor
894528062d [Concurrency] Implement Task.Handle.get() in terms of an async runtime call.
Switch the contract between the runtime operation `swift_future_task_wait`
and Task.Handle.get() pver to an asynchronous call, so that the
compiler will set up the resumption frame for us. This allows us to
correctly wait on futures.

Update our "basic" future test to perform both normal returns and
throwing returns from a future, either having to wait on the queue or
coming by afterward.
2020-11-18 22:02:14 -08:00
Arnold Schwaighofer
fa54ff8568 IRGen: Fix lowering of Builtin.createAsyncTask and Builtin.createAsyncTaskFuture
Thick async functions store their async context size in the closure
context. Only if the closure context is nil can we assume the
partial_apply_forwarder function to be the address of an async function
pointer struct value.
2020-11-16 13:33:55 -08:00
Doug Gregor
069dfad638 [Concurrency] Add Builtin.createAsyncTaskFuture.
This new builtin allows the creation of a "future" task, which calls
down to swift_task_create_future to actually form the task.
2020-11-15 22:37:13 -08:00
Doug Gregor
7f3db7fbde [Concurrency] Have future functions write their results directly.
Introduce `FutureAsyncContext` to line up with the async context formed
by IR generation for the type `<T> () async throws -> T`. When allocating
a future task, set up the context with the address of the future's storage
for the successful result and null out the error result, so the caller
will directly fill in the result. This eliminates a bunch of extra
complexity and a copy.
2020-11-14 15:18:54 -08:00
Doug Gregor
4b924673ce [Concurrency] Use a single atomic for future wait queue.
Use a single atomic for the wait queue that combines the status with
the first task in the queue. Address race conditions in waiting and
completing the future.

Thanks to John for setting the direction here for me.
2020-11-14 15:18:54 -08:00