Use native thread-locals when available and simulated
thread-locals when not. The simulation layer uses
pthread_getspecific.
Using TLS is significantly more annoying this way, but I kind of
like it because it reinforces that TLS accesses aren't as cheap
as they look.
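
For illustration, a minimal sketch of what such a layer could look
like. This is not the code from this change: HAVE_NATIVE_TLS,
counter_get and counter_bump are hypothetical names, and the
fallback assumes a plain POSIX threads environment.

```c
#include <pthread.h>
#include <stdlib.h>

/* HAVE_NATIVE_TLS is a hypothetical configure-time macro, not a real one. */
#ifdef HAVE_NATIVE_TLS

/* Native path: the compiler and loader give each thread its own copy. */
static __thread int counter_native;

static int *counter_get(void)
{
    return &counter_native;
}

#else

/* Simulated path: one process-wide key, one heap slot per thread. */
static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_make_key(void)
{
    /* free() is the destructor run on each thread's slot at thread exit. */
    if (pthread_key_create(&counter_key, free) != 0)
        abort();
}

static int *counter_get(void)
{
    int *slot;

    pthread_once(&counter_once, counter_make_key);
    slot = pthread_getspecific(counter_key);
    if (slot == NULL) {
        /* First access on this thread: allocate a zeroed slot. */
        slot = calloc(1, sizeof *slot);
        if (slot == NULL || pthread_setspecific(counter_key, slot) != 0)
            abort();
    }
    return slot;
}

#endif

/* Callers go through the accessor on both paths, so the cost stays visible. */
static int counter_bump(void)
{
    return ++*counter_get();
}
```

Every access is a function call in both configurations, which is
the annoying part, but it also means a caller that touches the
variable in a loop has an obvious reason to fetch the pointer once
and reuse it.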