Open Source and information security mailing list archives
 
Date:   Fri, 23 Sep 2016 15:02:00 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Waiman Long <waiman.long@....com>
cc:     Davidlohr Bueso <dave@...olabs.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Ingo Molnar <mingo@...nel.org>,
        Jonathan Corbet <corbet@....net>, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, Jason Low <jason.low2@....com>,
        Scott J Norton <scott.norton@....com>,
        Douglas Hatch <doug.hatch@....com>
Subject: Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 04:38 PM, Thomas Gleixner wrote:
> > >                  wait-wake futex     PI futex        TO futex
> > >                  ---------------     --------        --------
> > > max time            3.49s            50.91s          2.65s
> > > min time            3.24s            50.84s          0.07s
> > > average time        3.41s            50.90s          1.84s
> > > sys time          7m22.4s            55.73s        2m32.9s
> > That's really interesting. Do you have any explanation for this massive
> > system time differences?
> 
> For the wait-wake futexes, the waiters are kicked back with EAGAIN without
> sleeping whenever the futex value changes before they can acquire the hash
> bucket spinlock. These waiters then attempt to acquire the lock again in
> user space. If that fails, they call FUTEX_WAIT again.
> 
> In the above test case, the critical section is pretty short and about 4/5 of
> the FUTEX_WAIT calls resulted in an EAGAIN error. So there are a lot more
> user-space/kernel transitions than with the TO futexes. I think the difference
> in the number of FUTEX_WAIT calls vs. FUTEX_LOCK_TO calls accounts for the
> difference in their sys times.

Ok.
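(For reference, the wait-wake retry loop described above is roughly the
following -- a minimal sketch assuming a simple 0 = unlocked / 1 = locked
protocol, not glibc's actual three-state mutex protocol.)

```c
/*
 * Minimal sketch of the wait-wake locking loop described above,
 * assuming a 0 = unlocked / 1 = locked protocol.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wait(atomic_int *uaddr, int val)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void lock(atomic_int *f)
{
	int expected = 0;

	while (!atomic_compare_exchange_strong(f, &expected, 1)) {
		/*
		 * Lock is held: ask the kernel to sleep until *f changes.
		 * If the value already changed before the hash bucket
		 * spinlock is taken, FUTEX_WAIT returns EAGAIN without
		 * sleeping and we loop back to retry in user space --
		 * the extra transitions that inflate sys time in the
		 * numbers above.
		 */
		futex_wait(f, 1);
		expected = 0;
	}
}
```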

> As for the PI futex, it is a strictly sleeping lock and so won't use that
> much sys time.

That's clear.

> > > Locking was done mostly by lock stealing. This is where most of the
> > > performance benefit comes from, not optimistic spinning.
> > How does the lock latency distribution of all this look like and how fair
> > is the whole thing?
> 
> The TO futexes are unfair as can be seen from the min/max thread times listed
> above. It took the fastest thread 0.07s to complete all the locking
> operations, whereas the slowest one needed 2.65s. However, the situation
> reverses when I changed the critical section to a 1us sleep. In this case,

1us sleep is going to add another syscall and therefore scheduling, so what?

Or did you just extend the critical section busy time?

> there will be no optimistic spinning. The performance results for 100k locking
> operations are listed below.
> 
>                 wait-wake futex     PI futex        TO futex
>                 ---------------     --------        --------
> max time            0.06s             9.32s          4.76s

      		      ^^^^ ????			

> min time            5.59s             9.36s          5.62s
> average time        3.25s             9.35s          5.41s
> 
> In this case, the TO futexes are fairer but perform worse than the wait-wake
> futexes. That is because the lock handoff mechanism limits the amount of lock
> stealing in the TO futexes, while the wait-wake futexes have no such
> restriction. When I disabled lock handoff, the TO futexes performed
> similarly to the wait-wake futexes.
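(The handoff restriction described above works roughly like this -- a
hypothetical sketch, not the patch's actual futex word layout; the flag
names are assumptions for illustration.)

```c
/*
 * Hypothetical sketch of why lock handoff limits stealing.  Once the
 * owner marks the lock as handed off to the waiter at the head of the
 * queue, would-be stealers must back off instead of grabbing it.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define TO_LOCKED	0x1U
#define TO_HANDOFF	0x2U	/* assumed flag, for illustration */

static bool try_steal(atomic_uint *f)
{
	unsigned int v = atomic_load(f);

	/*
	 * A stealer may grab a free lock, unless it has been handed
	 * off to a specific queued waiter.
	 */
	if (v & (TO_LOCKED | TO_HANDOFF))
		return false;
	return atomic_compare_exchange_strong(f, &v, v | TO_LOCKED);
}
```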

So the benefit of these new fangled futexes is only there for extreme short
critical sections and a gazillion of threads fighting for the same futex,
right?

I really wonder how the average programmer should pick the right flavour,
not to mention how something like glibc could make any useful decision
about which one to use.

Thanks,

	tglx
