Message-ID: <57E99AF1.70209@hpe.com>
Date:   Mon, 26 Sep 2016 18:02:25 -0400
From:   Waiman Long <waiman.long@....com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     Davidlohr Bueso <dave@...olabs.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Ingo Molnar <mingo@...nel.org>,
        Jonathan Corbet <corbet@....net>,
        <linux-kernel@...r.kernel.org>, <linux-doc@...r.kernel.org>,
        Jason Low <jason.low2@....com>,
        Scott J Norton <scott.norton@....com>,
        Douglas Hatch <doug.hatch@....com>
Subject: Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

On 09/23/2016 09:02 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
>>>> Locking was done mostly by lock stealing. This is where most of the
>>>> performance benefit comes from, not optimistic spinning.
>>> What does the lock latency distribution of all this look like, and how fair
>>> is the whole thing?
>> The TO futexes are unfair as can be seen from the min/max thread times listed
>> above. It took the fastest thread 0.07s to complete all the locking
>> operations, whereas the slowest one needed 2.65s. However, the situation
>> reversed when I changed the critical section to a 1us sleep. In this case,
> A 1us sleep is going to add another syscall and therefore scheduling, so what?
>
> Or did you just extend the critical section busy time?

The 1us sleep causes the spinning to stop and makes all the waiters 
sleep. This simulates the extreme case where the TO futexes may not 
have a performance advantage.
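
Roughly, the two variants of the critical section look like this. This 
is just a sketch with illustrative names, not the actual microbenchmark 
code:

#include <time.h>

/*
 * Sketch of the two critical-section variants (illustrative; not the
 * actual benchmark source).
 */

/* Busy critical section: N x86 PAUSE instructions. The lock holder
 * stays on the CPU, so waiters can profitably spin. */
static inline void critical_section_pause(int npauses)
{
	int i;

	for (i = 0; i < npauses; i++)
		__builtin_ia32_pause();	/* gcc/clang x86 builtin */
}

/* 1us sleep: enters the kernel and reschedules, so optimistic spinning
 * on the holder stops and the waiters end up sleeping. */
static inline void critical_section_sleep(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };

	nanosleep(&ts, NULL);
}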

>
>> there will be no optimistic spinning. The performance results for 100k locking
>> operations are listed below.
>>
>>                  wait-wake futex     PI futex        TO futex
>>                  ---------------     --------        --------
>> max time            0.06s             9.32s          4.76s
>        		^^^^ ????			

Yes, the wait-wake futex is the unfair one in this case.

>> min time            5.59s             9.36s          5.62s
>> average time        3.25s             9.35s          5.41s
>>
>> In this case, the TO futexes are fairer but perform worse than the wait-wake
>> futexes. That is because the lock handoff mechanism limits the amount of lock
>> stealing in the TO futexes, while the wait-wake futexes have no such
>> restriction. When I disabled lock handoff, the TO futexes would then perform
>> similarly to the wait-wake futexes.
> So the benefit of these newfangled futexes is only there for extremely short
> critical sections and a gazillion threads fighting for the same futex,
> right?

Not really. Lock stealing will help performance when a gazillion 
threads are fighting for the same futex. Optimistic spinning will help 
to reduce the lock transfer latency because the waiter isn't sleeping, 
no matter how many threads there are. One set of data that I haven't 
shown so far is that the performance delta between wait-wake and TO 
futexes actually increases as the critical section is lengthened. This 
is because, for a short critical section, the waiters of a wait-wake 
futex may not actually go to sleep: by the time they get through the 
code that has to run before the final check of whether the futex value 
has changed, the lock has often already been released, so the wait 
aborts. The longer the critical section, the higher the chance that 
the waiters actually sleep, and hence the worse their performance gets 
relative to the TO futexes.
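
To make that concrete, here is a rough sketch of a wait-wake futex 
lock, along the lines of Drepper's "Futexes Are Tricky". The helper 
names are mine and this is not the benchmark code:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
static void ww_lock(atomic_int *f)
{
	int zero = 0;

	/* Uncontended fast path. */
	if (atomic_compare_exchange_strong(f, &zero, 1))
		return;

	/*
	 * The exchange below is itself a form of lock stealing: a
	 * newly arriving thread can grab the lock here ahead of
	 * waiters that are already sleeping in the kernel.
	 */
	while (atomic_exchange(f, 2) != 0) {
		/*
		 * FUTEX_WAIT re-checks the futex value inside the
		 * kernel and returns immediately if it is no longer 2.
		 * With a short critical section, the holder often
		 * unlocks during the latency of getting here, so the
		 * waiter never actually sleeps.
		 */
		syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
	}
}

static void ww_unlock(atomic_int *f)
{
	/* Wake one waiter only if there may be any. */
	if (atomic_exchange(f, 0) == 2)
		syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}

The in-kernel value check is the "final check" mentioned above; all 
the code executed on the way to it is the latency that lets a short 
critical section complete before the waiter ever sleeps.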

For example, with a critical section of 50 pause instructions instead 
of 5, the performance gain is about 5X, versus about 1.6X in the 
5-pause case.

> I really wonder how the average programmer should pick the right flavour,
> not to talk about any useful decision for something like glibc to pick the
> proper one.

I would say that the TO futexes will have better performance in most 
cases. Of course, I still need to run some real-world benchmarks to 
quantify the effect of the new futexes. I am hoping to get suggestions 
for a good set of benchmarks to run.

Cheers,
Longman
