Message-ID: <57E99AF1.70209@hpe.com>
Date: Mon, 26 Sep 2016 18:02:25 -0400
From: Waiman Long <waiman.long@....com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: Davidlohr Bueso <dave@...olabs.net>,
Peter Zijlstra <peterz@...radead.org>,
Mike Galbraith <umgwanakikbuti@...il.com>,
Ingo Molnar <mingo@...nel.org>,
Jonathan Corbet <corbet@....net>,
<linux-kernel@...r.kernel.org>, <linux-doc@...r.kernel.org>,
Jason Low <jason.low2@....com>,
Scott J Norton <scott.norton@....com>,
Douglas Hatch <doug.hatch@....com>
Subject: Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes
On 09/23/2016 09:02 AM, Thomas Gleixner wrote:
> On Thu, 22 Sep 2016, Waiman Long wrote:
>>>> Locking was done mostly by lock stealing. This is where most of the
>>>> performance benefit comes from, not optimistic spinning.
>>> What does the lock latency distribution of all this look like, and how
>>> fair is the whole thing?
>> The TO futexes are unfair, as can be seen from the min/max thread times listed
>> above. It took the fastest thread 0.07s to complete all the locking
>> operations, whereas the slowest one needed 2.65s. However, the situation
>> reversed when I changed the critical section to a 1us sleep. In this case,
> 1us sleep is going to add another syscall and therefore scheduling, so what?
>
> Or did you just extend the critical section busy time?
The 1us sleep will cause the spinning to stop and make all the waiters
sleep. This is to simulate the extreme case where the TO futex may not
have a performance advantage.
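
To make the two cases concrete, here is a rough sketch of the kinds of
critical sections being compared (illustrative only; do_locked_work() is a
made-up name, not the actual microbenchmark code from this series):

#include <time.h>

static void do_locked_work(int use_sleep)
{
	if (use_sleep) {
		/* 1us sleep: the lock holder blocks, so waiters must sleep too. */
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };

		nanosleep(&ts, NULL);
	} else {
		/* Short busy section: a handful of pause instructions (x86). */
		for (int i = 0; i < 5; i++)
			__builtin_ia32_pause();
	}
}

With the sleeping variant the lock holder goes off the CPU, so an
optimistic spinner has nothing to spin on and ends up blocking as well.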
>
>> there will be no optimistic spinning. The performance results for 100k locking
>> operations are listed below.
>>
>>                  wait-wake futex   PI futex   TO futex
>>                  ---------------   --------   --------
>> max time              0.06s          9.32s      4.76s
>                        ^^^^ ????
Yes, wait-wake futex is the unfair one in this case.
>> min time              5.59s          9.36s      5.62s
>> average time          3.25s          9.35s      5.41s
>>
>> In this case, the TO futexes are fairer but perform worse than the wait-wake
>> futexes. That is because the lock handoff mechanism limits the amount of lock
>> stealing in the TO futexes, while the wait-wake futexes have no such
>> restriction. When I disabled lock handoff, the TO futexes would then perform
>> similarly to the wait-wake futexes.
> So the benefit of these newfangled futexes is only there for extremely short
> critical sections and a gazillion of threads fighting for the same futex,
> right?
Not really. Lock stealing helps performance when a gazillion of threads
are fighting for the same futex. Optimistic spinning helps to reduce the
lock transfer latency because the waiter isn't sleeping, no matter how
many threads there are. One set of data that I haven't shown so far is
that the performance delta between wait-wake and TO futexes actually
increases as the critical section is lengthened. This is because, for a
short critical section, the waiters of a wait-wake futex may not
actually go to sleep because of the latency introduced by the code that
has to run before they do a final check to see if the futex value has
changed before going to sleep. The longer the critical section, the
higher the chance that they actually sleep, and hence their performance
gets worse relative to the TO futexes.
For example, with a critical section of 50 pause instructions instead of
5, the performance gain is about 5X, compared with about 1.6X for the
shorter critical section.
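
For reference, below is a minimal sketch of the kind of wait-wake
locking loop I am comparing against (in the style of the classic
userspace futex mutex, not code from this patch series; ww_lock() and
ww_unlock() are made-up names). The unconditional cmpxchg on entry is
what allows unlimited lock stealing, and the in-kernel value recheck
that FUTEX_WAIT performs before sleeping is why the waiters of a very
short critical section often never actually go to sleep:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Futex states: 0 = unlocked, 1 = locked, 2 = locked with waiters. */

static int futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void ww_lock(atomic_int *f)
{
	int zero = 0;

	/* Uncontended fast path; this is also how a new arrival steals the lock. */
	if (atomic_compare_exchange_strong(f, &zero, 1))
		return;

	/* Mark the futex contended, then wait until it is free. */
	while (atomic_exchange(f, 2) != 0) {
		/*
		 * The kernel compares *f with 2 before putting the caller to
		 * sleep and returns immediately if the value has already
		 * changed; with a very short critical section this happens
		 * often enough that the waiter rarely sleeps at all.
		 */
		futex(f, FUTEX_WAIT, 2);
	}
}

static void ww_unlock(atomic_int *f)
{
	/* Release; wake one waiter only if the futex was marked contended. */
	if (atomic_exchange(f, 0) == 2)
		futex(f, FUTEX_WAKE, 1);
}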
> I really wonder how the average programmer should pick the right flavour,
> not to mention any useful decision for something like glibc to pick the
> proper one.
I would say that TO futexes will have better performance in most cases.
Of course, I still need to run some real-world benchmarks to quantify
the effect of the new futexes. I am hoping to get suggestions for a good
set of benchmarks to run.
Cheers,
Longman