Message-ID: <4BC6AE82.3070703@us.ibm.com>
Date: Wed, 14 Apr 2010 23:13:22 -0700
From: Darren Hart <dvhltc@...ibm.com>
To: linux-kernel@...r.kernel.org
CC: Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...e.hu>,
Eric Dumazet <eric.dumazet@...il.com>,
"Peter W. Morreale" <pmorreale@...ell.com>,
Rik van Riel <riel@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Gregory Haskins <ghaskins@...ell.com>,
Sven-Thorsten Dietrich <sdietrich@...ell.com>,
Chris Mason <chris.mason@...cle.com>,
John Cooper <john.cooper@...rd-harmonic.com>,
Chris Wright <chrisw@...s-sol.org>,
Ulrich Drepper <drepper@...il.com>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Avi Kivity <avi@...hat.com>,
Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [PATCH V5 0/4][RFC] futex: FUTEX_LOCK with optional adaptive spinning
dvhltc@...ibm.com wrote:
> Now that an advantage can be shown using FUTEX_LOCK_ADAPTIVE over FUTEX_LOCK,
> the next steps as I see them are:
>
> o Try and show improvement of FUTEX_LOCK_ADAPTIVE over FUTEX_WAIT based
> implementations (pthread_mutex specifically).
I've spent a bit of time on this, and made huge improvements through
some simple optimizations of the testcase lock/unlock routines. I'll be
away for a few days and wanted to let people know where things stand
with FUTEX_LOCK_ADAPTIVE.
I ran all the tests with the following options:
-i 1000000 -p 1000 -d 20
where:
-i iterations
-p period (in instructions)
-d duty cycle (in percent)
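As a rough illustration of what these options imply, here is a minimal sketch of how a period and duty cycle could translate into work done with and without the lock held. It assumes (this is not stated above) that the duty cycle is the share of each period spent holding the lock. The struct and helper names are hypothetical and are not taken from the testcase.

```c
#include <assert.h>

/* Hypothetical container for the three options used above. */
struct workload {
    unsigned long iterations; /* -i */
    unsigned long period;     /* -p, in instructions */
    unsigned int duty_pct;    /* -d, in percent */
};

/* ASSUMPTION: duty cycle = percentage of each period spent holding the lock. */
static unsigned long locked_instructions(const struct workload *w)
{
    assert(w->duty_pct <= 100);
    return w->period * w->duty_pct / 100;
}

/* The remainder of each period runs without the lock held. */
static unsigned long unlocked_instructions(const struct workload *w)
{
    return w->period - locked_instructions(w);
}
```

Under that assumption, -p 1000 -d 20 would mean roughly 200 instructions per iteration inside the critical section and 800 outside it.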
MECHANISM                 KITERS/SEC
------------------------------------
pthread_mutex_adaptive          1562
FUTEX_LOCK_ADAPTIVE             1190
pthread_mutex                   1010
FUTEX_LOCK                       532
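To make the gaps between mechanisms easier to compare, the following sketch computes relative throughput from the figures in the table. The numbers are copied from the table. The struct, array, and function names are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in thousands of iterations per second, from the table above. */
struct result {
    const char *mechanism;
    double kiters_per_sec;
};

static const struct result results[] = {
    { "pthread_mutex_adaptive", 1562.0 },
    { "FUTEX_LOCK_ADAPTIVE",    1190.0 },
    { "pthread_mutex",          1010.0 },
    { "FUTEX_LOCK",              532.0 },
};

/* Ratio of throughput a to throughput b. */
static double speedup(double a, double b)
{
    assert(b > 0.0);
    return a / b;
}

/* Print every mechanism relative to plain FUTEX_LOCK (the slowest entry). */
static void print_relative(void)
{
    const double base = results[3].kiters_per_sec;

    for (size_t i = 0; i < sizeof(results) / sizeof(results[0]); i++)
        printf("%-24s %6.0f  %.2fx\n", results[i].mechanism,
               results[i].kiters_per_sec,
               speedup(results[i].kiters_per_sec, base));
}
```

By this arithmetic FUTEX_LOCK_ADAPTIVE runs at about 2.24x the throughput of FUTEX_LOCK, and pthread_mutex_adaptive at about 1.31x that of FUTEX_LOCK_ADAPTIVE.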
I took some perf data while running each of the above tests as well. Any
thoughts on getting more from perf are appreciated; this is my first
pass at it. I recorded with "perf record -fg" and snippets of "perf
report" follow:
FUTEX_LOCK (not adaptive) spends a lot of time spinning on the futex
hashbucket lock.
# Overhead Command Shared Object Symbol
# ........ .......... .................. ......
#
40.76% futex_lock [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--62.16%-- do_futex
| sys_futex
| system_call_fastpath
| syscall
|
|--31.05%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| syscall
...
14.98% futex_lock futex_lock [.] locktest
FUTEX_LOCK_ADAPTIVE spends much of its time in the test loop itself,
followed by the actual adaptive loop in the kernel. It appears much of
our savings over FUTEX_LOCK comes from not contending on the hashbucket
lock.
# Overhead Command Shared Object Symbol
# ........ .......... .................. ......
#
36.07% futex_lock futex_lock [.] locktest
|
--- locktest
|
--100.00%-- 0x400e7000000000
9.12% futex_lock perf [.] 0x00000000000eee
...
8.26% futex_lock [kernel.kallsyms] [k] futex_spin_on_owner
Pthread Mutex Adaptive spends most of its time in the glibc heuristic
spinning, as expected, followed by the test loop itself. An impressively
minimal 3.35% is spent on the hashbucket lock.
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
47.88% pthread_mutex_2 libpthread-2.5.so [.] __pthread_mutex_lock_internal
|
--- __pthread_mutex_lock_internal
22.78% pthread_mutex_2 pthread_mutex_2 [.] locktest
...
15.16% pthread_mutex_2 perf [.] ...
...
3.35% pthread_mutex_2 [kernel.kallsyms] [k] _raw_spin_lock
Pthread Mutex (not adaptive) spends much of its time on the hashbucket
lock as expected, followed by the test loop.
33.89% pthread_mutex_2 [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--56.90%-- futex_wake
| do_futex
| sys_futex
| system_call_fastpath
| __lll_unlock_wake
|
|--28.95%-- futex_wait_setup
| futex_wait
| do_futex
| sys_futex
| system_call_fastpath
| __lll_lock_wait
...
16.60% pthread_mutex_2 pthread_mutex_2 [.] locktest
These results mostly confirm the expected: the adaptive versions spend
more time in their spin loops and less time contending for hashbucket
locks, while the non-adaptive versions take the hashbucket lock more
often and therefore show more contention there.
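For readers who want the spin-versus-block tradeoff in concrete terms, here is a minimal user-space sketch of a lock that spins briefly before sleeping in the kernel. It is not the FUTEX_LOCK_ADAPTIVE implementation (which spins in the kernel, in futex_spin_on_owner) and it is not glibc's code; it is a simplified three-state futex mutex built on the existing FUTEX_WAIT and FUTEX_WAKE operations. SPIN_LIMIT and all function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word states: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters. */
#define SPIN_LIMIT 100 /* hypothetical spin budget, not tuned */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void sketch_lock(atomic_int *m)
{
    /* Phase 1: bounded spin in user space; no syscall, no hashbucket lock. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int expected = 0;

        if (atomic_compare_exchange_strong(m, &expected, 1))
            return;
    }

    /*
     * Phase 2: mark the lock contended and sleep. Each FUTEX_WAIT call
     * takes the futex hashbucket lock inside the kernel.
     */
    while (atomic_exchange(m, 2) != 0)
        futex(m, FUTEX_WAIT_PRIVATE, 2);
}

static void sketch_unlock(atomic_int *m)
{
    int prev = atomic_exchange(m, 0);

    assert(prev != 0); /* unlocking an unlocked mutex is a caller bug */
    if (prev == 2)
        futex(m, FUTEX_WAKE_PRIVATE, 1); /* wake one sleeper */
}
```

With a short critical section, most acquisitions in a sketch like this would finish in the spin phase, so the FUTEX_WAIT and FUTEX_WAKE paths that appear under _raw_spin_lock in the non-adaptive profiles would be entered less often.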
I believe I should be able to get the plain FUTEX_LOCK implementation to
be much closer in performance to the plain pthread mutex version. I
expect much of the work done to benefit FUTEX_LOCK will also benefit
FUTEX_LOCK_ADAPTIVE. If that's true, and I can make a significant
improvement to FUTEX_LOCK, it wouldn't take much to get
FUTEX_LOCK_ADAPTIVE to beat the heuristics spinlock in glibc.
It could also be that this synthetic benchmark is an ideal situation for
glibc's heuristics, and a more realistic load with varying lock hold
times wouldn't favor the adaptive pthread mutex over FUTEX_LOCK_ADAPTIVE
by such a large margin.
More next week.
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team