linux-kernel - Re: [PATCH 0/5] futex: Wakeup optimizations

Message-ID: <1385188681.5402.25.camel@buesod1.americas.hpqcorp.net>
Date:	Fri, 22 Nov 2013 22:38:01 -0800
From:	Davidlohr Bueso <davidlohr@...com>
To:	Darren Hart <dvhart@...ux.intel.com>
Cc:	linux-kernel@...r.kernel.org, mingo@...nel.org,
	peterz@...radead.org, tglx@...utronix.de, efault@....de,
	jeffm@...e.com, torvalds@...ux-foundation.org, scott.norton@...com,
	tom.vaden@...com, aswin@...com, Waiman.Long@...com,
	jason.low2@...com
Subject: Re: [PATCH 0/5] futex: Wakeup optimizations

Hi Darren,

On Fri, 2013-11-22 at 21:55 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > We have been dealing with a customer database workload on a large
> > 12TB, 240-core, 16-socket NUMA system that exhibits high amounts
> > of contention on some of the locks that serialize internal futex
> > data structures. This workload especially suffers in the wakeup
> > paths, where waiting on the corresponding hb->lock can account for
> > up to ~60% of the time. The result of such calls can mostly be
> > classified as (i) nothing to wake up and (ii) waking up a large number
> > of tasks.
> 
> With as many cores as you have, have you done any analysis of how
> effective the hashing algorithm is, and would more buckets relieve some
> of the contention.... ah, I see below that you did. Nice work.
> 
> > Before these patches are applied, we can see this pathological behavior:
> > 
> >  37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
> >             --- _raw_spin_lock
> >              |
> >              |--97.14%-- futex_wake
> >              |          do_futex
> >              |          sys_futex
> >              |          system_call_fastpath
> >              |          |
> >              |          |--99.70%-- 0x7f383fbdea1f
> >              |          |           yyy
> > 
> >  43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
> >             --- _raw_spin_lock
> >              |
> >              |--53.74%-- futex_wake
> >              |          do_futex
> >              |          sys_futex
> >              |          system_call_fastpath
> >              |          |
> >              |          |--99.40%-- 0x7fe7d44a4c05
> >              |          |           zzz
> >              |--45.90%-- futex_wait_setup
> >              |          futex_wait
> >              |          do_futex
> >              |          sys_futex
> >              |          system_call_fastpath
> >              |          0x7fe7ba315789
> >              |          syscall
> > 
> 
> Sorry to be dense, can you spell out how 60% falls out of these numbers?

By adding, for each of the two profiles, the _raw_spin_lock percentage
multiplied by its futex_wake() share: 37.12% * 97.14% + 43.71% * 53.74%,
which comes to roughly 60%.
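
The arithmetic behind the ~60% figure can be checked with a short sketch.
The helper name wake_lock_share() is made up for illustration; the inputs
are the percentages from the two perf profiles quoted above.

```c
#include <stdio.h>

/*
 * Share of all samples spent in _raw_spin_lock when called from
 * futex_wake: the _raw_spin_lock percentage scaled by the fraction of
 * those samples whose caller is futex_wake.
 * (Hypothetical helper, not part of the patchset.)
 */
double wake_lock_share(double spin_lock_pct, double futex_wake_pct)
{
    return spin_lock_pct * futex_wake_pct / 100.0;
}

/* Sum of the two profiles shown in the cover letter. */
double total_wake_lock_share(void)
{
    return wake_lock_share(37.12, 97.14) + wake_lock_share(43.71, 53.74);
}

/* Convenience printer for the two terms and their sum. */
void print_wake_lock_share(void)
{
    printf("%.2f%% + %.2f%% = %.2f%%\n",
           wake_lock_share(37.12, 97.14),
           wake_lock_share(43.71, 53.74),
           total_wake_lock_share());
}
```

Calling print_wake_lock_share() prints the two terms and their sum, which
lands just under 60%.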

> 
> > 
> > With these patches, contention is practically nonexistent:
> > 
> >  0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
> >                --- _raw_spin_lock
> >                 |
> >                 |--76.06%-- futex_wait_setup
> >                 |          futex_wait
> >                 |          do_futex
> >                 |          sys_futex
> >                 |          system_call_fastpath
> >                 |          |
> >                 |          |--99.90%-- 0x7f3165e63789
> >                 |          |          syscall|
> >                            ...
> >                 |--6.27%-- futex_wake
> >                 |          do_futex
> >                 |          sys_futex
> >                 |          system_call_fastpath
> >                 |          |
> >                 |          |--54.56%-- 0x7f317fff2c05
> >                 ...
> > 
> > Patches 1 & 2 are cleanups and micro optimizations.
> > 
> > Patch 3 addresses the well-known issue of the global hash table.
> > By creating a larger and NUMA-aware table, we can reduce the false
> > sharing and collisions, thus reducing the chance of different futexes
> > sharing the same hb->lock.
> > 
> > Patch 4 reduces contention on the corresponding hb->lock by not trying to
> > acquire it if there are no blocked tasks in the waitqueue.
> > This particularly deals with point (i) above, where we see that it is not
> > uncommon for up to 90% of wakeup calls to end up returning 0, indicating
> > that no tasks were woken.
> 
> Can you determine how much benefit comes from 3 and how much additional
> benefit comes from 4?

While I don't have specific per-patch data, there are indications that
the workload mostly deals with a handful of futexes. So it's pretty safe
to assume that patch 4 is the one with the most benefit for _this_
particular workload.
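
To make the idea behind patch 4 concrete, here is a rough userspace sketch
of the "skip the lock when nobody is waiting" check. It is not the kernel
code: struct bucket, bucket_enqueue() and bucket_wake() are made-up names,
a pthread mutex stands in for hb->lock, and the ordering against the futex
word that a real implementation has to get right is not modelled at all.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a futex hash bucket. */
struct bucket {
    pthread_mutex_t lock;   /* plays the role of hb->lock */
    atomic_int waiters;     /* tasks queued or about to queue */
    int queued;             /* queue length, protected by lock */
};

void bucket_init(struct bucket *b)
{
    pthread_mutex_init(&b->lock, NULL);
    atomic_init(&b->waiters, 0);
    b->queued = 0;
}

/*
 * Waiter side: bump the counter before taking the lock. A real
 * implementation must also order this against the check of the futex
 * word; this sketch does not model that.
 */
void bucket_enqueue(struct bucket *b)
{
    atomic_fetch_add(&b->waiters, 1);
    pthread_mutex_lock(&b->lock);
    b->queued++;
    pthread_mutex_unlock(&b->lock);
}

/* Waker side: returns the number of waiters dequeued. */
int bucket_wake(struct bucket *b, int nr_wake)
{
    int woken = 0;

    /* Fast path: nobody waiting, so never touch the lock. */
    if (atomic_load(&b->waiters) == 0)
        return 0;

    pthread_mutex_lock(&b->lock);
    while (b->queued > 0 && woken < nr_wake) {
        b->queued--;
        atomic_fetch_sub(&b->waiters, 1);
        woken++;
    }
    pthread_mutex_unlock(&b->lock);
    return woken;
}
```

With a workload where most wake calls return 0, the fast path in
bucket_wake() is the common case, so most callers would not contend on
the lock at all.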

> 
> > 
> > Patch 5 resurrects a two-year-old idea from Peter Zijlstra to delay
> > the waking of the blocked tasks to be done without holding the hb->lock:
> > https://lkml.org/lkml/2011/9/14/118
> > 
> > This is useful for locking primitives that can effect multiple wakeups
> > per operation and want to avoid the futex's internal spinlock contention by
> > delaying the wakeups until we've released the hb->lock.
> > This particularly deals with point (ii) above, where we can observe that
> > the wake calls occasionally end up waking 125 to 200 waiters in what we
> > believe are RW locks in the application.
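
The delayed-wakeup idea described above can be sketched in userspace
roughly as follows. This is only an illustration under simplifying
assumptions: struct waiter, struct wait_bucket and wake_deferred() are
made-up names, a pthread mutex stands in for hb->lock, and setting a flag
stands in for actually waking a task.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical waiter record; 'woken' stands in for a real wakeup. */
struct waiter {
    struct waiter *next;
    int woken;
};

/* Hypothetical bucket: a lock plus a list of blocked waiters. */
struct wait_bucket {
    pthread_mutex_t lock;   /* plays the role of hb->lock */
    struct waiter *head;
};

void wait_bucket_init(struct wait_bucket *b)
{
    pthread_mutex_init(&b->lock, NULL);
    b->head = NULL;
}

void wait_bucket_add(struct wait_bucket *b, struct waiter *w)
{
    w->woken = 0;
    pthread_mutex_lock(&b->lock);
    w->next = b->head;
    b->head = w;
    pthread_mutex_unlock(&b->lock);
}

/* Dequeue up to nr_wake waiters under the lock, wake them after. */
int wake_deferred(struct wait_bucket *b, int nr_wake)
{
    struct waiter *batch = NULL;
    int n = 0;

    /* Under the lock: only unlink waiters onto a private list. */
    pthread_mutex_lock(&b->lock);
    while (b->head != NULL && n < nr_wake) {
        struct waiter *w = b->head;
        b->head = w->next;
        w->next = batch;
        batch = w;
        n++;
    }
    pthread_mutex_unlock(&b->lock);

    /* Lock released: do the (potentially slow) wakeups now. */
    while (batch != NULL) {
        struct waiter *w = batch;
        batch = w->next;
        w->next = NULL;
        w->woken = 1;
    }
    return n;
}
```

The lock hold time then depends only on unlinking list entries, not on
how long each of the 125 to 200 wakeups takes.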
> > 
> > This patchset has also been tested on smaller systems for a variety of
> > benchmarks, including Java workloads, kernel builds and custom bang-the-hell-out-of
> > hb locks programs. So far, no functional or performance regressions have been seen.
> > Furthermore, no issues were found when running the different tests in the futextest 
> > suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
> 
> Excellent. Would you be able to contribute any of these (C only please)
> to the stress test group?
> 

Sure.

Thanks,
Davidlohr

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/