linux-kernel - Re: [PATCH -tip] introduce sys_membarrier(): process-wide memory barrier (v9)

Message-ID: <20100304122304.GA6864@elte.hu>
Date:	Thu, 4 Mar 2010 13:23:04 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Nicholas Miell <nmiell@...cast.net>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	laijs@...fujitsu.com, dipankar@...ibm.com,
	akpm@...ux-foundation.org, josh@...htriplett.org,
	dvhltc@...ibm.com, niv@...ibm.com, tglx@...utronix.de,
	peterz@...radead.org, Valdis.Kletnieks@...edu, dhowells@...hat.com,
	linux-kernel@...r.kernel.org, Nick Piggin <npiggin@...e.de>,
	Chris Friesen <cfriesen@...tel.com>,
	Frédéric Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH -tip] introduce sys_membarrier(): process-wide memory
 barrier (v9)


* Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:

> I am proposing this patch for the 2.6.34 merge window, as I think it is 
> ready for inclusion.

It's a bit late for this merge window i think.

> Here is an implementation of a new system call, sys_membarrier(), which 
> executes a memory barrier on all threads of the current process. It can be 
> used to distribute the cost of user-space memory barriers asymmetrically by 
> transforming pairs of memory barriers into pairs consisting of 
> sys_membarrier() and a compiler barrier. For synchronization primitives that 
> distinguish between read-side and write-side (e.g. userspace RCU, rwlocks), 
> the read-side can be accelerated significantly by moving the bulk of the 
> memory barrier overhead to the write-side.
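
[ For readers following along, here is a minimal sketch of the asymmetric
  pairing described in the quoted text. It is not taken from the patch.
  process_wide_barrier() is a hypothetical stand-in for the proposed
  sys_membarrier(); here it is only a local full fence so that the sketch
  compiles and runs, which is NOT equivalent to the real thing. The names
  reader_active, shared_value, read_side() and write_side() are likewise
  assumptions made for illustration. ]

```c
#include <stdatomic.h>

/*
 * HYPOTHETICAL stand-in for the proposed sys_membarrier(). The real call
 * would force a memory barrier on every running thread of the process.
 * A local fence does not do that; it only keeps this sketch runnable.
 */
static void process_wide_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

static atomic_int reader_active;  /* nonzero while a reader is in its section */
static atomic_int shared_value;   /* data published by the writer */

/* Read side: the hardware barriers are replaced by compiler barriers. */
static int read_side(void)
{
    int v;

    atomic_store_explicit(&reader_active, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);   /* compiler barrier only */
    v = atomic_load_explicit(&shared_value, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);   /* compiler barrier only */
    atomic_store_explicit(&reader_active, 0, memory_order_relaxed);
    return v;
}

/* Write side: pays for the barriers on behalf of all readers. */
static void write_side(int v)
{
    atomic_store_explicit(&shared_value, v, memory_order_relaxed);
    process_wide_barrier();
    /* Wait for any reader that was already inside its section. */
    while (atomic_load_explicit(&reader_active, memory_order_relaxed))
        ;
    process_wide_barrier();
}
```

[ With a single global reader flag this only models one reader; a real
  user-space RCU implementation keeps per-thread state. The point is just
  where the cost lands: read_side() contains no hardware barrier at all. ]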

Why is this such a low level and still special-purpose facility?

Synchronization facilities for high-performance threading may want to do a bit 
more than just execute a barrier instruction on another CPU that has a 
relevant thread running.

You cited signal based numbers:

 > (what we have now, with dynamic sys_membarrier check, expedited scheme)
 > memory barriers in reader: 907693804 reads, 817793 writes
 > sys_membarrier scheme:    4316818891 reads, 503790 writes
 >
 > (dynamic sys_membarrier check, non-expedited scheme)
 > memory barriers in reader: 907693804 reads, 817793 writes
 > sys_membarrier scheme:    8698725501 reads,    313 writes

Much of that signal handler overhead is i think due to:

 - FPU/SSE context save/restore
 - the need to wake up, run and deschedule all threads

Instead i'd suggest you try to implement user-space RCU speedups not 
via the new sys_membarrier() syscall, but via two new signal extensions:

 - SA_NOFPU: on x86, skip the FPU/SSE save/restore for such fast in/out 
   special-purpose signal handlers. (i can whip up a quick patch for you if
   you want)

 - SA_RUNNING: a way to signal only running threads - as a way for user-space 
   based concurrency control mechanisms to deschedule running threads (or, like
   in your case, to implement barrier / garbage collection schemes).

   ( Note: to properly sync back you'll also need an sa_info field to tell
     target tasks how many tasks were woken up. That way a futex can be used 
     as a semaphore to signal back to the issuing thread, and make it all 
     properly event triggered and nicely scalable. Also, queued signals are a 
     must for such a scheme. )

My estimate is that this will be _much_ faster than the naive signal based 
approach - maybe even quite comparable to an open-coded sys_membarrier():

 - as most of the overhead in a real scenario ought to be the IPI sending and 
   latency - not the syscall entry/exit. (with a signal approach we'd still go
   into target thread user-mode, so one more syscall exit+re-entry)

 - or for the common case where there are no other threads running, we are 
   just in/out of SA_RUNNING without having to do any synchronization. In that
   case it should be quite close to sys_membarrier() - modulo some minimal 
   signal API overhead. [which we could optimize some more, if it's visible in
   your benchmarks.]

Signals per se are pretty scalable these days - now that most of the fastpaths 
are decoupled from tasklist_lock and everything is RCU-ized.

Further benefits are:

 - both SA_NOFPU and SA_RUNNING could be used by a _lot_ more user-space 
   facilities than just user-space RCU.

 - synergistic effects: growing some real high-performance facility based on 
   signals would ensure further signal speedups in the future as well. 
   Currently any server app that runs into signal limitations tends to shy 
   away from them and use some different (and often inferior) signalling 
   scheme. It would be better to extend signals with 'lightweight' 
   capabilities as well.

All in all, signals are used by something like 99.9% of Linux apps, while 
sys_membarrier() would be used only by [WAG] 0.00001% of them.

So before we can merge this (at least via the RCU tree, which you have sent it 
to), i'd like to see you try _much_, _MUCH_ harder to fix the very obvious 
signal overhead performance problems you have demoed via the numbers above so 
nicely.

If _that_ fails, and if we get all the fruits of that, _then_ we might 
perhaps, with a lot of hesitation, concede defeat and think about adding yet 
another syscall.

I know it's cool to add a brand new syscall - but, unfortunately, in practice 
it doesn't help Linux apps all that much. (at least until we have tools/klibc/ 
or so.)

[ There are also a few small cleanliness details i noticed in your patch: enums
  are a tiny bit nicer for ABIs than #define's, the #ifdef SMP is ugly, etc. - 
  but it doesn't really matter much as i think we should concentrate on the
  scalability problems of signals first. ]

Thanks,

	Ingo