linux-kernel - Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier

Message-ID: <20100107174505.GA9612@Krystal>
Date:	Thu, 7 Jan 2010 12:45:05 -0500
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Josh Triplett <josh@...htriplett.org>
Cc:	linux-kernel@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>, akpm@...ux-foundation.org,
	tglx@...utronix.de, peterz@...radead.org, rostedt@...dmis.org,
	Valdis.Kletnieks@...edu, dhowells@...hat.com, laijs@...fujitsu.com,
	dipankar@...ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory
	barrier

* Josh Triplett (josh@...htriplett.org) wrote:
> On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
[...]
> > Just tried it with a 10,000,000 iterations loop.
> > 
> > The thread doing the system call loop takes 2.0% of user time, 98% of
> > system time. All other cpus are nearly 100.0% idle. Just to give a bit
> > more info about my test setup, I also have a thread sitting on a CPU
> > busy-waiting for the loop to complete. This thread takes 97.7% user
> > time (but it really is just there to make sure we are indeed doing the
> > IPIs, not skipping it through the thread_group_empty(current) test). If
> > I remove this thread, the execution time of the test program shrinks
> > from 32 seconds down to 1.9 seconds. So yes, the IPI is actually
> > executed in the first place, because removing the extra thread
> > accelerates the loop tremendously. I used a 8-core Xeon to test.
> 
> Do you know if the kernel properly measures the overhead of IPIs?  The
> CPUs might have only looked idle.  What about running some kind of
> CPU-bound benchmark on the other CPUs and testing the completion time
> with and without the process running the membarrier loop?

Good point. Just tried with a cache-hot kernel compilation using 6/8 CPUs.

Normally:                                              real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s

So... the unrelated processes become 2x slower. That hurts.

So let's try allocating a cpumask for PeterZ's scheme. I prefer to have a
small allocation overhead and benefit from cpumask broadcast if
possible so we scale better. But that all depends on how big the
allocation overhead is.

Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls; one thread is doing the sys_membarrier, the others are busy
looping):

IPI to all:                                            real 0m44.708s
alloc cpumask+local mb()+IPI-many to 1 thread:         real 1m2.034s

So, roughly, the cpumask allocation overhead is 17s here, not exactly
cheap. So let's see when it becomes better than single IPIs:

local mb()+single IPI to 1 thread:                     real 0m29.502s
local mb()+single IPI to 7 threads:                    real 2m30.971s

So, roughly, the single IPI overhead is 120s here for 6 more threads,
or about 20s per thread.
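
As a rough check of the arithmetic behind the 17s and 20s figures, here
is a small userspace sketch. It is not part of the patch; the helper
names real_to_seconds() and cost_per_extra() are hypothetical, and the
inputs are the "real" times quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a time(1)-style "real" value such as "2m30.971s" to seconds. */
double real_to_seconds(const char *s)
{
	unsigned int minutes = 0;
	double seconds = 0.0;
	int matched = sscanf(s, "%um%lfs", &minutes, &seconds);

	assert(matched == 2);	/* both the minutes and seconds fields must parse */
	return minutes * 60.0 + seconds;
}

/* Extra cost per additional thread, derived from two measurements. */
double cost_per_extra(double t_low, unsigned int n_low,
		      double t_high, unsigned int n_high)
{
	assert(n_high > n_low);
	return (t_high - t_low) / (double)(n_high - n_low);
}
```

Feeding it the 1-thread and 7-thread single-IPI times gives a bit over
20s per extra thread, and subtracting the "IPI to all" time from the
"alloc cpumask" time gives a bit over 17s, both for the whole
10,000,000-call run.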

Here is what we can do: given that the cpumask allocation costs almost
half as much as sending a single IPI, as we iterate on the CPUs, for
the, say, first N CPUs (ourselves and 1 CPU that needs to have an IPI
sent), we send a "single IPI". This will be N-1 IPIs and a local
function call. If we need more than that, then we switch to the cpumask
allocation and send a broadcast IPI to the cpumask we construct for the
rest of the CPUs. Let's call it the "adaptive IPI scheme".
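
The decision logic of the scheme described above can be sketched in
userspace C as follows. This is only an illustration under stated
assumptions, not the kernel implementation: a 64-bit word stands in for
a cpumask, and plan_ipis(), struct ipi_plan and SINGLE_IPI_MAX are
hypothetical names. It only plans which CPUs would get a single IPI and
which would go into the broadcast mask; it sends nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical knob: remote CPUs that get a single IPI before we
 * pay for a cpumask allocation (N-1 in the description above). */
#define SINGLE_IPI_MAX 1

struct ipi_plan {
	unsigned int single_ipis;	/* remote CPUs signalled one at a time */
	bool mask_allocated;		/* true if the cpumask path is taken */
	uint64_t broadcast_mask;	/* CPUs covered by the IPI-many call */
};

/*
 * running_mask: bit n set if CPU n currently runs a thread of the
 * calling process (an assumption of this sketch).
 * self_cpu: the CPU executing the call; it only needs a local mb().
 */
struct ipi_plan plan_ipis(uint64_t running_mask, unsigned int self_cpu)
{
	struct ipi_plan plan = { 0, false, 0 };
	uint64_t remote;
	unsigned int cpu;

	assert(self_cpu < 64);
	remote = running_mask & ~(UINT64_C(1) << self_cpu);

	for (cpu = 0; cpu < 64; cpu++) {
		if (!(remote & (UINT64_C(1) << cpu)))
			continue;
		if (plan.single_ipis < SINGLE_IPI_MAX) {
			/* Cheap path: one IPI, no allocation. */
			plan.single_ipis++;
		} else {
			/* Remaining CPUs go into the broadcast mask. */
			plan.mask_allocated = true;
			plan.broadcast_mask |= UINT64_C(1) << cpu;
		}
	}
	return plan;
}
```

With two CPUs in use (0x3, caller on CPU 0) the plan is one single IPI
and no allocation; with eight (0xFF) it is one single IPI plus a
broadcast to the remaining six CPUs (mask 0xFC).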

For my Intel Xeon E5405:

Just doing local mb()+single IPI to T other threads:

T=1: 0m29.219s
T=2: 0m46.310s
T=3: 1m10.172s
T=4: 1m24.822s
T=5: 1m43.205s
T=6: 2m15.405s
T=7: 2m31.207s

Just doing cpumask alloc+IPI-many to T other threads:

T=1: 0m39.605s
T=2: 0m48.566s
T=3: 0m50.167s
T=4: 0m57.896s
T=5: 0m56.411s
T=6: 1m0.536s
T=7: 1m12.532s

So I think the right threshold should be around 2 threads (assuming
other architectures behave like mine). So starting at 3 threads, we
allocate the cpumask and send IPIs.
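
To make the crossover explicit, the two tables can be compared
programmatically. The sketch below is illustrative only:
first_crossover() is a hypothetical helper, and the arrays are the
measured times above converted to seconds for T = 1..7.

```c
#include <assert.h>
#include <stddef.h>

/* Measured times in seconds for T = 1..7 other threads. */
const double single_ipi_s[] = {
	29.219, 46.310, 70.172, 84.822, 103.205, 135.405, 151.207
};
const double ipi_many_s[] = {
	39.605, 48.566, 50.167, 57.896, 56.411, 60.536, 72.532
};

/*
 * Return the smallest T (1-based) at which scheme b is faster than
 * scheme a, or 0 if b never wins within the n measurements.
 */
unsigned int first_crossover(const double *a, const double *b, size_t n)
{
	size_t i;

	assert(a && b);
	for (i = 0; i < n; i++) {
		if (b[i] < a[i])
			return (unsigned int)(i + 1);
	}
	return 0;
}
```

On these numbers, first_crossover(single_ipi_s, ipi_many_s, 7) returns
3: single IPIs still win at T=2, by about 2.3s over the whole run, and
the cpumask path wins from T=3 onward.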

How does that sound ?

[...]

> 
> > > - Part of me thinks this ought to become slightly more general, and just
> > >   deliver a signal that the receiving thread could handle as it likes.
> > >   However, that would certainly prove more expensive than this, and I
> > >   don't know that the generality would buy anything.
> > 
> > A general scheme would have to call every thread, even those which are
> > not running. In the case of this system call, this is a particular case
> > where we can forget about non-running threads, because the memory
> > barrier is implied by the scheduler activity that brought them offline.
> > So I really don't see how we can use this IPI scheme for other things
> > than this kind of synchronization.
> 
> No, I don't mean non-running threads.  If you wanted that, you could do
> what urcu currently does, and send a signal to all threads.  I meant
> something like "signal all *running* threads from my process".

Well, if you find me a real-life use-case, then we can surely look into
that ;)

> 
> > > - Could you somehow register reader threads with the kernel, in a way
> > >   that makes them easy to detect remotely?
> > 
> > There are two ways I figure out we could do this. One would imply adding
> > extra shared data between kernel and userspace (which I'd like to avoid,
> > to keep coupling low). The other alternative would be to add per
> > task_struct information about this, and new system calls. The added per
> > task_struct information would use up cache lines (which are very
> > important, especially in the task_struct) and the added system call at
> > rcu_read_lock/unlock() would simply kill performance.
> 
> No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
> I meant that you would do a system call when the reader threads start,
> saying "hey, reader thread here".

Hrm, we need to inform the userspace RCU library that this thread is
present too. So I don't see how going through the kernel helps us there.

Thanks,

Mathieu

> 
> - Josh Triplett

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68