Message-ID: <20100107174505.GA9612@Krystal>
Date: Thu, 7 Jan 2010 12:45:05 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To: Josh Triplett <josh@...htriplett.org>
Cc: linux-kernel@...r.kernel.org,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Ingo Molnar <mingo@...e.hu>, akpm@...ux-foundation.org,
tglx@...utronix.de, peterz@...radead.org, rostedt@...dmis.org,
Valdis.Kletnieks@...edu, dhowells@...hat.com, laijs@...fujitsu.com,
dipankar@...ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory
barrier
* Josh Triplett (josh@...htriplett.org) wrote:
> On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
[...]
> > Just tried it with a 10,000,000 iterations loop.
> >
> > The thread doing the system call loop takes 2.0% of user time, 98% of
> > system time. All other cpus are nearly 100.0% idle. Just to give a bit
> > more info about my test setup, I also have a thread sitting on a CPU
> > busy-waiting for the loop to complete. This thread takes 97.7% user
> > time (but it really is just there to make sure we are indeed doing the
> > IPIs, not skipping them via the thread_group_empty(current) test). If
> > I remove this thread, the execution time of the test program shrinks
> > from 32 seconds down to 1.9 seconds. So yes, the IPIs are actually
> > being executed, because removing the extra thread accelerates the
> > loop tremendously. I used an 8-core Xeon to test.
>
> Do you know if the kernel properly measures the overhead of IPIs? The
> CPUs might have only looked idle. What about running some kind of
> CPU-bound benchmark on the other CPUs and testing the completion time
> with and without the process running the membarrier loop?
Good point. Just tried with a cache-hot kernel compilation using 6/8 CPUs.
Normally: real 2m41.852s
With the sys_membarrier loop + 1 busy-looping thread running: real 5m41.830s
So... the unrelated processes become 2x slower. That hurts.
So let's try allocating a cpumask for PeterZ's scheme. I prefer to pay a
small allocation overhead and benefit from the cpumask broadcast if
possible, so we scale better. But that all depends on how big the
allocation overhead is.
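
Roughly, the cpumask variant I am benchmarking looks like this (a
simplified sketch only, not the actual RFC patch: the helper names are
made up, and the thread-iteration/locking details are glossed over):

#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

static void membarrier_ipi(void *unused)
{
        smp_mb();       /* order memory accesses on the remote CPU */
}

static int membarrier_cpumask_broadcast(void)
{
        cpumask_var_t tmpmask;
        struct task_struct *t;

        /* This GFP_KERNEL allocation is the overhead being measured. */
        if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
                return -ENOMEM;
        cpumask_clear(tmpmask);

        preempt_disable();      /* stable smp_processor_id() while sending IPIs */
        rcu_read_lock();
        t = current;
        do {
                /* Collect the CPUs which may be running our threads. */
                cpumask_set_cpu(task_cpu(t), tmpmask);
        } while_each_thread(current, t);
        rcu_read_unlock();
        cpumask_clear_cpu(smp_processor_id(), tmpmask);

        smp_mb();       /* order memory accesses on the calling CPU */
        smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
        preempt_enable();

        free_cpumask_var(tmpmask);
        return 0;
}

SYSCALL_DEFINE0(membarrier)
{
        if (thread_group_empty(current))
                return 0;       /* single-threaded: nothing to order */
        return membarrier_cpumask_broadcast();
}
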
Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls; one thread does the sys_membarrier calls, the others busy-loop):
IPI to all: real 0m44.708s
alloc cpumask+local mb()+IPI-many to 1 thread: real 1m2.034s
So, roughly, the cpumask allocation overhead is 17s here (1m2s - 45s),
not exactly cheap. So let's see when it becomes better than single IPIs:
local mb()+single IPI to 1 thread: real 0m29.502s
local mb()+single IPI to 7 threads: real 2m30.971s
So, roughly, the single-IPI overhead is 120s here for 6 more threads
(2m31s - 0m30s), i.e. about 20s per thread.
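
(For reference, the user-space side of these measurements is essentially
the loop below. Simplified sketch: __NR_membarrier is a deliberate
placeholder (-1, which harmlessly returns -ENOSYS), to be replaced by the
syscall number the RFC patch assigns, and CPU affinity setup is omitted.)

#define _GNU_SOURCE
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier -1      /* placeholder: take the number from the RFC patch */
#endif

#define NR_LOOPS 10000000UL

static volatile int test_stop;

static void *busy_thread(void *arg)
{
        while (!test_stop)
                ;               /* burn a CPU so the IPIs have a live target */
        return NULL;
}

int main(int argc, char **argv)
{
        int nr_busy = argc > 1 ? atoi(argv[1]) : 1;
        pthread_t *tids = calloc(nr_busy, sizeof(*tids));
        unsigned long i;

        for (i = 0; i < (unsigned long)nr_busy; i++)
                pthread_create(&tids[i], NULL, busy_thread, NULL);
        for (i = 0; i < NR_LOOPS; i++)
                syscall(__NR_membarrier);       /* the call being timed */
        test_stop = 1;
        for (i = 0; i < (unsigned long)nr_busy; i++)
                pthread_join(tids[i], NULL);
        free(tids);
        return 0;
}
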
Here is what we can do: given that the cpumask allocation costs almost
half as much as sending a single IPI, as we iterate on the CPUs, for the,
say, first N CPUs (ourself and the N-1 CPUs that need an IPI), we send
"single IPIs". This amounts to N-1 IPIs and a local function call. If we
need more than that, then we switch to the cpumask allocation and send a
broadcast IPI to the cpumask we construct for the rest of the CPUs. Let's
call it the "adaptive IPI scheme".
For my Intel Xeon E5405:
Just doing local mb()+single IPI to T other threads:
T=1: 0m29.219s
T=2: 0m46.310s
T=3: 1m10.172s
T=4: 1m24.822s
T=5: 1m43.205s
T=6: 2m15.405s
T=7: 2m31.207s
Just doing cpumask alloc+IPI-many to T other threads:
T=1: 0m39.605s
T=2: 0m48.566s
T=3: 0m50.167s
T=4: 0m57.896s
T=5: 0m56.411s
T=6: 1m0.536s
T=7: 1m12.532s
So I think the right threshold should be around 2 threads (assuming
other architectures behave like mine). So starting at 3 threads, we
allocate the cpumask and send the broadcast IPI, as sketched below.
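
Concretely, the adaptive scheme could look roughly like this (again just
a sketch building on the one above: MEMBARRIER_ADAPT_THRESHOLD is a
made-up name for the 2-thread cutoff, and membarrier_ipi() /
membarrier_cpumask_broadcast() are the helpers sketched earlier):

/* Hypothetical cutoff derived from the measurements above. */
#define MEMBARRIER_ADAPT_THRESHOLD      2

/*
 * Adaptive IPI scheme (sketch): up to the threshold, send individual
 * IPIs and avoid any allocation; above it, fall back to the cpumask
 * allocation plus one broadcast IPI.
 */
static int membarrier_adaptive(void)
{
        struct task_struct *t;
        int nr_remote = 0;

        /* First pass: count the other threads of this process. */
        rcu_read_lock();
        t = current;
        do {
                if (t != current)
                        nr_remote++;
        } while_each_thread(current, t);
        rcu_read_unlock();

        if (nr_remote > MEMBARRIER_ADAPT_THRESHOLD)
                return membarrier_cpumask_broadcast();  /* sketched earlier */

        smp_mb();       /* order memory accesses on the calling CPU */

        /* Few threads: individual IPIs, no cpumask allocation. */
        rcu_read_lock();
        t = current;
        do {
                if (t != current)
                        smp_call_function_single(task_cpu(t), membarrier_ipi,
                                                 NULL, 1);
        } while_each_thread(current, t);
        rcu_read_unlock();
        return 0;
}

The two thread-group walks could of course be folded into one; they are
kept separate here only to make the fallback obvious.
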
How does that sound?
[...]
>
> > > - Part of me thinks this ought to become slightly more general, and just
> > > deliver a signal that the receiving thread could handle as it likes.
> > > However, that would certainly prove more expensive than this, and I
> > > don't know that the generality would buy anything.
> >
> > A general scheme would have to signal every thread, even those that are
> > not running. This system call is a particular case where we can forget
> > about non-running threads, because the memory barrier is implied by the
> > scheduler activity that took them off the CPU. So I really don't see how
> > we could use this IPI scheme for anything other than this kind of
> > synchronization.
>
> No, I don't mean non-running threads. If you wanted that, you could do
> what urcu currently does, and send a signal to all threads. I meant
> something like "signal all *running* threads from my process".
Well, if you find me a real-life use-case, then we can surely look into
that ;)
>
> > > - Could you somehow register reader threads with the kernel, in a way
> > > that makes them easy to detect remotely?
> >
> > There are two ways I can see to do this. One would imply adding extra
> > shared data between kernel and userspace (which I'd like to avoid, to
> > keep coupling low). The other would be to add per-task_struct
> > information about this, plus new system calls. The added per-task_struct
> > information would use up cache lines (which are precious, especially in
> > the task_struct), and an added system call in rcu_read_lock/unlock()
> > would simply kill performance.
>
> No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
> I meant that you would do a system call when the reader threads start,
> saying "hey, reader thread here".
Hrm, we need to inform the userspace RCU library that this thread is
present too. So I don't see how going through the kernel helps us there.
Thanks,
Mathieu
>
> - Josh Triplett
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68