Message-ID: <20100110011144.GD25790@Krystal>
Date: Sat, 9 Jan 2010 20:11:44 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Oleg Nesterov <oleg@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
akpm@...ux-foundation.org, josh@...htriplett.org,
tglx@...utronix.de, Valdis.Kletnieks@...edu, dhowells@...hat.com,
laijs@...fujitsu.com, dipankar@...ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory
barrier

* Paul E. McKenney (paulmck@...ux.vnet.ibm.com) wrote:
> On Sat, Jan 09, 2010 at 02:20:06PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@...ux.vnet.ibm.com) wrote:
> > > On Fri, Jan 08, 2010 at 09:38:42PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@...ux.vnet.ibm.com) wrote:
> > > > > On Fri, Jan 08, 2010 at 08:02:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@...ux.vnet.ibm.com) wrote:
> > > > > > > On Fri, Jan 08, 2010 at 06:53:38PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Steven Rostedt (rostedt@...dmis.org) wrote:
> > > > > > > > > Well, if we just grab the task_rq(task)->lock here, then we should be
> > > > > > > > > OK? We would guarantee that curr is either the task we want or not.
> > > > > > > >
> > > > > > > > Hrm, I just tested it, and there seems to be a significant performance
> > > > > > > > penalty involved in taking these locks for each CPU, even with just 8
> > > > > > > > cores. So if we can do without the locks, that would be preferred.
> > > > > > >
> > > > > > > How significant? Factor of two? Two orders of magnitude?
> > > > > > >
> > > > > >
> > > > > > On an 8-core Intel Xeon (T is the number of threads receiving the IPIs):
> > > > > >
> > > > > > Without runqueue locks:
> > > > > >
> > > > > > T=1: 0m13.911s
> > > > > > T=2: 0m20.730s
> > > > > > T=3: 0m21.474s
> > > > > > T=4: 0m27.952s
> > > > > > T=5: 0m26.286s
> > > > > > T=6: 0m27.855s
> > > > > > T=7: 0m29.695s
> > > > > >
> > > > > > With runqueue locks:
> > > > > >
> > > > > > T=1: 0m15.802s
> > > > > > T=2: 0m22.484s
> > > > > > T=3: 0m24.751s
> > > > > > T=4: 0m29.134s
> > > > > > T=5: 0m30.094s
> > > > > > T=6: 0m33.090s
> > > > > > T=7: 0m33.897s
> > > > > >
> > > > > > So on 8 cores, taking spinlocks for each of the 8 runqueues adds about
> > > > > > 15% overhead when doing an IPI to 1 thread. Therefore, that won't be
> > > > > > pretty on 128+-core machines.
> > > > >
> > > > > But isn't the bulk of the overhead the IPIs rather than the runqueue
> > > > > locks?
> > > > >
> > > > > W/out RQ W/RQ % degradation
> > > > fix:
> > > > W/out RQ W/RQ ratio
> > > > > T=1: 13.91 15.8 1.14
> > > > > T=2: 20.73 22.48 1.08
> > > > > T=3: 21.47 24.75 1.15
> > > > > T=4: 27.95 29.13 1.04
> > > > > T=5: 26.29 30.09 1.14
> > > > > T=6: 27.86 33.09 1.19
> > > > > T=7: 29.7 33.9 1.14
> > > >
> > > > These numbers tell you that the degradation is roughly constant as we
> > > > add more threads (let's consider 1 thread per core, 1 IPI per thread,
> > > > with active threads). It is all run on an 8-core system with all cpus
> > > > active. As we increase the number of IPIs (e.g. T=2 -> T=7) we add 9s,
> > > > for 1.8s/IPI (always for 10,000,000 sys_membarrier() calls), for an
> > > > added 180 ns/core per call. (note: T=1 is a special-case, as I do not
> > > > allocate any cpumask.)
> > > >
> > > > Using the spinlocks adds about 3s for 10,000,000 sys_membarrier() calls
> > > > on an 8-core system, for an added 300 ns/core per call.
> > > >
> > > > So the overhead of taking the task lock is about twice as high, per core,
> > > > as the overhead of the IPIs. This is understandable if the
> > > > architecture does an IPI broadcast: the scalability problem then boils
> > > > down to exchanging cache lines to inform the IPI sender that the other
> > > > cpus have completed. An atomic operation exchanging a cache line would
> > > > be expected to fit within the irqoff+spinlock+spinunlock+irqon overhead.
> > >
> > > Let me rephrase the question... Isn't the vast bulk of the overhead
> > > something other than the runqueue spinlocks?
> >
> > I don't think so. What we have here is:
> >
> > O(1)
> > - a system call
> > - cpumask allocation
> > - IPI broadcast
> >
> > O(nr cpus)
> > - wait for IPI handlers to complete
> > - runqueue spinlocks
> >
> > The O(1) operations seem to be about 5x slower than the combined
> > O(nr cpus) wait and spinlock operations, but this only means that as
> > soon as we have 8 cores, the bulk of the overhead sits in the
> > runqueue spinlocks (if we have to take them).
> >
> > If we don't take spinlocks, then we can go up to 16 cores before the
> > bulk of the overhead starts to be the "wait for IPI handlers to
> > complete" phase. As you pointed out, we could turn this wait phase into
> > a tree hierarchy. However, we cannot do this with the spinlocks, as they
> > have to be taken for the initial cpumask iteration.
> >
> > Therefore, if we don't have to take those spinlocks, we can have a very
> > significant gain over this system call overhead, especially on large
> > systems. Not taking spinlocks here allows us to use a tree hierarchy to
> > turn the bulk of the scalability overhead (waiting for IPI handlers to
> > complete) into a O(log(nb cpus)) complexity order, which is quite
> > interesting.
>
> All this would sound plausible if it weren't for the ratio of overheads
> for runqueue-lock and non-runqueue-lock versions not varying much from
> one to seven CPUs. ;-)
Hrm, right. for_each_cpu(cpu, mm_cpumask(current->mm)) only iterates
over (and thus only takes the runqueue locks of) the CPUs running this
process's threads. So this overhead being constant is a bit unexpected,
unless for some weird reason the mm_cpumask always contains all cpus,
which I doubt.
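
(To make the loop we are talking about concrete, here is a minimal
sketch of the sampling phase, not the actual patch. The helper name
membarrier_build_mask() is made up, and the rq->lock/rq->curr field
names assume the kernel/sched.c internals of this kernel version.)

static void membarrier_build_mask(struct cpumask *tmpmask)
{
        int cpu;

        cpumask_clear(tmpmask);
        /* Only the cpus in mm_cpumask() are visited, not all online cpus. */
        for_each_cpu(cpu, mm_cpumask(current->mm)) {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                /* The rq lock orders the rq->curr sampling. */
                raw_spin_lock_irqsave(&rq->lock, flags);
                if (rq->curr->mm == current->mm)
                        cpumask_set_cpu(cpu, tmpmask);
                raw_spin_unlock_irqrestore(&rq->lock, flags);
        }
}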
>
> And please keep in mind that this operation happens on the URCU slowpath.
> Further, there may be opportunities for much larger savings by batching
> grace-period requests, given that a single grace period can serve an
> arbitrarily large number of synchronize_rcu() requests.
That's right.
>
> > > > > So if we had lots of CPUs, we might want to fan the IPIs out through
> > > > > intermediate CPUs in a tree fashion, but the runqueue locks are not
> > > > > causing excessive pain.
> > > >
> > > > A tree hierarchy may not be useful for sending the IPIs (as, hopefully,
> > > > they can be broadcast pretty efficiently), but it could be
> > > > useful for waiting efficiently for the IPIs to complete.
> > >
> > > OK, given that you precompute the CPU mask, you might be able to take
> > > advantage of hardware broadcast, on architectures having it.
> > >
> > > > > How does this compare to use of POSIX signals? Never mind, POSIX
> > > > > signals are arbitrarily bad if you have way more threads than are
> > > > > actually running at the time...
> > > >
> > > > POSIX signals to all threads are terrible in that they require waking
> > > > up all those threads. I have not even thought it useful to compare
> > > > these two approaches with benchmarks yet (I'll do that when the
> > > > sys_membarrier() support is implemented in liburcu).
> > >
> > > It would be of some interest. I bet that the runqueue spinlock overhead
> > > is -way- down in the noise by comparison to POSIX signals, even when all
> > > the threads are running. ;-)
> >
> > For 1,000,000 iterations, sending signals to execute a remote mb and
> > waiting for it to complete:
>
> Adding the previous results for comparison, and please keep in mind the
> need to multiply the left-hand column by ten before comparing to the
> right-hand column:
>
> > W/out RQ W/RQ ratio
> > T=1: 0m3.107s T=1: 13.91 15.8 1.14
> > T=2: 0m5.772s T=2: 20.73 22.48 1.08
> > T=3: 0m8.662s T=3: 21.47 24.75 1.15
> > T=4: 0m12.239s T=4: 27.95 29.13 1.04
> > T=5: 0m16.213s T=5: 26.29 30.09 1.14
> > T=6: 0m19.482s T=6: 27.86 33.09 1.19
> > T=7: 0m23.227s T=7: 29.7 33.9 1.14
>
> So sys_membarrier() is roughly a factor of two better than POSIX signals
> for a single thread, rising to not quite a factor of eight for seven
> threads. And this data -does- support the notion that POSIX signals
> get increasingly worse with increasing numbers of threads, as one
> would expect.
Yep.
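
(For reference, the signal-based scheme being measured here looks
roughly like the following userspace sketch. The names, the completion
counter and the fixed-size reader_tids array are illustrative only;
liburcu's actual implementation differs.)

#include <pthread.h>
#include <signal.h>
#include <string.h>

#define SIGURCU SIGUSR1

static pthread_t reader_tids[64];       /* registered reader threads */
static int nr_readers;
static volatile int nr_done;

static void urcu_mb_sig(int sig)
{
        (void)sig;
        __sync_synchronize();           /* the barrier we want run remotely */
        __sync_fetch_and_add(&nr_done, 1);
}

static void setup_urcu_signal(void)
{
        struct sigaction act;

        memset(&act, 0, sizeof(act));
        act.sa_handler = urcu_mb_sig;
        sigemptyset(&act.sa_mask);
        sigaction(SIGURCU, &act, NULL);
}

static void force_mb_all_threads(void)
{
        int i;

        nr_done = 0;
        __sync_synchronize();
        /* Every thread gets woken up, even the ones that were sleeping. */
        for (i = 0; i < nr_readers; i++)
                pthread_kill(reader_tids[i], SIGURCU);
        while (nr_done < nr_readers)    /* flat wait for all handlers */
                ;
        __sync_synchronize();
}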
>
> > So, per iteration:
> >
> > T=1: 3107 ns
> > T=2: 5772 ns
> > T=3: 8662 ns
> > T=4: 12239 ns
> > T=5: 16213 ns
> > T=6: 19482 ns
> > T=7: 23227 ns
> >
> > That is an added 3000 ns per extra thread. So, yes, the added 300 ns/core
> > for spinlocks is almost lost in the noise compared to the signal-based
> > solution, but the old solution behaving so poorly does not mean we can
> > rely on it to decide what is noise and what is not in the current
> > implementation. Looking at what the scalability bottlenecks are, and at
> > what is noise within the current implementation itself, seems like a
> > more appropriate way to design an efficient system call.
>
> I agree that your measurements show a marked improvement compared to
> POSIX signals that increases with increasing numbers of threads, again,
> as expected.
I knew it! Why did we need a proof again? (just kidding) ;)
>
> > So all in all, we can expect around a 6.25-fold improvement over the
> > signal-based scheme if we use the spinlocks (480 ns/core vs
> > 3000 ns/core), but if we don't take the runqueue spinlocks (180
> > ns/core), then we are contemplating a 16.7-fold improvement. And this is
> > without considering a tree-hierarchy for waiting for IPIs to complete,
> > which would additionally change the order of the scalability overhead
> > from O(n) to O(log(n)).
>
> Eh? You should be able to use the locks to accumulate a cpumask, then
> use whatever you want, including a tree hierarchy, to both send and wait
> for the IPIs, right?
Yes, although I don't see that a tree hierarchy is useful at all when
the architecture supports IPI broadcast efficiently. It's only useful
when waiting for these IPIs to complete.
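
(Concretely, the flat wait phase that a hierarchy would shorten can be
pictured as the sketch below: each IPI handler executes the barrier and
touches one shared cache line, and the sender spins until the count
reaches zero. The names are illustrative; smp_call_function_many()
with wait=1 already provides an equivalent flat wait internally.)

static atomic_t membarrier_pending;

static void membarrier_ipi(void *info)
{
        smp_mb();                       /* the barrier requested by the caller */
        atomic_dec(&membarrier_pending);
}

static void membarrier_send_and_wait(const struct cpumask *mask)
{
        /* Assumes preemption disabled and the current cpu not in mask. */
        atomic_set(&membarrier_pending, cpumask_weight(mask));
        smp_mb();                       /* order the count store before the IPIs */

        smp_call_function_many(mask, membarrier_ipi, NULL, 0);

        /* O(nr cpus) flat wait: every responder writes the same cache line. */
        while (atomic_read(&membarrier_pending) > 0)
                cpu_relax();
}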
>
> Keep in mind that the runqueue locks just force memory ordering on the
> sampling. It should not be necessary to hold them while sending the
> IPIs. Or am I missing something?
Yes, that's true. We're just talking about a constant cost per thread
here (taking/releasing the runqueue spinlock).
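
(In other words, the overall ordering being discussed would look like
the sketch below, reusing the illustrative membarrier_build_mask()
helper from earlier in this mail: each runqueue lock is held only while
sampling rq->curr, never across the IPIs. The real RFC patch differs in
its details.)

static void membarrier_mb_ipi(void *info)
{
        smp_mb();               /* execute the requested barrier remotely */
}

SYSCALL_DEFINE0(membarrier)
{
        cpumask_var_t tmpmask;

        if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL))
                return -ENOMEM;

        membarrier_build_mask(tmpmask);  /* rq locks taken per cpu, then dropped */

        preempt_disable();
        /* No runqueue lock is held while sending or waiting for the IPIs. */
        smp_call_function_many(tmpmask, membarrier_mb_ipi, NULL, 1);
        preempt_enable();

        free_cpumask_var(tmpmask);
        return 0;
}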
Mathieu
>
> Thanx, Paul
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68