Message-ID: <alpine.LFD.2.00.1001080911340.7821@localhost.localdomain>
Date: Fri, 8 Jan 2010 09:22:14 -0800 (PST)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Peter Zijlstra <peterz@...radead.org>
cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Minchan Kim <minchan.kim@...il.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, cl@...ux-foundation.org,
"hugh.dickins" <hugh.dickins@...cali.co.uk>,
Nick Piggin <nickpiggin@...oo.com.au>,
Ingo Molnar <mingo@...e.hu>
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()

On Fri, 8 Jan 2010, Peter Zijlstra wrote:
> On Tue, 2010-01-05 at 20:20 -0800, Linus Torvalds wrote:
> >
> > Yeah, I should have looked more at your callchain. That's nasty. Much
> > worse than the per-mm lock. I thought the page buffering would avoid the
> > zone lock becoming a huge problem, but clearly not in this case.
>
> Right, so I ran some numbers on a multi-socket (2) machine as well:
>
>                                    pf/min
>
> -tip                             56398626
> -tip + xadd                     174753190
> -tip + speculative              189274319
> -tip + xadd + speculative       200174641
>
> [ variance is around 0.5% for this workload, ran most of these numbers
> with --repeat 5 ]

That's a huge jump. It's clear that the spinlock-based rwsems simply
suck. The speculation gets rid of some additional mmap_sem contention,
but at least for two sockets it looks like the rwsem implementation was
the biggest problem by far.

> At both the xadd/speculative point the workload is dominated by the
> zone->lock, the xadd+speculative removes some of the contention, and
> removing the various RSS counters could yield another few percent
> according to the profiles, but then we're pretty much there.

I don't know if worrying about a few percent is worth it. "Perfect is the
enemy of good", and the workload is pretty dang artificial with the whole
"remove pages and re-fault them as fast as you can".

So the benchmark is pointless and extreme, and I think it's not worth
worrying too much about details. Especially when compared to the
*three-fold* jump from the fairly trivial rwsem implementation change
(with speculation on top of it then adding another 15% improvement -
nothing to sneeze at, but it's still in a different class).

Of course, larger numbers of sockets will likely change the situation, but
at the same time I do suspect that workloads designed for hundreds of
cores will need to try to behave better than that benchmark anyway ;)

> One way around those RSS counters is to track RSS per task; a quick grep
> shows it's only the oom-killer and proc that use them.
>
> A quick hack removing them gets us: 203158058
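
As a rough illustration of the per-task idea in the quoted text (not the
actual kernel patch): each thread updates a private slot, so writers never
contend on a shared counter, and the rare readers (the oom-killer and proc
in the kernel case) pay for summing the slots. The names PerTaskCounter,
add, read and run_demo are hypothetical, and Python threads stand in for
kernel tasks purely to keep the sketch runnable.

```python
import threading


class PerTaskCounter:
    """Hypothetical sketch: one private slot per thread, summed only on read."""

    def __init__(self):
        self._slots = []                   # one [count] cell per thread seen so far
        self._register = threading.Lock()  # taken on first update and on read only
        self._local = threading.local()    # lets each thread find its own cell

    def add(self, delta):
        slot = getattr(self._local, "slot", None)
        if slot is None:
            # First update from this thread: publish a new private cell.
            slot = [0]
            with self._register:
                self._slots.append(slot)
            self._local.slot = slot
        # Only the owning thread ever writes this cell, so no lock is needed.
        slot[0] += delta

    def read(self):
        # Readers are rare, so they bear the cost of walking every cell.
        with self._register:
            return sum(slot[0] for slot in self._slots)


def run_demo(n_threads, n_updates):
    """Have n_threads each add 1 to the counter n_updates times; return the total."""
    counter = PerTaskCounter()

    def worker():
        for _ in range(n_updates):
            counter.add(1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read()
```

The trade-off is that a read taken while writers are still running may lag
slightly behind, which seems acceptable for consumers that only need an
approximate figure.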

Yeah, well.. After that 200% and 15% improvement, a 1.5% improvement on a
totally artificial benchmark looks less interesting.

Because let's face it - if your workload does several million page faults
per second, you're just doing something fundamentally _wrong_.
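
For reference, the percentages and the "several million page faults per
second" figure can be rechecked from the pf/min numbers quoted in this
thread. This is only a small arithmetic sketch: the dictionary keys and the
helper names gain_percent and faults_per_second are made up for
illustration, while the numbers themselves are copied from the quoted
results.

```python
# pf/min figures copied from the quoted benchmark results.
PF_PER_MIN = {
    "tip": 56398626,
    "tip + xadd": 174753190,
    "tip + speculative": 189274319,
    "tip + xadd + speculative": 200174641,
    "tip + xadd + speculative, RSS counters removed": 203158058,
}


def gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def faults_per_second(pf_per_min):
    """Convert a page-faults-per-minute figure to page faults per second."""
    return pf_per_min / 60.0


# The three steps discussed in the reply, as (label, before, after).
STEPS = [
    ("rwsem change (xadd) over -tip",
     "tip", "tip + xadd"),
    ("speculation on top of xadd",
     "tip + xadd", "tip + xadd + speculative"),
    ("removing the RSS counters",
     "tip + xadd + speculative",
     "tip + xadd + speculative, RSS counters removed"),
]

for label, before, after in STEPS:
    gain = gain_percent(PF_PER_MIN[before], PF_PER_MIN[after])
    print(f"{label}: {gain:.1f}% more page faults per minute")

best = PF_PER_MIN["tip + xadd + speculative, RSS counters removed"]
print(f"best case: {faults_per_second(best) / 1e6:.2f} million faults/sec")
```

The three steps come out at roughly 210%, 14.5% and 1.5%, and the best case
at a little over 3 million page faults per second.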

		Linus