Message-ID: <alpine.DEB.2.02.1209270935080.5162@asgard.lang.hm>
Date: Thu, 27 Sep 2012 09:48:19 -0700 (PDT)
From: david@...g.hm
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Borislav Petkov <bp@...en8.de>, Mike Galbraith <efault@....de>,
Mel Gorman <mgorman@...e.de>,
Nikolay Ulyanitsky <lystor@...il.com>,
linux-kernel@...r.kernel.org,
Andreas Herrmann <andreas.herrmann3@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
Suresh Siddha <suresh.b.siddha@...el.com>
Subject: Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to
3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Peter Zijlstra wrote:
> On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
>>
>> For example, it starts with the maximum target scheduling domain, and
>> works its way in over the scheduling groups within that domain. What
>> the f*ck is the logic of that kind of crazy thing? It never makes
>> sense to look at the biggest domain first.
>
> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains,
> it only has the LLC domain; if you want more we'll need to fix that.
> For now it's a fixed:
>
> SMT
> MC (llc)
> CPU (package/machine-for-!numa)
> NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.

I think you are being too smart for your own good. You don't know
whether it's best to move them further apart or not; I'm arguing that
you can't know. So I'm saying do the simple thing.

If a core is overloaded, move a task to an idle core that is as close
as possible to the core you started from (sharing as much as possible).

If this does not overload the shared resource, you did the right thing.

If it does overload the shared resource, you're still no worse off than
leaving the task on the original core (which shared everything, so
you've reduced the sharing a little bit).

On the next balancing cycle you work to move something again, and since
both the original core and the new core show up as overloaded (due to
the contention on the shared resource), you move something to another
core that shares just a little less.
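
In rough pseudocode, the per-cycle logic I'm describing would be
something like this (nearest_idle_cpu() and migrate_one_task() are
made-up helper names for the sketch, not the real scheduler API):

  int nearest_idle_cpu(int cpu);           /* closest idle CPU, -1 if none */
  void migrate_one_task(int src, int dst); /* move one runnable task over  */

  void balance_one_cycle(int overloaded_cpu)
  {
          int dst = nearest_idle_cpu(overloaded_cpu);

          if (dst < 0)
                  return;         /* nowhere idle to go, leave it alone */

          /*
           * Always take the shortest (cheapest) move available.  If the
           * shared resource at this level is still contended, both CPUs
           * show up as overloaded again next cycle, and the nearest
           * idle CPU is then one sharing level further out.
           */
          migrate_one_task(overloaded_cpu, dst);
  }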

Yes, this means it may take more balancing cycles to move things far
enough apart to reduce the sharing enough to avoid overloading the
shared resource, but I don't see any way that you can possibly guess
ahead of time whether two processes are going to overload a shared
resource.

It may be that simply moving to an HT sibling (no longer contending for
registers) is enough to let both processes fly, or the overload may be
in a shared floating-point unit or the L1 cache and you need to move
further away, or you may find the contention is in the L2 cache and
need to move further still, or it could be in the L3 cache, or it could
be in the memory interface (NUMA).
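
To spell out what each step outward stops sharing (again, just labels
for the sketch above, not scheduler terms):

  enum move_distance {
          STAY_PUT,           /* still sharing everything               */
          TO_SMT_SIBLING,     /* no longer contending for registers     */
          TO_OTHER_CORE,      /* no longer sharing the FPU or L1 cache  */
          TO_OTHER_L2_GROUP,  /* no longer sharing the L2 cache         */
          TO_OTHER_PACKAGE,   /* no longer sharing the L3 cache         */
          TO_OTHER_NODE,      /* no longer sharing the memory interface */
  };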

Without being able to predict the future, you don't know how far away
you need to move the tasks to have them operate at the optimal level.
All that you do know is that the shorter the move, the less expensive
the move. So make each move as short as possible, and measure again to
see whether that was enough.

For some workloads it will be. For many workloads the least expensive
move won't be enough.

The question is whether doing multiple cheap moves (requiring only
simple checking for each move) ends up being a win compared to doing
better guessing about when the more expensive moves are worth it.

Given how chips change from year to year, I don't see how the 'better
guessing' is going to survive more than a couple of chip releases in
any case.

David Lang