Message-ID: <alpine.DEB.2.02.1209270935080.5162@asgard.lang.hm>
Date: Thu, 27 Sep 2012 09:48:19 -0700 (PDT)
From: david@...g.hm
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Borislav Petkov <bp@...en8.de>, Mike Galbraith <efault@....de>,
Mel Gorman <mgorman@...e.de>,
Nikolay Ulyanitsky <lystor@...il.com>,
linux-kernel@...r.kernel.org,
Andreas Herrmann <andreas.herrmann3@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
Suresh Siddha <suresh.b.siddha@...el.com>
Subject: Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to
3.6-rc5 on AMD chipsets - bisected
On Thu, 27 Sep 2012, Peter Zijlstra wrote:
> On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
>>
>> For example, it starts with the maximum target scheduling domain, and
>> works its way in over the scheduling groups within that domain. What
>> the f*ck is the logic of that kind of crazy thing? It never makes
>> sense to look at the biggest domain first.
>
> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains,
> it only has the LLC domain; if you want more we'll need to fix that.
> For now it's a fixed:
>
> SMT
> MC (llc)
> CPU (package/machine-for-!numa)
> NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.

I think you are being too smart for your own good. You don't know
whether it's best to move them further apart or not; I'm arguing that
you can't know. So I'm saying do the simple thing.

If a core is overloaded, move a task to an idle core that is as close
as possible to the core you started from (sharing as much as possible).

If this does not overload the shared resource, you did the right thing.

If it does overload the shared resource, you're still no worse off than
leaving the task on the original core (which shared everything, so
you've reduced the sharing a little bit).

On the next balancing cycle you work to move something again, and since
both the original core and the new core show up as overloaded (due to
the contention on the shared resource), you move something to another
core that shares just a little less.
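
In rough pseudocode, the per-cycle logic I'm describing would be
something like this (nearest_idle_cpu() and migrate_one_task() are
made-up helper names for the sketch, not the real scheduler API):

  int nearest_idle_cpu(int cpu);           /* closest idle CPU, -1 if none */
  void migrate_one_task(int src, int dst); /* move one runnable task over  */

  void balance_one_cycle(int overloaded_cpu)
  {
          int dst = nearest_idle_cpu(overloaded_cpu);

          if (dst < 0)
                  return;         /* nowhere idle to go, leave it alone */

          /*
           * Always take the shortest (cheapest) move available.  If the
           * shared resource at this level is still contended, both CPUs
           * show up as overloaded again next cycle, and the nearest
           * idle CPU is then one sharing level further out.
           */
          migrate_one_task(overloaded_cpu, dst);
  }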

Yes, this means it may take more balancing cycles to move things far
enough apart to reduce the sharing enough to avoid overloading the
shared resource, but I don't see any way that you can possibly guess
ahead of time whether two processes are going to overload a shared
resource.

It may be that simply moving to an HT sibling (no longer contending for
registers) is enough to let both processes fly, or the overload may be
in a shared floating-point unit or the L1 cache and you need to move
further away, or you may find the contention is in the L2 cache and
need to move further still, or it could be in the L3 cache, or it could
be in the memory interface (NUMA).
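
To spell out what each step outward stops sharing (again, just labels
for the sketch above, not scheduler terms):

  enum move_distance {
          STAY_PUT,           /* still sharing everything               */
          TO_SMT_SIBLING,     /* no longer contending for registers     */
          TO_OTHER_CORE,      /* no longer sharing the FPU or L1 cache  */
          TO_OTHER_L2_GROUP,  /* no longer sharing the L2 cache         */
          TO_OTHER_PACKAGE,   /* no longer sharing the L3 cache         */
          TO_OTHER_NODE,      /* no longer sharing the memory interface */
  };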

Without being able to predict the future, you don't know how far away
you need to move the tasks to have them operate at the optimal level.
All that you do know is that the shorter the move, the less expensive
the move. So make each move as short as possible, and measure again to
see whether that was enough.

For some workloads it will be. For many workloads the least expensive
move won't be enough.

The question is whether doing multiple cheap moves (requiring only
simple checking for each move) ends up being a win compared to doing
better guessing about when the more expensive moves are worth it.

Given how chips change from year to year, I don't see how the 'better
guessing' is going to survive more than a couple of chip releases in
any case.

David Lang