lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 26 Sep 2012 11:19:42 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Borislav Petkov <bp@...en8.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mike Galbraith <efault@....de>, Mel Gorman <mgorman@...e.de>,
	Nikolay Ulyanitsky <lystor@...il.com>,
	linux-kernel@...r.kernel.org,
	Andreas Herrmann <andreas.herrmann3@....com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	Suresh Siddha <suresh.b.siddha@...el.com>
Subject: Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to
 3.6-rc5 on AMD chipsets - bisected

On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <bp@...en8.de> wrote:
> On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case. Sad, because it really would be lovely to just remove that
>> thing ;)
>
> Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your "kill select_idle_sibling()" one was
interesting, but the wrong kind of "get rid of that logic".

It always selected target_cpu, but the fact is, that doesn't really
sound very sane. The target cpu is either the previous cpu or the
current cpu, depending on whether they should be balanced or not. But
that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying "does this group actually cover the cpu I am
interested in?"

Please tell me I am mis-reading this?

But starting from the biggest ("llc" group) is wrong *anyway*, since
it means that it starts looking at the L3 level, and then if it finds
an acceptable cpu inside that level, it's all done. But that's
*crazy*. Once again, it's much better to try to find an idle sibling
*closeby* rather than at the L3 level. No? So once again, we should
start at the inner level and if we can't find something really close,
we work our way out, rather than starting from the outer level and
working our way in.

If I read the code correctly, we can have both "prev" and "cpu" in the
same L2 domain, but because we start looking at the L3 domain, we may
end up picking another "affine" CPU that isn't even sharing L2's
*before* we pick one that actually *is* sharing L2's with the target
CPU. But that code is confusing enough with the scheduler groups inner
loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read
things correctly.

For example, it uses "cpu_idle(target)", but if we're actively trying
to move to the current CPU (ie wake_affine() returned true), then
target is the current cpu, which is certainly *not* going to be idle
for a sync wakeup. So it should actually check whether it's a sync
wakeup and the only thing pending is that synchronous waker, no?

Maybe I'm missing something really fundamental, but it all really does
look very odd to me.

Attached is a totally untested and probably very buggy patch, so
please consider it a "shouldn't we do something like this instead" RFC
rather than anything serious. So this RFC patch is more a "ok, the
patch tries to fix the above oddnesses, please tell me where I went
wrong" than anything else.

Comments?

                    Linus

Download attachment "patch.diff" of type "application/octet-stream" (3060 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ