linux-kernel - Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets

Message-ID: <1348727665.7059.160.camel@marge.simpson.net>
Date:	Thu, 27 Sep 2012 08:34:25 +0200
From:	Mike Galbraith <efault@....de>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Borislav Petkov <bp@...en8.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mel Gorman <mgorman@...e.de>,
	Nikolay Ulyanitsky <lystor@...il.com>,
	linux-kernel@...r.kernel.org,
	Andreas Herrmann <andreas.herrmann3@....com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Suresh Siddha <suresh.b.siddha@...el.com>
Subject: Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to
 3.6-rc5 on AMD chipsets - bisected

On Thu, 2012-09-27 at 07:47 +0200, Ingo Molnar wrote: 
> * Mike Galbraith <efault@....de> wrote:
> 
> > I think the pgbench problem is more about latency for the 1 in 
> > 1:N than spinlocks.
> 
> So my understanding of the psql workload is that basically we've 
> got a central psql proxy process that is distributing work to 
> worker psql processes. If a freshly woken worker process ever 
> preempts the central proxy process then it is preventing a lot 
> of new work from getting distributed.
> 
> Correct?

Yeah, that's my understanding of the thing, and I played with it quite a
bit in the past (I've only briefly refreshed my memory of it recently).
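
To make the 1:N argument concrete, here is a toy model of it. It is only
a sketch and is not taken from any measurement in this thread: the
function name, the tick-based clock and the assumption that remote CPUs
are always idle are all invented for illustration. It counts how many
jobs one dispatcher can hand out when each freshly woken worker either
preempts it or runs elsewhere.

```python
def jobs_dispatched(ticks, job_cost, preempt_dispatcher):
    """Count jobs a single dispatcher hands out within `ticks` time units.

    Toy assumptions (hypothetical, not measured):
      - handing out one job costs the dispatcher exactly 1 tick;
      - if preempt_dispatcher is True, the woken worker runs on the
        dispatcher's CPU for `job_cost` ticks before the dispatcher resumes;
      - if False, the worker runs on some other idle CPU and the
        dispatcher keeps going without delay.
    """
    now = 0
    dispatched = 0
    while now < ticks:
        now += 1                # dispatcher spends one tick handing out a job
        dispatched += 1
        if preempt_dispatcher:
            now += job_cost     # dispatcher sits on the runqueue meanwhile
    return dispatched


if __name__ == "__main__":
    print(jobs_dispatched(100, 4, preempt_dispatcher=True))
    print(jobs_dispatched(100, 4, preempt_dispatcher=False))
```

In this model every preemption costs the dispatcher a whole job's worth
of time, so the other workers have nothing to do while it waits. Real
schedulers and real pgbench are far messier than this.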

> So the central proxy psql process is 'much more important' to 
> run than any of the worker processes - an importance that is not 
> (currently) visible from the behavioral statistics the scheduler 
> keeps on tasks.

Yeah.  We had the adaptive waker thing, but it stopped being a winner at
the one load it originally helped quite a lot, and it didn't help
pgbench all that much in its then form anyway, iirc.

> So the scheduler has the following problem here: a new wakee 
> might be starved enough and the proxy might have run long enough 
> to really justify the preemption here and now. The buddy 
> statistics help avoid some of these cases - but not all and the 
> difference is measurable.
> 
> Yet the 'best' way for psql to run is for this proxy process to 
> never be preempted. Your SCHED_BATCH experiments confirmed that.

Yes.

> The way remote CPU selection affects it is that if we ever get 
> more aggressive in selecting a remote CPU then we, as a side 
> effect, also reduce the chance of harmful preemption of the 
> central proxy psql process.

Right.

> So in that sense sibling selection is somewhat of an indirect 
> red herring: it really only helps psql indirectly by preventing 
> the harmful preemption. It also, somewhat paradoxically, argues 
> for suboptimal code: for example tearing apart buddies is 
> beneficial in the psql workload, because it also allows the more 
> important part of the buddy to run more (the proxy).

Yes, I believe preemption dominates, but it's not alone; you can see
that in the numbers.

> In that sense the *real* problem isn't even parallelism (although 
> we obviously should improve the decisions there - and the logic 
> has suffered in the past from the psql dilemma outlined above), 
> but whether the scheduler can (and should) identify the central 
> proxy and keep it running as much as possible, deprioritizing 
> fairness, wakeup buddies, runtime overlap and cache affinity 
> considerations.
> 
> There are two broad solutions that I can see:
> 
>  - Add a kernel solution to somehow identify 'central' processes
>    and bias them. Xorg is a similar kind of process, so it would
>    help other workloads as well. That way lie dragons, but might
>    be worth an attempt or two. We already try to do a couple of
>    robust metrics, like overlap statistics to identify buddies.

What we do now works well for X and friends I think, because there
aren't so many buddies.  It might work better though, and for the same
reasons.  I've in fact [re]invented a SCHED_SERVER class a few times,
but never one that survived my own scrutiny for long.

Arrr, here there be dragons is true ;-)

>  - Let user-space occasionally identify its important (and less
>    important) tasks - say psql could mark its worker processes as
>    SCHED_BATCH and keep its central process(es) higher prio. A
>    single line of obvious code in 100 KLOCs of user-space code.
> 
> Just to confirm, if you turn off all preemption via a hack 
> (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
> perform and scale much better, with the quality of sibling 
> selection and spreading of processes only being a secondary 
> effect?

That has always been the case here.  Preemption dominates.  Others
should play with it too, and let their boxen speak.
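
For anyone who wants to try the user-space variant, a minimal sketch of
a worker demoting itself to SCHED_BATCH follows. It assumes Linux and
Python's os module; the helper name mark_as_batch is made up for
illustration and is not anything PostgreSQL provides.

```python
import os


def mark_as_batch(pid=0):
    """Switch `pid` (0 means the calling process) to SCHED_BATCH.

    Returns the policy the kernel reports afterwards. Linux only.
    SCHED_BATCH uses static priority 0, the same as SCHED_OTHER.
    """
    os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
    return os.sched_getscheduler(pid)


if __name__ == "__main__":
    # Hypothetical worker startup: demote ourselves, then do the work.
    policy = mark_as_batch()
    print("now SCHED_BATCH:", policy == os.SCHED_BATCH)
```

Each worker would call this once at startup while the central process
stays SCHED_OTHER. Moving from SCHED_OTHER to SCHED_BATCH should not
need any privileges.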

-Mike
