Message-ID: <1332155527.18960.292.camel@twins>
Date:	Mon, 19 Mar 2012 12:12:07 +0100
From:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
To:	Avi Kivity <avi@...hat.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...e.hu>, Paul Turner <pjt@...gle.com>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	Mike Galbraith <efault@....de>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Dan Smith <danms@...ibm.com>,
	Bharata B Rao <bharata.rao@...il.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC][PATCH 00/26] sched/numa

On Mon, 2012-03-19 at 11:57 +0200, Avi Kivity wrote:
> On 03/16/2012 04:40 PM, Peter Zijlstra wrote:
> > The home-node migration handles both cpu and memory (anonymous only for now) in
> > an integrated fashion. The memory migration uses migrate-on-fault to avoid
> > doing a lot of work from the actual numa balancer kernel thread and only
> > migrates the active memory.
> >
> 
> IMO, this needs to be augmented with eager migration, for the following
> reasons:
> 
> - lazy migration adds a bit of latency to page faults

That's intentional, it keeps the work accounted to the tasks that need
it.

> - doesn't work well with large pages

That's for someone who cares about large pages to sort, isn't it? Also,
I thought you virt people only used THP anyway, and those work just fine
(they get broken down, and presumably something will build them back up
on the other side).

[ note that I equally dislike the THP daemon, I would have much
preferred that to be fault driven as well. ]

> - doesn't work with dma engines

How does that work anyway? You'd have to reprogram your dma engine, so
either the ->migratepage() callback does that and we're good either way,
or it simply doesn't work at all.

> So I think that in addition to migrate on fault we need a background
> thread to do eager migration.  We might prioritize pages based on the
> active bit in the PDE (cheaper to clear and scan than the PTE, but gives
> less accurate information).

I absolutely loathe background threads and page table scanners and will
do pretty much everything to avoid them.

The problem I have with farming work out to other entities is that it's
thereafter terribly hard to account it back to whoever caused the
actual work. Suppose your kworker thread consumes a lot of cpu time --
this time is then obviously not available to your application -- but how
do you find out what/who is causing this and cure it?

As to page table scanners, I simply don't see the point. They tend to
require arch support (I see aa introduces yet another PTE bit -- this
instantly limits the usefulness of the approach as lots of archs don't
have spare bits).

Also, if you go scan memory, you need some storage -- see how aa grows
struct page, sure he wants to move that storage some place else, but the
memory overhead is still there -- this means less memory to actually do
useful stuff in (it also probably means more cache-misses since his
proposed shadow array in pgdat is someplace else).

Also, the only really 'hard' case for the whole auto-numa business is
single processes that are bigger than a single node -- and those, I
posit, are 'rare'.

Now if you want to be able to scan per-thread, you need per-thread
page-tables and I really don't want to ever see that. That will blow up
memory overhead and context switch times.

I guess you can limit the impact by only running the scanners on
selected processes, but that requires you add interfaces and then either
rely on admins or userspace to second guess application developers.

So no, I don't like that at all.

I'm still reading aa's patch, I haven't actually found anything I like
or agree with in there, but who knows, there's still some way to go.
