[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <200710271558.37514.phillips@phunq.net>
Date: Sat, 27 Oct 2007 15:58:36 -0700
From: Daniel Phillips <phillips@...nq.net>
To: Christoph Lameter <clameter@....com>
Cc: Pavel Machek <pavel@....cz>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
dkegel@...gle.com, David Miller <davem@...emloft.net>,
Nick Piggin <npiggin@...e.de>
Subject: Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)
On Friday 26 October 2007 10:55, Christoph Lameter wrote:
> On Fri, 26 Oct 2007, Pavel Machek wrote:
> > > And, _no_, it does not necessarily mean global serialisation. By
> > > simply saying there must be N pages available I say nothing about
> > > on which node they should be available, and the way the
> > > watermarks work they will be evenly distributed over the
> > > appropriate zones.
> >
> > Agreed. Scalability of emergency swapping reserved is simply
> > unimportant. Please, lets get swapping to _work_ first, then we can
> > make it faster.
>
> Global reserve means that any cpuset that runs out of memory may
> exhaust the global reserve and thereby impact the rest of the system.
> The emergencies that are currently localized to a subset of the
> system and may lead to the failure of a job may now become global and
> lead to the failure of all jobs running on it.
If it does, it is a bug in the reserve accounting. That said, I still
agree with you that per-node reserve is a desirable goal for numa. I
would just like to be clear that it is not necessary, even for numa,
just nice. By all means somebody should be hacking on a numa feature
for per-node emergency reserves, but as far as fixing the immediate,
serious kernel block IO deadlocks goes, it does not matter.
Pavel, I do not agree that efficiency is unimportant on the
under-pressure path. I do not even like to call that the "emergency"
path, because under heavy load it is normal for a machine to spend a
significant fraction of its time in that state. However, the
efficiency goal there does not need to be quite the same as normal
mode.
To illustrate, I would expect to see something like 95% of normal block
IO performance on a numa machine in the case that "emergency" (aka
memalloc memory) is allocated globally instead of locally, thus paying
a (modest compared to the disk transfer itself) penalty for transfer of
disk data over the numa interconnect. 95% of normal throughput on the
block IO path is not a problem: if the machine spends 5% of its time on
the "emergency" (aka memalloc) path, then overall efficiency will be
95% * 95% = 99.75%.
Moral of this story: let's get the memory recursion fixes done in the
most obviously correct way and not get distracted by illusory
efficiency requirements for numa, that do not have a big bottom line
impact.
I'm glad to see everybody still interested in these problems. Though we
have been a little quiet on this issue over here for a while, it does
not mean that progress has stopped. In fact, we are testing our
solutions more heavily than ever, and getting closer to a solution that
not only works solidly, but that should enable mass deletion of the
whole creaky notion of dirty page limits in favor of nice, tight
per-device control of in flight write traffic as I have described
previously.
Regards,
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists