Message-Id: <20070116230034.b8cb4263.akpm@osdl.org>
Date: Tue, 16 Jan 2007 23:00:34 -0800
From: Andrew Morton <akpm@...l.org>
To: Christoph Lameter <clameter@....com>
Cc: menage@...gle.com, linux-kernel@...r.kernel.org,
nickpiggin@...oo.com.au, linux-mm@...ck.org, ak@...e.de,
pj@....com, dgc@....com
Subject: Re: [RFC 0/8] Cpuset aware writeback
On Tue, 16 Jan 2007 22:27:36 -0800 (PST) Christoph Lameter <clameter@....com> wrote:
> On Tue, 16 Jan 2007, Andrew Morton wrote:
>
> > > Yes, this is the result of the hierarchical nature of cpusets, which
> > > already causes issues with the scheduler. It is rather typical that
> > > cpusets are used to partition the memory and cpus. Overlapping cpusets
> > > seem to have mainly an administrative function. Paul?
> >
> > The typical usage scenarios don't matter a lot: the examples I gave show
> > that the core problem remains unsolved. People can still hit the bug.
>
> I agree the overlap issue is a problem and I hope it can be addressed
> somehow for the rare cases in which such nesting takes place.
>
> One easy solution may be to check the dirty ratio before engaging in
> reclaim. If the dirty ratio is sufficiently high, then trigger writeout via
> pdflush (we already wake up pdflush while scanning, and you already noted
> that pdflush writeout is not occurring within the context of the current
> cpuset) and pass over any dirty pages during LRU scans until some pages
> have been cleaned up.
>
> This means we allow allocation of additional kernel memory outside of the
> cpuset while triggering writeout of inodes that have pages on the nodes
> of the cpuset. The memory directly used by the application is still
> limited; only the temporary information needed for writeback is allocated
> outside.
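
For illustration, a standalone toy model of the proposed dirty-ratio check;
the struct cpuset_stats accounting, helper names, and thresholds below are
all invented stand-ins, not actual kernel code:

/* Toy model of "skip dirty pages when the cpuset's dirty ratio is too
 * high".  In the kernel this logic would sit in the LRU scan path. */
#include <stdio.h>
#include <stdbool.h>

struct page { bool dirty; };

/* Hypothetical per-cpuset accounting. */
struct cpuset_stats {
	unsigned long dirty_pages;
	unsigned long total_pages;
	unsigned int dirty_ratio;	/* percent, like vm.dirty_ratio */
};

static bool dirty_ratio_exceeded(const struct cpuset_stats *cs)
{
	return cs->dirty_pages * 100 > cs->total_pages * cs->dirty_ratio;
}

/* Return true if reclaim should pass over this page for now; this is
 * the point where pdflush would be asked to write back inodes that
 * have pages on the cpuset's nodes. */
static bool skip_dirty_page(const struct page *pg,
			    const struct cpuset_stats *cs)
{
	return pg->dirty && dirty_ratio_exceeded(cs);
}

int main(void)
{
	struct cpuset_stats cs = {
		.dirty_pages = 60, .total_pages = 100, .dirty_ratio = 40,
	};
	struct page pg = { .dirty = true };

	printf("pass over dirty page: %s\n",
	       skip_dirty_page(&pg, &cs) ? "yes" : "no");
	return 0;
}
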
Gad. None of that should be necessary.
> Well, that still sounds somewhat like a hack. Any other ideas out there?
Do what blockdevs do: limit the number of in-flight requests (Peter's
recent patch seems to be doing that for us), perhaps only when PF_MEMALLOC
is in effect to keep Trond happy, and implement a mempool for the NFS
request critical store; a toy sketch of the mempool idea follows the list
below. Additionally:
- we might need to twiddle the NFS gfp_flags so it doesn't call the
oom-killer on failure: just return NULL.
- consider going off-cpuset for critical allocations. It's better than
going oom. A suitable implementation might be to ignore the caller's
cpuset if PF_MEMALLOC. Maybe put a WARN_ON_ONCE in there: we prefer that
it not happen and we want to know when it does.
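
A toy userspace model of such a mempool, assuming nothing about the real
mempool_create()/mempool_alloc() API beyond the general principle of a
preallocated reserve that guarantees forward progress when normal
allocation fails:

#include <stdlib.h>

#define RESERVE 4

struct toy_mempool {
	void *reserve[RESERVE];
	int nr_free;
	size_t objsize;
};

/* Preallocate the reserve up front, while memory is still available. */
static int toy_mempool_init(struct toy_mempool *mp, size_t objsize)
{
	mp->objsize = objsize;
	for (mp->nr_free = 0; mp->nr_free < RESERVE; mp->nr_free++) {
		mp->reserve[mp->nr_free] = malloc(objsize);
		if (!mp->reserve[mp->nr_free])
			return -1;	/* toy code: earlier objects leak */
	}
	return 0;
}

/* Try the normal allocator first; on failure, fall back to the reserve
 * instead of failing outright (or, in the kernel, going oom). */
static void *toy_mempool_alloc(struct toy_mempool *mp)
{
	void *obj = malloc(mp->objsize);

	if (!obj && mp->nr_free > 0)
		obj = mp->reserve[--mp->nr_free];
	return obj;
}

/* Refill the reserve before returning memory to the system. */
static void toy_mempool_free(struct toy_mempool *mp, void *obj)
{
	if (mp->nr_free < RESERVE)
		mp->reserve[mp->nr_free++] = obj;
	else
		free(obj);
}

int main(void)
{
	struct toy_mempool mp;

	if (toy_mempool_init(&mp, 128) == 0) {
		void *req = toy_mempool_alloc(&mp);	/* never NULL here */
		toy_mempool_free(&mp, req);
	}
	return 0;
}
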
btw, regarding the per-address_space node mask: I think we should free it
when the inode is clean (!mapping_tagged(PAGECACHE_TAG_DIRTY)). Chances
are, the inode will be dirty for 30 seconds and in-core for hours. We
might as well steal its nodemask storage and give it to the next file which
gets written to. A suitable place to do all this is in
__mark_inode_dirty(I_DIRTY_PAGES), using inode_lock to protect
address_space.dirty_page_nodemask.
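
A sketch of that allocate-on-dirty, free-on-clean lifecycle, modelled in
userspace with a pthread mutex standing in for inode_lock; the nodemask_t
layout and the function names are simplified stand-ins for illustration:

#include <stdlib.h>
#include <pthread.h>

typedef struct { unsigned long bits[2]; } nodemask_t;	/* stand-in */

static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

struct address_space {
	nodemask_t *dirty_page_nodemask;	/* NULL while clean */
};

/* Allocate the mask lazily when the mapping first gets a dirty page,
 * i.e. from __mark_inode_dirty(I_DIRTY_PAGES) in the real kernel. */
static void mapping_note_dirty_node(struct address_space *mapping, int node)
{
	pthread_mutex_lock(&inode_lock);
	if (!mapping->dirty_page_nodemask)
		mapping->dirty_page_nodemask = calloc(1, sizeof(nodemask_t));
	if (mapping->dirty_page_nodemask)
		mapping->dirty_page_nodemask->bits[node / 64] |=
			1UL << (node % 64);
	pthread_mutex_unlock(&inode_lock);
}

/* Free the mask once the inode is clean again, so the storage can be
 * reused by the next file that gets written to. */
static void mapping_clear_dirty_nodes(struct address_space *mapping)
{
	pthread_mutex_lock(&inode_lock);
	free(mapping->dirty_page_nodemask);
	mapping->dirty_page_nodemask = NULL;
	pthread_mutex_unlock(&inode_lock);
}

int main(void)
{
	struct address_space mapping = { 0 };

	mapping_note_dirty_node(&mapping, 3);	/* first dirty page, node 3 */
	mapping_clear_dirty_nodes(&mapping);	/* last dirty page cleaned */
	return 0;
}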