linux-kernel - Re: [RFC 0/8] Cpuset aware writeback

Message-Id: <20070116154054.e655f75c.akpm@osdl.org>
Date:	Tue, 16 Jan 2007 15:40:54 -0800
From:	Andrew Morton <akpm@...l.org>
To:	Christoph Lameter <clameter@....com>
Cc:	menage@...gle.com, linux-kernel@...r.kernel.org,
	nickpiggin@...oo.com.au, linux-mm@...ck.org, ak@...e.de,
	pj@....com, dgc@....com
Subject: Re: [RFC 0/8] Cpuset aware writeback

> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <clameter@....com> wrote:
>
> ...
>
> > > This may result in a large percentage of a cpuset
> > > to become dirty without writeout being triggered. Under NFS
> > > this can lead to OOM conditions.
> > 
> > OK, a big question: is this patchset a performance improvement or a
> > correctness fix?  Given the above, and the lack of benchmark results I'm
> > assuming it's for correctness.
> 
> It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom?  Because it allocates potentially-unbounded
> >   numbers of requests in the writeback path?
> > 
> >   It was able to go oom on non-numa machines before dirty-page-tracking
> >   went in.  So a general problem has now become specific to some NUMA
> >   setups.
> 
> 
> Right. The issue is that large portions of memory become dirty / 
> writeback since no writeback occurs because dirty limits are not checked 
> for a cpuset. Then NFS attempts writeout when doing LRU scans but is 
> unable to allocate memory.
>  
> >   So an obvious, equivalent and vastly simpler "fix" would be to teach
> >   the NFS client to go off-cpuset when trying to allocate these requests.
> 
> Yes we can fix these allocations by allowing processes to allocate from 
> other nodes. But then the container function of cpusets is no longer 
> there.

But that's what your patch already does!

It asks pdflush to write the pages instead of the direct-reclaim caller. 
The only reason pdflush doesn't go oom is that pdflush lives outside the
direct-reclaim caller's cpuset and is hence able to obtain those nfs
requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is fixed?)
> 
> NFS is okay as far as I can tell. dirty throttling works fine in non 
> cpuset environments because we throttle if 40% of memory becomes dirty or 
> under writeback.

Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
for existing IO to complete.  Back that up with a mempool for NFS requests
and the problem is solved, I think?
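
To make the "fail the allocation, recover, wait for existing IO" idea
concrete, here is a rough userspace model of a reserve pool for requests.
It is only a sketch and not the kernel's mempool API: struct req,
struct req_pool, POOL_MIN and the fail_malloc hook are all hypothetical
names invented for illustration. The point is that the allocator never
has to succeed: when both the normal allocation and the reserve are
exhausted the caller gets NULL and is expected to wait for in-flight
requests to complete and be freed back into the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 4              /* hypothetical size of the guaranteed reserve */

struct req {                    /* stand-in for a writeback request */
    int page_index;
};

struct req_pool {
    struct req *reserve[POOL_MIN];
    int nr_reserved;            /* how many reserve slots are currently filled */
    int fail_malloc;            /* illustration hook: simulate memory pressure */
};

/* Pre-fill the reserve while memory is still available. */
int pool_init(struct req_pool *p)
{
    assert(p != NULL);
    p->nr_reserved = 0;
    p->fail_malloc = 0;
    while (p->nr_reserved < POOL_MIN) {
        struct req *r = malloc(sizeof(*r));
        if (r == NULL) {
            /* Undo the partial fill so nothing leaks. */
            while (p->nr_reserved > 0)
                free(p->reserve[--p->nr_reserved]);
            return -1;
        }
        p->reserve[p->nr_reserved++] = r;
    }
    return 0;
}

/*
 * Try the normal allocator first, then fall back to the reserve.
 * NULL means "no memory and no reserve": the caller should wait for
 * existing IO to complete (which calls pool_free) and then retry.
 */
struct req *pool_alloc(struct req_pool *p)
{
    struct req *r;

    assert(p != NULL);
    r = p->fail_malloc ? NULL : malloc(sizeof(*r));
    if (r != NULL)
        return r;
    if (p->nr_reserved > 0)
        return p->reserve[--p->nr_reserved];
    return NULL;
}

/* Completed requests refill the reserve before going back to the heap. */
void pool_free(struct req_pool *p, struct req *r)
{
    assert(p != NULL && r != NULL);
    if (p->nr_reserved < POOL_MIN)
        p->reserve[p->nr_reserved++] = r;
    else
        free(r);
}
```

With a reserve like this, the number of requests in flight under memory
pressure is bounded by POOL_MIN, so forward progress depends only on IO
completing and not on further allocations succeeding.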

> > I don't understand why the proposed patches are cpuset-aware at all.  This
> > is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> > more general.  For example, i386 machines can presumably get into trouble
> > if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> > address that problem as well.  So I think it should all be per-zone?
> 
> No. A zone can be completely dirty as long as we are allowed to allocate 
> from other zones.

But we also can get into trouble if a *zone* is all-dirty.  Any solution to
the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks?  When page reclaim encounters a
> > dirty page on the zone LRU, we automatically know that page->mapping->host
> > has at least one dirty page in this zone, yes?  We could immediately ask
> 
> Yes, but when we enter reclaim most of the pages of a zone may already be 
> dirty/writeback so we fail.

No.  If the dirty limits become per-zone then no zone will ever have >40%
dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
reduction in that zone, throttling the dirtying process.  I suspect this
would work very badly in common situations with, say, typical i386 boxes.
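
For illustration, the per-zone check could look roughly like the
following userspace sketch. The 40% figure is the one discussed above;
everything else (struct zone_stats, the function names, the page counts
in the usage note) is hypothetical and is not the kernel's actual
accounting. It shows why a machine-wide check can pass while one small
zone is already completely dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DIRTY_RATIO_PCT 40UL    /* the 40% threshold discussed above */

struct zone_stats {             /* hypothetical per-zone counters, in pages */
    unsigned long nr_pages;
    unsigned long nr_dirty;
    unsigned long nr_writeback;
};

/* True when dirty + writeback pages have reached 40% of this zone. */
bool zone_over_dirty_limit(const struct zone_stats *z)
{
    assert(z != NULL);
    /* Cross-multiply instead of dividing to avoid integer truncation. */
    return (z->nr_dirty + z->nr_writeback) * 100UL
            >= z->nr_pages * DIRTY_RATIO_PCT;
}

/* The same test applied to the sum over all zones (the global view). */
bool global_over_dirty_limit(const struct zone_stats *zones, size_t n)
{
    struct zone_stats total = { 0, 0, 0 };
    size_t i;

    assert(zones != NULL);
    for (i = 0; i < n; i++) {
        total.nr_pages += zones[i].nr_pages;
        total.nr_dirty += zones[i].nr_dirty;
        total.nr_writeback += zones[i].nr_writeback;
    }
    return zone_over_dirty_limit(&total);
}
```

With made-up numbers such as a 4096-page zone that is entirely dirty
next to a 200000-page zone with 10000 dirty pages, the global check
stays below the limit while the per-zone check flags the small zone, at
which point the dirtying process would be throttled and writeout
started for that zone.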

