Message-Id: <1168933090.22935.30.camel@twins>
Date: Tue, 16 Jan 2007 08:38:10 +0100
From: Peter Zijlstra <a.p.zijlstra@...llo.nl>
To: Christoph Lameter <clameter@....com>
Cc: akpm@...l.org, Paul Menage <menage@...gle.com>,
linux-kernel@...r.kernel.org,
Nick Piggin <nickpiggin@...oo.com.au>, linux-mm@...ck.org,
Andi Kleen <ak@...e.de>, Paul Jackson <pj@....com>,
Dave Chinner <dgc@....com>
Subject: Re: [RFC 0/8] Cpuset aware writeback

On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim
> avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> from the available pages in a node. This allows us to
> accurately calculate the dirty ratio even if large portions
> of the node have been allocated for huge pages or for
> slab pages.

What about mlock'ed pages?

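To make point B concrete, here is a minimal user-space sketch, not kernel
code and not taken from the patch set. The struct and function names are
hypothetical, and the field unreclaimable_pages merely stands in for the
proposed NR_UNRECLAIMABLE counter. It assumes the dirty limit is a simple
percentage of the pages that remain after subtracting unreclaimable ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-node counters discussed above. */
struct node_model {
	unsigned long total_pages;         /* all pages on the node */
	unsigned long unreclaimable_pages; /* stands in for NR_UNRECLAIMABLE */
	unsigned long dirty_pages;         /* currently dirty pages */
};

/* Pages that can actually hold dirty page cache on this node. */
static unsigned long node_dirtyable_pages(const struct node_model *n)
{
	assert(n->unreclaimable_pages <= n->total_pages);
	return n->total_pages - n->unreclaimable_pages;
}

/* Dirty limit as a percentage of the dirtyable pages only. */
static unsigned long node_dirty_limit(const struct node_model *n,
				      unsigned int ratio_pct)
{
	return node_dirtyable_pages(n) * ratio_pct / 100;
}

/* True once the node holds more dirty pages than its limit allows. */
static bool node_over_dirty_limit(const struct node_model *n,
				  unsigned int ratio_pct)
{
	return n->dirty_pages > node_dirty_limit(n, ratio_pct);
}
```

With 1000 pages of which 600 are unreclaimable, a 40% ratio gives a limit
of 160 pages rather than 400, so writeout is triggered much earlier. Whether
mlock'ed pages should be counted the same way is left open here.
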
> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. For that platform we expand the inode structure by 128 bytes
> (to support 1024 nodes). The last patch attempts to address the issue
> by using the knowledge about the maximum possible number of nodes
> determined on bootup to shrink the nodemask.

Not the prettiest indeed; I have no better ideas though.

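The 128-byte figure follows directly from one bit per node. A small
user-space sketch of the arithmetic is below; nodemask_bytes() is a
hypothetical helper, not a kernel function, and it assumes the mask is
stored as an array of unsigned long words.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Bytes needed for a bitmap with one bit per node, rounded up to a
 * whole number of unsigned long words.
 */
static size_t nodemask_bytes(size_t nr_nodes)
{
	size_t words;

	assert(nr_nodes > 0);
	words = (nr_nodes + BITS_PER_WORD - 1) / BITS_PER_WORD;
	return words * sizeof(unsigned long);
}
```

nodemask_bytes(1024) is 128, matching the number quoted above, while a
machine that can only ever have a handful of nodes needs a single word.
That difference is what sizing the mask from the boot-time maximum would
save per inode.
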
> 2. The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre 2.6.18 performance (before the introduction of the ZVC counters)
> (only for cpuset based limit calculation). There is no way of keeping these
> counters per cpuset since cpusets may overlap.

Well, you gain functionality, you lose some runtime; sad but probably
worth it.
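
The runtime cost in point 2 comes from summing per-node counters on every
limit calculation. A minimal user-space sketch of such a loop follows. All
names are hypothetical, the cpuset is modelled as a 64-bit node mask, and
the limit is assumed to be a plain percentage of the summed dirtyable
pages; the actual get_dirty_limits changes in the patch set may differ.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES 64

/* Hypothetical per-node counters. */
struct node_counters {
	unsigned long dirtyable_pages;
	unsigned long dirty_pages;
};

/* Totals for the nodes of one cpuset. */
struct cpuset_dirty_state {
	unsigned long dirtyable;
	unsigned long dirty;
	unsigned long limit;
};

/*
 * Walk every node in the cpuset's mask and sum its counters. The cost
 * is linear in the number of nodes, and because cpusets may overlap
 * the sums cannot simply be cached per cpuset.
 */
static struct cpuset_dirty_state
cpuset_dirty_limit(const struct node_counters nodes[MODEL_MAX_NODES],
		   uint64_t cpuset_nodes, unsigned int ratio_pct)
{
	struct cpuset_dirty_state s = { 0, 0, 0 };
	unsigned int node;

	assert(ratio_pct <= 100);
	for (node = 0; node < MODEL_MAX_NODES; node++) {
		if (!(cpuset_nodes & ((uint64_t)1 << node)))
			continue;
		s.dirtyable += nodes[node].dirtyable_pages;
		s.dirty += nodes[node].dirty_pages;
	}
	s.limit = s.dirtyable * ratio_pct / 100;
	return s;
}
```

For a cpuset covering two nodes with 1000 and 2000 dirtyable pages, a 40%
ratio yields a limit of 1200 pages, independent of what the rest of the
system holds.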

Otherwise it all looks good.

Acked-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>