
Date:	Wed, 9 Jun 2010 11:52:11 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Mel Gorman <mel@....ul.ie>
Cc:	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-mm@...ck.org, Dave Chinner <david@...morbit.com>,
	Chris Mason <chris.mason@...cle.com>,
	Nick Piggin <npiggin@...e.de>, Rik van Riel <riel@...hat.com>
Subject: Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaim
 and use a_ops->writepages() where possible

On Tue,  8 Jun 2010 10:02:19 +0100
Mel Gorman <mel@....ul.ie> wrote:

> I finally got a chance last week to visit the topic of direct reclaim
> avoiding writing out pages. As it came up during discussions the last
> time, I also had a stab at making the VM write ranges of pages instead
> of individual pages. I am not proposing this for merging yet; I want to see
> what people think of this general direction and whether we can agree that
> it is the right one.
> 
> To summarise, there are two big problems with page reclaim right now. The
> first is that page reclaim uses a_ops->writepage to write a page back
> under the page lock which is inefficient from an IO perspective due to
> seeky patterns.  The second is that direct reclaim calling the filesystem
> splices two potentially deep call paths together and potentially overflows
> the stack on complex storage or filesystems. This series is an early draft
> at tackling both of these problems and is in three stages.
> 
> The first 4 patches are a forward-port of trace points that are partly
> based on trace points defined by Larry Woodman but never merged. They trace
> parts of kswapd, direct reclaim, LRU page isolation and page writeback. The
> tracepoints can be used to evaluate what is happening within reclaim and
> whether things are getting better or worse. They do not have to be part of
> the final series but might be useful during discussion.
> 
> Patch 5 writes out contiguous ranges of pages where possible using
> a_ops->writepages. When writing a range, the inode is pinned and the page
> lock released before submitting to writepages(). This potentially generates
> a better IO pattern and it should avoid a lock inversion problem within the
> filesystem that wants the same page lock held by the VM. The downside with
> writing ranges is that the VM may be generating more IO than necessary.
> 
> Patch 6 prevents direct reclaim writing out pages at all and instead dirty
> pages are put back on the LRU. For lumpy reclaim, the caller will briefly
> wait on dirty pages to be written out before trying to reclaim the dirty
> pages a second time.
> 
> The last patch increases the responsibility of kswapd somewhat because
> it's now cleaning pages on behalf of direct reclaimers but kswapd seemed
> a better fit than background flushers to clean pages as it knows where the
> pages needing cleaning are. As it's async IO, it should not cause kswapd to
> stall (at least until the queue is congested) but the order that pages are
> reclaimed on the LRU is altered. Dirty pages that would have been reclaimed
> by direct reclaimers are getting another lap on the LRU. The dirty pages
> could have been put on a dedicated list but this increased counter overhead
> and the number of lists and it is unclear if it is necessary.
> 
> The series has survived performance and stress testing, particularly around
> high-order allocations on X86, X86-64 and PPC64. The results of the tests
> showed that while lumpy reclaim has a slightly lower success rate when
> allocating huge pages, the rates were still very acceptable, reclaim was
> a lot less disruptive and allocation latency was lower.
> 
> Comments?
> 

My concern is how memcg should work. IOW, what changes will be necessary for
memcg to work with the new vmscan logic, which does no direct writeback?

Maybe an ideal solution would be:
 - support buffered I/O tracking in the I/O cgroup.
 - flusher threads should work with the I/O cgroup.
 - memcg itself should support a dirty ratio, and add a trigger to kick flusher
   threads for dirty pages in a memcg.
But I know that is a long way off.

How does the new logic work with memcg? Because memcg doesn't trigger kswapd,
does memcg have to wait for a flusher thread to make pages clean?
Or should memcg have a kswapd-for-memcg?

Is it okay to call writeback directly when !scanning_global_lru()?
memcg's reclaim routine is only called from specific positions, so I guess
there is no stack problem. But we would still have the I/O pattern problem.
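
To make the question above concrete, here is a rough userspace sketch of the
policy being asked about. It is not kernel code: struct scan_ctx, enum
reclaimer and may_write_dirty_page() are made-up names for illustration, and
the global_lru field only models what scanning_global_lru() reports. It
assumes kswapd keeps writing pages, global direct reclaim does not, and memcg
reclaim is allowed as an exception.

```c
#include <stdbool.h>

/* Hypothetical model of who is running reclaim; not a kernel type. */
enum reclaimer {
	RECLAIM_KSWAPD,
	RECLAIM_DIRECT,
};

/* Hypothetical, cut-down stand-in for the reclaim scan context. */
struct scan_ctx {
	enum reclaimer who;
	bool global_lru;	/* models scanning_global_lru(); false for memcg reclaim */
};

/*
 * Returns true if the reclaimer may write a dirty page itself.
 * Returning false models "put the dirty page back on the LRU".
 */
bool may_write_dirty_page(const struct scan_ctx *sc)
{
	/* kswapd cleans pages on behalf of direct reclaimers. */
	if (sc->who == RECLAIM_KSWAPD)
		return true;

	/* Assumed exception: memcg reclaim has a shallow, known call path. */
	if (!sc->global_lru)
		return true;

	/* Global direct reclaim: avoid splicing deep call paths together. */
	return false;
}
```

With this rule only global direct reclaim skips writeback, so the stack depth
concern is handled there, while memcg reclaim would still issue
page-at-a-time I/O.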

Thanks,
-Kame