Message-ID: <20100727152147.GB5184@localhost>
Date: Tue, 27 Jul 2010 23:21:47 +0800
From: Wu Fengguang <fengguang.wu@...el.com>
To: Mel Gorman <mel@....ul.ie>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Dave Chinner <david@...morbit.com>,
Chris Mason <chris.mason@...cle.com>,
Nick Piggin <npiggin@...e.de>, Rik van Riel <riel@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Christoph Hellwig <hch@...radead.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when
reclaim is encountering dirty pages
On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > + /*
> > > > > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > > > > + * the dirty_ratio may be satisfied. In this case, wake
> > > > > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > > > > + */
> > > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > >
> > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are driving the
> > > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > >
> > > > > > >
> > > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > > sc->may_writepage.
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > > > Another minor issue is that the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, which is
> > > > > > > > good for efficiency. And it seems good to let the flusher write much
> > > > > > > > more than nr_dirty pages, to safeguard a reasonably large
> > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > > update the comments.
> > > > > > > >
> > > > > > >
> > > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > > use MAX_WRITEBACK_PAGES.
> > > > > >
> > > > > > It would be better to pass something like (nr_dirty * N).
> > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > > obviously too large as a parameter. When the batch size is increased
> > > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > > nr_pages limit too much.
> > > > > >
> > > > >
> > > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > > arbitrary.
> > > >
> > > > I'd recommend N to be a large value. It's no longer relevant now since
> > > > we'll call the flusher to sync some range containing the target page.
> > > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > > IO. It needs to be a large value, otherwise the vmscan code will
> > > > quickly run into dirty pages again..
> > > >
> > >
> > > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > > not cause congestion".
> >
> > Under memory pressure, the disk should be busy/congested anyway.
>
> Not necessarily. It could be streaming reads where pages are being added
> to the LRU quickly but not necessarily dominated by dirty pages. Due to the
> scanning rate, a dirty page may be encountered but it could be rare.
Right.
> > The big 4MB adds much work; however, many of the pages may need to be
> > synced in the near future anyway. It also requires more time to do
> > the bigger IO, hence adding some latency; however, the latency should
> > be a small factor compared to the IO queue time (which will be long
> > for a busy disk).
> >
> > The overall expectation is: the more efficient the IO, the more progress :)
> >
>
> Ok.
>
> > > The end result is
> > >
> > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > {
> > > return laptop_mode ? 0 :
> > > min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > }
> > >
> > > nr_writeback_pages(nr_dirty) is what gets passed to
> > > wakeup_flusher_threads(). Does that seem sensible?
> >
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> >
> > laptop_mode ? 0 : (nr_dirty * 16)
> >
>
> I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because,
> while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might exceed it due to
> lumpy reclaim. I wanted to firmly bound how much writeback was being
> requested - hence the mild complexity.
OK.
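(For reference, with 4KB pages that bounds things at MAX_WRITEBACK =
4194304 >> 12 = 1024 pages, and WRITEBACK_FACTOR = 1024 / 32 = 32 for
SWAP_CLUSTER_MAX == 32; so a typical nr_dirty of 8 asks for 256 pages,
and anything from 32 dirty pages up is capped at the full 4MB.)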
> > On top of this, we may write another patch to convert the
> > wakeup_flusher_threads(bdi, nr_pages) call to some
> > bdi_start_inode_writeback(inode, offset) call, to start more targeted
> > writeback.
> >
>
> I did a first pass at optimising based on prioritising inodes related to
> dirty pages. It's incredibly primitive and I have to sit down and see
> how the whole of writeback is put together to improve on it. Maybe
> you'll spot something simple or see if it's the totally wrong direction.
> Patch is below.
The simplest style may be

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= MAX_WRITEBACK_PAGES,
		.range_cyclic	= 1,	/* honour ->writeback_index */
	};

	mapping->writeback_index = offset;
	return do_writepages(mapping, &wbc);

(range_cyclic is needed, otherwise write_cache_pages() ignores
->writeback_index.) But surely there will be many details to handle.
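Filled out a little, an untested sketch of bdi_start_inode_writeback()'s
core might look as follows. The locking and the handoff to the flusher
thread are glossed over; only the names already discussed in this thread
are used.

	/*
	 * Sketch only: start async writeback on the inode that reclaim
	 * stumbled on, beginning at @offset. A real implementation would
	 * queue this work to the flusher thread with proper locking.
	 */
	long bdi_start_inode_writeback(struct inode *inode, pgoff_t offset)
	{
		struct address_space *mapping = inode->i_mapping;
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= MAX_WRITEBACK_PAGES,
			.range_cyclic	= 1,
		};

		/* start the cyclic scan at the page reclaim found */
		mapping->writeback_index = offset;
		return do_writepages(mapping, &wbc);
	}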
> > When talking about the 4MB optimization, I was referring to the internal
> > implementation of bdi_start_inode_writeback(). Sorry for the missing
> > context in the previous email.
> >
>
> No worries, I was assuming it was something in mainline I didn't know
> yet :)
>
> > It may need a big patch to implement bdi_start_inode_writeback().
> > Would you like to try it, or leave the task to me?
> >
>
> If you send me a patch, I can try it out but it's not my highest
> priority right now. I'm still looking to get writeback-from-reclaim down
> to a reasonable level without causing a large amount of churn.
OK. That's already great work.
> Here is the first pass anyway at kicking wakeup_flusher_threads() for
> inodes belonging to a list of pages. You'll note that I do nothing with
> page offset because I didn't spot a simple way of taking that
> information into account. It's also horrible from a locking perspective.
> So far, its testing has been "it didn't crash".
It seems a neat way to prioritize the inodes with a new flag
I_DIRTY_RECLAIM. However it may require a vastly different
implementation once the offset is taken into account. I'll try to
work up a prototype tomorrow.
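If passing the offset down turns out to be feasible, the ranged form
might look like this (a pure sketch: the helper name and the 4MB window
are assumptions, chosen for efficient IO as discussed above):

	/*
	 * Sketch: clean a 4MB window starting at the dirty page reclaim
	 * found, instead of flagging the whole inode, so the flusher does
	 * efficient IO and vmscan gains some clean-page headroom.
	 */
	static int writeback_inode_window(struct address_space *mapping,
					  pgoff_t offset)
	{
		loff_t start = (loff_t)offset << PAGE_CACHE_SHIFT;
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= MAX_WRITEBACK_PAGES,
			.range_start	= start,
			.range_end	= start + (4 << 20) - 1,
		};

		return do_writepages(mapping, &wbc);
	}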
Thanks,
Fengguang
> ==== CUT HERE ====
> writeback: Prioritise dirty inodes encountered by reclaim for background flushing
>
> It is preferable that as few dirty pages as possible are dispatched for
> cleaning from the page reclaim path. When dirty pages are encountered by
> page reclaim, this patch marks their inodes for immediate dispatch. When
> the background flusher runs, it moves such inodes to the dispatch queue
> straight away, regardless of inode age.
>
> This is an early prototype. It could be optimised to not take the
> inode lock repeatedly, and ideally the page offset would also be
> taken into account.
>
> Signed-off-by: Mel Gorman <mel@....ul.ie>
> ---
> fs/fs-writeback.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h | 5 ++-
> include/linux/writeback.h | 1 +
> mm/vmscan.c | 6 +++-
> 4 files changed, 59 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a3c764..27a8b75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> struct super_block *sb = NULL;
> - struct inode *inode;
> + struct inode *inode, *tinode;
> int do_sb_sort = 0;
>
> if (wbc->for_kupdate || wbc->for_background) {
> @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> older_than_this = jiffies - expire_interval;
> }
>
> + /* Move inodes reclaim found at end of LRU to dispatch queue */
> + list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
> + if (inode->i_state & I_DIRTY_RECLAIM) {
> + inode->i_state &= ~I_DIRTY_RECLAIM;
> + list_move(&inode->i_list, &tmp);
> + }
> + }
> +
> while (!list_empty(delaying_queue)) {
> inode = list_entry(delaying_queue->prev, struct inode, i_list);
> if (expire_interval &&
> @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
> rcu_read_unlock();
> }
>
> +/*
> + * Similar to wakeup_flusher_threads except it prioritises inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> + struct page *page;
> + struct address_space *mapping;
> + struct inode *inode;
> +
> + list_for_each_entry(page, page_list, lru) {
> + if (!PageDirty(page))
> + continue;
> +
> + lock_page(page);
> + mapping = page_mapping(page);
> + if (!mapping || mapping == &swapper_space)
> + goto unlock;
> +
> + /*
> + * Test outside the lock to see if it is already set; taking the
> + * inode lock would be a waste, and the inode should be pinned by
> + * the lock_page
> + */
> + inode = page->mapping->host;
> + if (inode->i_state & I_DIRTY_RECLAIM)
> + goto unlock;
> +
> + /*
> + * XXX: Yuck, has to be a way of batching this by not requiring
> + * the page lock to pin the inode
> + */
> + spin_lock(&inode_lock);
> + inode->i_state |= I_DIRTY_RECLAIM;
> + spin_unlock(&inode_lock);
> +unlock:
> + unlock_page(page);
> + }
> +
> + wakeup_flusher_threads(nr_pages);
> +}
> +
> static noinline void block_dump___mark_inode_dirty(struct inode *inode)
> {
> if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e29f0ed..8836698 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1585,8 +1585,8 @@ struct super_operations {
> /*
> * Inode state bits. Protected by inode_lock.
> *
> - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
> + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
> *
> * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> @@ -1633,6 +1633,7 @@ struct super_operations {
> #define I_DIRTY_SYNC 1
> #define I_DIRTY_DATASYNC 2
> #define I_DIRTY_PAGES 4
> +#define I_DIRTY_RECLAIM 256
> #define __I_NEW 3
> #define I_NEW (1 << __I_NEW)
> #define I_WILL_FREE 16
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 494edd6..73a4df2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
> struct writeback_control *wbc);
> long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
> void wakeup_flusher_threads(long nr_pages);
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
>
> /* writeback.h requires fs.h; it, too, is not included from here. */
> static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b66d1f5..bad1abf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,8 @@ keep:
> * laptop mode avoiding disk spin-ups
> */
> if (file && nr_dirty_seen && sc->may_writepage)
> - wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> + wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
> + page_list);
>
> *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> list_add(&page->lru, &putback_list);
> }
>
> - wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> + wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
> + &page_list);
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*