[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100916091824.GB15709@csn.ul.ie>
Date: Thu, 16 Sep 2010 10:18:24 +0100
From: Mel Gorman <mel@....ul.ie>
To: Minchan Kim <minchan.kim@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org,
Linux Kernel List <linux-kernel@...r.kernel.org>,
Johannes Weiner <hannes@...xchg.org>,
Wu Fengguang <fengguang.wu@...el.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Subject: Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not
being encountered in the current zone
On Thu, Sep 16, 2010 at 05:13:38PM +0900, Minchan Kim wrote:
> On Wed, Sep 15, 2010 at 01:27:51PM +0100, Mel Gorman wrote:
> > If wait_iff_congested() is called with no BDI congested, the function simply
> > calls cond_resched(). In the event there is significant writeback happening
> > in the zone that is being reclaimed, this can be a poor decision as reclaim
> > would succeed once writeback was completed. Without any backoff logic,
> > younger clean pages can be reclaimed resulting in more reclaim overall and
> > poor performance.
>
> I agree.
>
> >
> > This patch tracks how many pages backed by a congested BDI were found during
> > scanning. If all the dirty pages encountered on a list isolated from the
> > LRU belong to a congested BDI, the zone is marked congested until the zone
>
> I am not sure it works well.
Check the competion times for the micro-mapped-file-stream benchmark in
the leader mail. Backing off like this is faster overall for some
workloads.
> We just met the condition once but we backoff it until high watermark.
Reaching the high watermark is considered to be a relieving of pressure.
> (ex, 32 isolated dirty pages == 32 pages on congestioned bdi)
> First impression is rather _aggressive_.
>
Yes, it is. I intended to start with something quite aggressive that is
close to existing behaviour and then experiment with alternatives.
For example, I considered clearing zone congestion when but nr_bdi_congested
drops to 0. This would be less aggressive in terms of congestion waiting but
it is further from todays behaviour. I felt it would be best to introduce
wait_iff_congested() in one kernel cycle but wait to a later cycle to deviate
a lot from congestion_wait().
> How about more checking?
> For example, if above pattern continues repeately above some threshold,
> we can regard "zone is congested" and then if the pattern isn't repeated
> during some threshold, we can regard "zone isn't congested any more.".
>
I also considered these options and got stuck at what the "some
threshold" is and how to record the history. Should it be recorded on a
per BDI basis for example? I think all these questions can be answered
but should be in a different cycle.
> > reaches the high watermark. wait_iff_congested() then checks both the
> > number of congested BDIs and if the current zone is one that has encounted
> > congestion recently, it will sleep on the congestion queue. Otherwise it
> > will call cond_reched() to yield the processor if necessary.
> >
> > The end result is that waiting on the congestion queue is avoided when
> > necessary but when significant congestion is being encountered,
> > reclaimers and page allocators will back off.
> >
> > Signed-off-by: Mel Gorman <mel@....ul.ie>
> > ---
> > include/linux/backing-dev.h | 2 +-
> > include/linux/mmzone.h | 8 ++++
> > mm/backing-dev.c | 23 ++++++++----
> > mm/page_alloc.c | 4 +-
> > mm/vmscan.c | 83 +++++++++++++++++++++++++++++++++++++------
> > 5 files changed, 98 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 72bb510..f1b402a 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > +static enum bdi_queue_status may_write_to_queue(struct backing_dev_info *bdi,
>
> <snip>
>
> > struct scan_control *sc)
> > {
> > + enum bdi_queue_status ret = QUEUEWRITE_DENIED;
> > +
> > if (current->flags & PF_SWAPWRITE)
> > - return 1;
> > + return QUEUEWRITE_ALLOWED;
> > if (!bdi_write_congested(bdi))
> > - return 1;
> > + return QUEUEWRITE_ALLOWED;
> > + else
> > + ret = QUEUEWRITE_CONGESTED;
> > if (bdi == current->backing_dev_info)
> > - return 1;
> > + return QUEUEWRITE_ALLOWED;
> >
> > /* lumpy reclaim for hugepage often need a lot of write */
> > if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> > - return 1;
> > - return 0;
> > + return QUEUEWRITE_ALLOWED;
> > + return ret;
> > }
>
> The function can't return QUEUEXXX_DENIED.
> It can affect disable_lumpy_reclaim.
>
Yes, but that change was made in "vmscan: Narrow the scenarios lumpy
reclaim uses synchrounous reclaim". Maybe I am misunderstanding your
objection.
> >
> > /*
> > @@ -352,6 +362,8 @@ static void handle_write_error(struct address_space *mapping,
> > typedef enum {
> > /* failed to write page out, page is locked */
> > PAGE_KEEP,
> > + /* failed to write page out due to congestion, page is locked */
> > + PAGE_KEEP_CONGESTED,
> > /* move page to the active list, page is locked */
> > PAGE_ACTIVATE,
> > /* page has been sent to the disk successfully, page is unlocked */
> > @@ -401,9 +413,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> > }
> > if (mapping->a_ops->writepage == NULL)
> > return PAGE_ACTIVATE;
> > - if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
> > + switch (may_write_to_queue(mapping->backing_dev_info, sc)) {
> > + case QUEUEWRITE_CONGESTED:
> > + return PAGE_KEEP_CONGESTED;
> > + case QUEUEWRITE_DENIED:
> > disable_lumpy_reclaim_mode(sc);
> > return PAGE_KEEP;
> > + case QUEUEWRITE_ALLOWED:
> > + ;
> > }
> >
> > if (clear_page_dirty_for_io(page)) {
> > @@ -682,11 +699,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > * shrink_page_list() returns the number of reclaimed pages
> > */
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > + struct zone *zone,
> > struct scan_control *sc)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > int pgactivate = 0;
> > + unsigned long nr_dirty = 0;
> > + unsigned long nr_congested = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> > @@ -706,6 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > goto keep;
> >
> > VM_BUG_ON(PageActive(page));
> > + VM_BUG_ON(page_zone(page) != zone);
> >
> > sc->nr_scanned++;
> >
> > @@ -783,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> >
> > if (PageDirty(page)) {
> > + nr_dirty++;
> > +
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > goto keep_locked;
> > if (!may_enter_fs)
> > @@ -792,6 +815,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >
> > /* Page is dirty, try to write it out here */
> > switch (pageout(page, mapping, sc)) {
> > + case PAGE_KEEP_CONGESTED:
> > + nr_congested++;
> > case PAGE_KEEP:
> > goto keep_locked;
> > case PAGE_ACTIVATE:
> > @@ -903,6 +928,15 @@ keep_lumpy:
> > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > }
> >
> > + /*
> > + * Tag a zone as congested if all the dirty pages encountered were
> > + * backed by a congested BDI. In this case, reclaimers should just
> > + * back off and wait for congestion to clear because further reclaim
> > + * will encounter the same problem
> > + */
> > + if (nr_dirty == nr_congested)
> > + zone_set_flag(zone, ZONE_CONGESTED);
> > +
> > free_page_list(&free_pages);
> >
> > list_splice(&ret_pages, page_list);
> > @@ -1387,12 +1421,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >
> > spin_unlock_irq(&zone->lru_lock);
> >
> > - nr_reclaimed = shrink_page_list(&page_list, sc);
> > + nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> >
> > /* Check if we should syncronously wait for writeback */
> > if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> > set_lumpy_reclaim_mode(priority, sc, true);
> > - nr_reclaimed += shrink_page_list(&page_list, sc);
> > + nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> > }
> >
> > local_irq_disable();
> > @@ -1940,8 +1974,26 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >
> > /* Take a nap, wait for some writeback to complete */
> > if (!sc->hibernation_mode && sc->nr_scanned &&
> > - priority < DEF_PRIORITY - 2)
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + priority < DEF_PRIORITY - 2) {
> > + struct zone *active_zone = NULL;
> > + unsigned long max_writeback = 0;
> > + for_each_zone_zonelist(zone, z, zonelist,
> > + gfp_zone(sc->gfp_mask)) {
> > + unsigned long writeback;
> > +
> > + /* Initialise for first zone */
> > + if (active_zone == NULL)
> > + active_zone = zone;
> > +
> > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > + if (writeback > max_writeback) {
> > + max_writeback = writeback;
> > + active_zone = zone;
> > + }
> > + }
> > +
> > + wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> > + }
>
> Other place just considers preferred zone.
> What is the rationale that consider max writeback zone in all zone of zonelist to
> call wait_iff_congeested?
Initially, it was because wait_iff_congested() heuristic was based on
writeback, not zone congestion. This time around, it was because I
wanted to be aggressive in terms of triggering the congestion wait to be
better than existing behaviour but not too far from it.
> Maybe max writeback zone can be much slow bdi but this process could be not related
> to the bdi. It can make random stall by point of view of this proces.
>
Fair point, I will retest using the preferred zone.
> > }
> >
> > out:
> > @@ -2251,6 +2303,15 @@ loop_again:
> > if (!zone_watermark_ok(zone, order,
> > min_wmark_pages(zone), end_zone, 0))
> > has_under_min_watermark_zone = 1;
> > + } else {
> > + /*
> > + * If a zone reaches its high watermark,
> > + * consider it to be no longer congested. It's
> > + * possible there are dirty pages backed by
> > + * congested BDIs but as pressure is relieved,
> > + * spectulatively avoid congestion waits
> > + */
> > + zone_clear_flag(zone, ZONE_CONGESTED);
> > }
> >
> > }
> > --
> > 1.7.1
> >
>
> --
> Kind regards,
> Minchan Kim
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists