Message-ID: <20100219151934.GA1445@csn.ul.ie>
Date: Fri, 19 Feb 2010 15:19:34 +0000
From: Mel Gorman <mel@....ul.ie>
To: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
Cc: Nick Piggin <npiggin@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com, hare@...e.de,
gregkh@...ell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when
__GFP_COLD is set"
On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED    fast path    slow path
> >>>> GOOD CASE        ~62          ~1.46
> >>>> BAD CASE         ~62         ~37
> >>> 5f8dcc21 introduced per-migrate-type pcp lists; is it possible that
> >>> we run into a scenario where try_to_free frees a lot of pages, but
> >>> of the wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on
> >> the PCP lists, they get drained to the buddy lists and later picked up
> >> again by __rmqueue_fallback(). I had considered the possibility of
> >> pages of the wrong type being on the PCP lists, which led to the patch
> >> "page-allocator: Fallback to other per-cpu lists when the target list
> >> is empty and memory is low", but you reported it made no difference
> >> even when fallback was allowed with high watermarks.
> >> [...]
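The drain-on-threshold path referred to above looks roughly like this - a
simplified sketch of the 2.6.32-era free path, not the exact code:

	/* free a page to the per-cpu list of its migrate type */
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		/* over the threshold: hand a batch back to the buddy
		 * lists, where __rmqueue_fallback() can later give the
		 * pages to allocations of another migrate type */
		free_pcppages_bulk(zone, pcp->batch, pcp);
		pcp->count -= pcp->batch;
	}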
> >
> > Today I created rather huge debug logs - I'll spare everyone too much
> > detail.
> > Eventually it comes down to some kind of 'cat /proc/zoneinfo'-like output,
> > extended to list everything per migrate type too.
> >
> > From that I still think there should be plenty of pages to get the
> > allocation through, as suggested by the high number of pages freed by
> > try_to_free.
> [...]
>
> > More about that tomorrow,
>
> Well tomorrow is now, and I think I got some important new insights.
>
> As mentioned before, I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where exactly the allocation fails (patch
> attached).
>
> Now, with debugme active in a second call after direct_reclaim hit
> progress && !page, I saw the following repeating pattern in most of the
> cases:
> get_page_from_freelist - zone loop - current zone 'DMA'
> get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
> get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
> get_page_from_freelist - return page '(null)'
>
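That trace corresponds to this part of the zone loop in
get_page_from_freelist() - simplified from the 2.6.3x mm/page_alloc.c, with
the debugme markers omitted:

	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
		unsigned long mark;

		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
		if (!zone_watermark_ok(zone, order, mark,
				       classzone_idx, alloc_flags)) {
			/* without CONFIG_NUMA, zone_reclaim_mode stays 0,
			 * so the zone is simply skipped - and with no
			 * other zone to fall back to, the call returns
			 * NULL */
			if (zone_reclaim_mode == 0)
				goto this_zone_full;
		}
	}
	page = buffered_rmqueue(preferred_zone, zone, order,
				gfp_mask, migratetype);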
> Ok - now we at least know exactly why it gets no page.
> Remember there are plenty of pages, as shown in my zoneinfo-like report in
> the last mail.
> I didn't expect it, but it is actually the watermarks that stop the
> allocations here.
That is somewhat expected. We also don't want to go underneath them
because that can lead to system deadlock.
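For reference, the failing check is roughly this - zone_watermark_ok()
simplified to the order-0 case:

	long min = mark;	/* e.g. the zone's low watermark */
	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;	/* __GFP_HIGH may dig deeper */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

	/* lowmem_reserve keeps pages back for lower zones */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;	/* watermark not met, no page */

Plain page-cache allocations set neither ALLOC_HIGH nor ALLOC_HARDER, so
they must clear the full watermark.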
> The zone_watermark_ok check reports that there are not enough pages for the
> current watermark, and eventually it comes down to zone_reclaim_mode, which
> is always zero on s390 as we are not CONFIG_NUMA.
>
> Ok, remember my scenario - several parallel iozone sequential read processes
> - there's not much allocation going on except for the page cache read-ahead
> related to that read workload.
> The allocations for page cache seem to have no special watermarks selected
> via their GFP flags and therefore get stalled by congestion_wait - which,
> with no writes in flight to complete, consumes its full timeout.
>
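For illustration, read-ahead pages come from page_cache_alloc_cold(), which
uses the mapping's GFP mask (typically GFP_HIGHUSER_MOVABLE) - nothing in
there relaxes the watermark check:

	/* include/linux/pagemap.h, 2.6.3x */
	static inline struct page *page_cache_alloc_cold(struct address_space *x)
	{
		/* no __GFP_HIGH here: a plain movable user allocation
		 * that has to pass the normal low watermark */
		return __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD);
	}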
Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense, except that restoring something similar to the old
behaviour didn't help either.
> As I see significant impacts on the iozone throughput, and not only e.g.
> bad counters in direct_reclaim, the congestion_wait stalls seem to happen
> often enough, and last long enough, to stall the application I/O itself.
>
> That means that from the time VFS starts a read-ahead, the page allocation
> stalls long enough that the data is not ready when the application tries to
> read it, while it would be if the allocation were fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.
>
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c 2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c 2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
>
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + /*
> + * If it gets here try it without watermarks before going
> + * to sleep.
> + *
> + * This will end up in alloc_high_priority and if that fails
> + * once more direct_reclaim but this time without watermark
> + * checks.
> + *
> + * FIXME: that is just for verification - a real fix needs to
> + * ensure e.g. page cache allocations don't drain all pages
> + * under watermark
> + */
> + if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> + alloc_flags &= ALLOC_NO_WATERMARKS;
> + else
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
> }
>
> This fixes all the issues I have, but as stated in the FIXME it is
> unfortunately not a fix for the real world.
It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there, but anyway.
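As written, "alloc_flags &= ALLOC_NO_WATERMARKS" masks away every other
flag and, since ALLOC_NO_WATERMARKS was not set, leaves alloc_flags == 0.
Presumably the intended hunk was:

	if (!(alloc_flags & ALLOC_NO_WATERMARKS))
		alloc_flags |= ALLOC_NO_WATERMARKS;	/* set the bit */
	else
		congestion_wait(BLK_RW_ASYNC, HZ/50);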
> Unfortunately, even now, knowing the place of the issue so well, I don't
> see the connection to commits e084b+5f8dcc21
Still a mystery.
> - I couldn't find anything, but
> did they change the accounting somewhere, or e.g. change the timing/order
> of watermark updates and allocations?
>
Not that I can think of.
> Eventually it might come down to a discussion of allocation priorities, and
> we might even keep them as-is and accept this issue - but I would still
> prefer a good second-chance implementation, other page cache allocation
> flags, or something else that explicitly solves this issue.
>
Along those lines, the patch that replaced congestion_wait() with a
waitqueue makes some sense.
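Something like the following - a hypothetical sketch of the idea only,
with a made-up watermark_wait waitqueue, not the actual patch:

	/* instead of sleeping for the full congestion timeout, sleep on
	 * a per-zone waitqueue that the page-free path wakes once the
	 * low watermark is met again */
	wait_event_timeout(zone->watermark_wait,
			   zone_watermark_ok(zone, order,
					     low_wmark_pages(zone),
					     classzone_idx, alloc_flags),
			   HZ/50);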
> Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> becoming available again is definitely a step in the right direction and
> should go into upstream and the long term support branches.
I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.
When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab