Message-ID: <20100219151934.GA1445@csn.ul.ie>
Date: Fri, 19 Feb 2010 15:19:34 +0000
From: Mel Gorman <mel@....ul.ie>
To: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
Cc: Nick Piggin <npiggin@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com, hare@...e.de,
gregkh@...ell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when
__GFP_COLD is set"
On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED    fast path    slow path
> >>>> GOOD CASE        ~62          ~1.46
> >>>> BAD CASE         ~62         ~37
> >>> 5f8dcc21 introduced per-migrate-type pcp lists; is it possible that
> >>> we run into a scenario where try_to_free frees a lot of pages, but
> >>> of the wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on
> >> the PCP lists, they get drained to the buddy lists and later picked up
> >> again by __rmqueue_fallback(). I had considered the possibility of
> >> pages of the wrong type being on the PCP lists, which led to the patch
> >> "page-allocator: Fallback to other per-cpu lists when the target list
> >> is empty and memory is low", but you reported it made no difference
> >> even when fallback was allowed with high watermarks.
> >> [...]
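The drain-on-threshold path referred to above looks roughly like this - a
simplified sketch of the 2.6.32-era free path, not the exact code:

	/* free a page to the per-cpu list of its migrate type */
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high) {
		/* over the threshold: hand a batch back to the buddy
		 * lists, where __rmqueue_fallback() can later give the
		 * pages to allocations of another migrate type */
		free_pcppages_bulk(zone, pcp->batch, pcp);
		pcp->count -= pcp->batch;
	}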
> >
> > Today I created rather huge debug logs - I'll spare everyone too much
> > detail.
> > Eventually it comes down to some kind of 'cat /proc/zoneinfo'-like output,
> > extended to list everything per migrate type too.
> >
> > From that I still think there should be plenty of pages to get the
> > allocation through, as suggested by the high number of pages freed by
> > try_to_free.
> [...]
>
> > More about that tomorrow,
>
> Well tomorrow is now, and I think I got some important new insights.
>
> As mentioned before, I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where exactly the allocation fails (patch
> attached).
>
> Now, with debugme active in a second call after direct_reclaim hit
> progress && !page, I saw the following repeating pattern in most of the
> cases:
> get_page_from_freelist - zone loop - current zone 'DMA'
> get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
> get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
> get_page_from_freelist - return page '(null)'
>
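That trace corresponds to this part of the zone loop in
get_page_from_freelist() - simplified from the 2.6.3x mm/page_alloc.c, with
the debugme markers omitted:

	if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
		unsigned long mark;

		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
		if (!zone_watermark_ok(zone, order, mark,
				       classzone_idx, alloc_flags)) {
			/* without CONFIG_NUMA, zone_reclaim_mode stays 0,
			 * so the zone is simply skipped - and with no
			 * other zone to fall back to, the call returns
			 * NULL */
			if (zone_reclaim_mode == 0)
				goto this_zone_full;
		}
	}
	page = buffered_rmqueue(preferred_zone, zone, order,
				gfp_mask, migratetype);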
> Ok - now we at least know exactly why it gets no page.
> Remember there are plenty of pages, as shown in my zoneinfo-like report in
> the last mail.
> I didn't expect it, but it is actually the watermarks that stop the
> allocations here.
That is somewhat expected. We also don't want to go underneath them
because that can lead to system deadlock.
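For reference, the failing check is roughly this - zone_watermark_ok()
simplified to the order-0 case:

	long min = mark;	/* e.g. the zone's low watermark */
	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;	/* __GFP_HIGH may dig deeper */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

	/* lowmem_reserve keeps pages back for lower zones */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;	/* watermark not met, no page */

Plain page-cache allocations set neither ALLOC_HIGH nor ALLOC_HARDER, so
they must clear the full watermark.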
> The zone_watermark_ok check reports that there are not enough pages for the
> current watermark, and eventually it comes down to zone_reclaim_mode, which
> is always zero on s390 as we are not CONFIG_NUMA.
>
> Ok, remember my scenario - several parallel iozone sequential read processes
> - there's not much allocation going on except for the page cache read-ahead
> related to that read workload.
> The allocations for page cache seem to have no special watermarks selected
> via their GFP flags and therefore get stalled by congestion_wait - which,
> with no writes in flight to complete, consumes its full timeout.
>
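For illustration, read-ahead pages come from page_cache_alloc_cold(), which
uses the mapping's GFP mask (typically GFP_HIGHUSER_MOVABLE) - nothing in
there relaxes the watermark check:

	/* include/linux/pagemap.h, 2.6.3x */
	static inline struct page *page_cache_alloc_cold(struct address_space *x)
	{
		/* no __GFP_HIGH here: a plain movable user allocation
		 * that has to pass the normal low watermark */
		return __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD);
	}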
Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense, except that restoring something similar to the old
behaviour didn't help either.
> As I see significant impacts on the iozone throughput, and not only e.g.
> bad counters in direct_reclaim, the congestion_wait stalls seem to happen
> often enough, and last long enough, to stall the application I/O itself.
>
> That means that from the time VFS starts a read-ahead, the page allocation
> stalls long enough that the data is not ready when the application tries to
> read it, while it would be if the allocation were fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.
>
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c 2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c 2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
>
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + /*
> + * If it gets here try it without watermarks before going
> + * to sleep.
> + *
> + * This will end up in alloc_high_priority and if that fails
> + * once more direct_reclaim but this time without watermark
> + * checks.
> + *
> + * FIXME: that is just for verification - a real fix needs to
> + * ensure e.g. page cache allocations don't drain all pages
> + * under watermark
> + */
> + if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> + alloc_flags &= ALLOC_NO_WATERMARKS;
> + else
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
> }
>
> This fixes all the issues I have, but as stated in the FIXME it is
> unfortunately not a fix for the real world.
It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there, but anyway.
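As written, "alloc_flags &= ALLOC_NO_WATERMARKS" masks away every other
flag and, since ALLOC_NO_WATERMARKS was not set, leaves alloc_flags == 0.
Presumably the intended hunk was:

	if (!(alloc_flags & ALLOC_NO_WATERMARKS))
		alloc_flags |= ALLOC_NO_WATERMARKS;	/* set the bit */
	else
		congestion_wait(BLK_RW_ASYNC, HZ/50);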
> Unfortunately, even now, knowing the place of the issue so well, I don't
> see the connection to commits e084b+5f8dcc21
Still a mystery.
> - I couldn't find anything, but
> did they change the accounting somewhere, or e.g. change the timing/order
> of watermark updates and allocations?
>
Not that I can think of.
> Eventually it might come down to a discussion of allocation priorities, and
> we might even keep them as-is and accept this issue - but I would still
> prefer a good second-chance implementation, other page cache allocation
> flags, or something else that explicitly solves this issue.
>
Along those lines, the patch that replaced congestion_wait() with a
waitqueue makes some sense.
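Something like the following - a hypothetical sketch of the idea only,
with a made-up watermark_wait waitqueue, not the actual patch:

	/* instead of sleeping for the full congestion timeout, sleep on
	 * a per-zone waitqueue that the page-free path wakes once the
	 * low watermark is met again */
	wait_event_timeout(zone->watermark_wait,
			   zone_watermark_ok(zone, order,
					     low_wmark_pages(zone),
					     classzone_idx, alloc_flags),
			   HZ/50);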
> Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> becoming available again is definitely a step in the right direction and
> should go into upstream and the long term support branches.
I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.
When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab