linux-kernel - Re: [Bug #14141] order 2 page allocation failures in iwlagn

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20091012134328.GB8200@csn.ul.ie>
Date:	Mon, 12 Oct 2009 14:43:28 +0100
From:	Mel Gorman <mel@....ul.ie>
To:	Frans Pop <elendil@...net.nl>
Cc:	David Rientjes <rientjes@...gle.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Kernel Testers List <kernel-testers@...r.kernel.org>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Reinette Chatre <reinette.chatre@...el.com>,
	Bartlomiej Zolnierkiewicz <bzolnier@...il.com>,
	Karol Lewandowski <karol.k.lewandowski@...il.com>,
	Mohamed Abbas <mohamed.abbas@...el.com>,
	"John W. Linville" <linville@...driver.com>, linux-mm@...ck.org
Subject: Re: [Bug #14141] order 2 page allocation failures in iwlagn

On Mon, Oct 12, 2009 at 01:10:25AM +0200, Frans Pop wrote:
> Sorry for going quiet on this issue for a few days, but I have been 
> spending *a lot* of time on it. I've done what amounts to 5 bisection 
> rounds at ~20 minutes per iteration and in total over 80 boots.
> 
> The problem with my first bisection was that there are *at least two* 
> changes at the root of this issue, both committed between .30 and .30-rc1. 
> Because of this a normal bisection will not lead to a reliable result and 
> even with my last effort I can only narrow it down to two different areas, 
> and not 100% to specific commits.
> 

Thanks very much for your detailed work on this.

> The two identified areas are:
> 1) a wireless merge which causes the SKB errors to appear in the first
>    place, but not always;
> 2) an mm merge which makes the SKB errors occur *much* quicker; IMHO this
>    is the change that also causes the regressions reported by Pekka and
>    Karol.
> 
> So below my results. The issue is both complex and subtle. Now it's up to 
> you, domain experts for both mm *and* wireless/networking, to make sense of 
> it all and come up with suggestions on how to proceed.
> 
> I've improved my test and it's now a lot more reliable, but there are still 
> timing influences.

The timing influences is probably because kswapd is working from the
time memory gets full. High-order allocation failures would cause it to
start reclaiming at that order so it's a race always to see can it do
its work before an atomic allocation fails or not.

> Also, because this is all merge-window stuff, I'm 
> hitting quite a few minor and major regressions between commits that can 
> affect tests.
> 
> Please study the information below carefully. I know it's long, but I think 
> this issue justifies that.
> 

Agreed. I'll be looking at commits, both wireless and mm but obviously
anything I saw about wireless needs to be taken with a generous dose of
salt.

> On Monday 05 October 2009, Frans Pop wrote:
> > This looks conclusive. I tested .30 and .32-rc3 from clean reboots and
> > only starting gitk. I only started music playing in the background
> > (amarok) from an NFS share to ensure network activity.
> >
> > With .32-rc3 I got 4 SKB allocation errors while starting the *second*
> > gitk instance. And the system was completely frozen with music stopped
> > until gitk finished loading.
> 
> With .32-rc3, .31.1 and vanilla .31 I will get multiple SKB allocation 
> errors the *first time* I run the test, *every* time.
> 

So, this remains a current problem that wasn't solved by accident.

> > With .30 I was able to start *three* gitk's (which meant 2 of them got
> > (partially) swapped out) without any allocation errors. And with the
> > system remaining relatively responsive. There was a short break in the
> > music while I started the 2nd instance, but it just continued playing
> > afterwards. There was also some mild latency in the mouse cursor, but
> > nothing like the full desktop freeze I get with .32-rc3.
> 
> With both .30.2 and vanilla .30 I have *never* been able to get any SKB 
> allocation errors. No matter how often I repeat the test.
> 
> So, the start and end position are 100% reproducible. Problem is that this 
> changes during the bisection. At some point the test will fail (no SKB 
> errors) the first time I run it, but it will fail on the second or third 
> attempt.
> Apparently at some point memory must already be fragmented (or higher 
> orders already used up) to some extend for the errors to trigger.
> 

That is a reasonable assessment. It could be because

1. Something in the intevening commits greatly increases the number of
   GFP_ATOMIC allocations that are occuring. It's a pity that the allocator
   tracepoints are not available in those kernels. It would have made
   investigating this theory easier.

2. kswapd is no longer reclaiming high-order pages as well as it used
   to be it due to changes in kswapd itself or lumpy reclaim

3. Fragmentation avoidance has been broken in some subtle manner

I think 3 is particularly unlikely and am expecting it to be 1 or 2.

> TEST METHOD
> -----------
> As a normal bisection (I tried 3 times...) did not lead anywhere, I had to 
> think of an alternative approach. I decided to start by manually selecting 
> merges by Linus into mainline. The advantage is that that makes the 
> bisection linear and makes it a lot easier to see patterns.
> After narrowing down to a specific merge, I bisected (again semi-manually) 
> inside that merge.
> 
> Because I suspected there were multiple changes involved, I deliberately 
> tried to find two points:
> - where do I first start seeing SKB errors at all, even if it is only at
>   the second or third try;
> - where do I start getting SKB errors reliably on the first try.
> 
> I worked from "good" to "bad", i.e. I started at .30. The merges were not 
> chosen completely randomly. From the first 3 bisections I strongly 
> suspected the first 'net-next' merge and the first 'akpm' merge, but I did 
> make sure to confirm that suspicion.
> 

A very good approach.

> TEST DESCRIPTION
> ----------------
> The test I've ended up using is:
> 1) clean boot
> 2) start music in amarok from NFS share; use very long song to avoid file
>    changes and thus ensure a fluent stream of network data during the test
> 3) start 'gitk v2.6.29..master &' - to use up some memory
> 4) start first 'gitk master &' - after this all normal memory is as good as
>    used up, with minor swap; this never resulted in SKB errors
> 5) start second 'gitk master &' - this causes heavy swapping (>700 MB) and
>    is the real test
> 6) if there were no SKB errors after 5), kill the gitk processes and repeat
>    steps 3) to 5). I've done this up to 4 times in some cases
> 7) if the results are not clear or when there is doubt later, repeat from
>    step 1) with same kernel
> 
> Memory after initial 'gitk v2.6.29..master &':
>              total       used       free     shared    buffers     cached
> Mem:       2030776    1153008     877768          0      41572     333968
> -/+ buffers/cache:     777468    1253308
> Swap:      2097144          0    2097144
> 
> Memory after first 'gitk master &':
>              total       used       free     shared    buffers     cached
> Mem:       2030776    1979040      51736          0      35684     238420
> -/+ buffers/cache:    1704936     325840
> Swap:      2097144      21876    2075268
> 
> Memory after second 'gitk master &' (with .30.2):
>              total       used       free     shared    buffers     cached
> Mem:       2030776    2011608      19168          0      21836      92336
> -/+ buffers/cache:    1897436     133340
> Swap:      2097144     776160    1320984
> 
> OVERVIEW OF RESULTS
> -------------------
> Below I list the most relevant merges and commits. Note that they are 
> listed in commit order; my kernel version shows the order of testing.
> 
> For the commits I tested the test results are listed on the next line.
> The first number on that line consists of the test series + the iteration 
> (and also identifies the kernel I used).
> A "+" means I got no SKB errors, a "-" that I did get them. A "|" means I 
> rebooted for a second series of tests.
> 
> v2.6.30-2330-gdb8e7f1	'x86-fixes-for-linus' of linux-2.6-tip
> 	1.1	+++	iwlagn sw-error during first test
> v2.6.30-4127-g0fa2133	'merge' of powerpc (last merge before net-next-2.6)
> 	1.2	+++
> v2.6.30-5398-g2ed0e21	net-next-2.6 (mega-merge!)
> 	1.4	+-	system reboot fails after testing
> v2.6.30-5517-g609106b	'merge' of powerpc
> 	1.3	+-	system reboot fails after testing
> v2.6.30-5927-gf83b1e6	'for-linus' of linux1394-2.6 (last merge before akpm)
> 	2.2	++-
> v2.6.30-6111-g517d086	'akpm'
> 	2.1	-|-
> 
> BISECTION OF net-next-2.6 MERGE
> -------------------------------
> Note that this merge was based not on .30 vanilla, but partly on 
> v2.6.30-rc1 and partly on v2.6.30-rc6.
> I think this had an influence on the latencies I saw (i.e. because some 
> post-rc6 bug fixes were not present it changes the general behavior of the 
> system during the swapping). For example: with v2.6.30-4127-g0fa2133 the 
> system remained more responsive (smaller music skips) than with 
> v2.6.30-rc1-1219-g82d0481.
> 
> I started again by testing merges, this time those by David.
> 
> v2.6.30-rc1-1219-g82d0481	'master' of wireless-next-2.6
> 	1.5	++++	bad latencies

The bad latencies might imply that there are a lot more allocations
going on than there used to be. Maybe it was just because of a wireless
bug though that was later fixed.

> v2.6.30-rc6-660-gbb803cf	'master' of net-2.6
> v2.6.30-rc6-808-g45ea4ea	'master' of wireless-next-2.6
> v2.6.30-rc6-850-gc649c0e	'master' of net-2.6
> v2.6.30-rc6-922-g3f1f39c	'linux-2.6.31.y' of wimax
> v2.6.30-rc6-999-gb2f8f75	'master' of net-2.6
> v2.6.30-rc6-1028-ga8c617e	'net-next' of lksctp-dev
> 	1.7	++++|++++|++++
> 	I went back to this one twice because the bisection inside the
> 	next merge (see below) did not give a clear result.
> v2.6.30-rc6-1103-gb1bc81a	'master' of wireless-next-2.6
> 	1.8	+-
> v2.6.30-rc6-1224-g84503dd	'master' of wireless-next-2.6
> 	1.6	+-
> 
> So the problem started in the v2.6.30-rc6-1103-gb1bc81a merge.
> I was unable to narrow it down to an exact commit; AFAICT the remaining 
> ones (between v2.6.30-rc6-1028-g8fc0fee and v2.6.30-rc6-1032-g7ba10a8) are 
> uninteresting. But it *must* be in this area!
> 
> For a good overview of the area, use 'gitk 3f1f39c4..b1bc81a0'.
> 
> v2.6.30-rc6-1028-g8fc0fee	cfg80211: use key size constants
> 	1.11	++++
> v2.6.30-rc6-1031-g1bb5633	iwmc3200wifi: fix printk format
> 	1.14	+++-	not quite conclusive...
> v2.6.30-rc6-1032-g7ba10a8	mac80211: fix transposed min/max CW values
> 	1.13	-
> 	This is a bugfix for aa837ee1d from an earlier merge! Could this maybe
> 	influence the test results in between? There are various SKB related
> 	changes there, for example: dfbf97f3..e5b9215e.

Maybe. Your commit id's are different to what I see. Maybe it's because your
tree has been shuffled around a bit but after some digging around in this
general area, I saw this patch

4752c93c30 iwlcore: Allow skb allocation from tasklet

This patch increases the number of GFP_ATOMIC allocations that can occur by
allocating GFP_ATOMIC in some cases and GFP_KERNEL in others. Previously,
only GFP_KERNEL was used and I didn't realise this allocation method was
so recent. Problems of this sort have cropped up before and while there
are later changes that suppress some of these warnings, I believe this is
a strong candidate for where the allocation failures started appearing.

> v2.6.30-rc6-1037-g2c5b9e5	wireless: libertas: fix unaligned accesses
> 	1.12	+-
> v2.6.30-rc6-1044-g729e9c7	cfg80211: fix for duplicate userspace replies
> 	1.10	+-
> v2.6.30-rc6-1075-gc587de0	iwlwifi: unify station management
> 	1.9	++-|+-
> v2.6.30-rc6-1076-gd14d444	iwl3945: port allow skb allocation in tasklet
> 	I thought this was a prime candidate, but as you can see several commits
> 	before failed too. Still worth looking at I think!
> 

Your commit IDs are different to what I see but it's the commit merge at
b1bc81a0ef86b86fa410dd303d84c8c7bd09a64d. I agree that the last commit
(d14d44407b9f06e3cf967fcef28ccb780caf0583) could make the problem worse
because it expands the use of GFP_ATOMIC for another driver.

> BISECTION of akpm (mm) MERGE
> ----------------------------
> So here I went looking for "where does the test start failing on the first 
> try". Again, I was unable to narrow it down to a single commit.
> 
> For a good overview of the area, use 'gitk f83b1e61..517d0869'.
> 
> v2.6.30-5466-ga1dd268	mm: use alloc_pages_exact in alloc_large_system_hash
> 	2.3	+-
> v2.6.30-5478-ge9bb35d	mm: setup_per_zone_inactive_ratio - fix comment and..
> 	2.5	+-
> v2.6.30-5486-g35282a2	migration: only migrate_prep() once per move_pages()
> 	2.6	-|+|-	not quite conclusive...
> v2.6.30-5492-gbce7394	page-allocator: reset wmark_min and inactive ratio..
> 	2.4	-|-
> 

While I didn't spot anything too out of the ordinary here, they did occur
shortly after a number of other page allocator related patches.  One small
thing I noticed there is that kswapd is getting woken up less now than it did
previously. Generally, I wouldn't have expected it to make a difference but
it's possible that kswapd is not being woken up to reclaim at a higher order
than it was previously. I have a patch for this below. It'd be nice if you
could apply it and see do fewer allocation failures occur on current mainline.

> WHERE NEXT?
> ===========
> I think the results confirm there is definitely an issue here and that my 
> test is reliable and consistent enough to show it. And as it currently is 
> the only test we have...
> 
> I hope that the info above is enough for the mm and wireless domain 
> experts to identify likely candidates in the areas I've identified.
> 
> The next step could be trying specific reverts or debug patches, either on 
> top of current git, or 2.6.31, or inside the identified areas.
> I'll run anything you care to throw at me and will try to provide any 
> additional info you need, but at this point it's up to you.
> 

For the wireless people in mainline - iwl_rx_replenish_now() is doing
a GFP_ATOMIC allocation that does not use __GFP_NOWARN. As part of
investigating allocation failures, iwl_rx_allocate() was taught to
distinguish between a benign and serious allocation failure - serious
being there are very few RX buffers left and packet loss could occur soon
(see commit f82a924cc88a5541df1d4b9d38a0968cd077a051). I think this GFP mask
should be made GFP_ATOMIC|__GFP_NOWARN so that warnings only appear when the
failure is serious, dump stack after the warning if you need it. I have a
feeling that almost all these warnings have been benign and are related to
the introduction of GFP_ATOMIC being used so heavily to move more expensive
allocations to the tasklet (presumably to reduce user-visible latency).

Frans, could you try the following kswapd-related patch please? I'd be
interested in seeing if the number of allocation failure warnings are
reduced with it. After that, could you edit
drivers/net/wireless/iwlwifi/iwl-rx.c and make the GFP_ATOMIC in
iwl_rx_replenish_now() GFP_ATOMIC|__GFP_NOWARN and see do any of the
"serious" allocation failure messages appear.

Thanks again for your persistence.

==== CUT HERE ====

>From 5296f50ce7ee6b276723ca21fa50d6db3d266075 Mon Sep 17 00:00:00 2001
From: Mel Gorman <mel@....ul.ie>
Date: Mon, 12 Oct 2009 14:21:52 +0100
Subject: [PATCH] page-allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

If a direct reclaim makes no forward progress, it considers whether it should
go OOM or not. Whether OOM is triggered or not, it may retry the application
afterwards. In times past, this would always wake kswapd as well but
currently, kswapd is not woken up after direct reclaim fails. For
order-0 allocations, this makes little difference but if there is a heavy
mix of higher-order allocations that direct reclaim is failing for, it
might mean that kswapd is not reclaiming for higher orders as much as it
did previously.

This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.

Signed-off-by: Mel Gorman <mel@....ul.ie>
--- 
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..dfa4362 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
+restart:
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
-restart:
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/