Date:	Thu, 25 Nov 2010 16:12:38 +0000
From:	Mel Gorman <mel@....ul.ie>
To:	Simon Kirby <sim@...tway.ca>
Cc:	Shaohua Li <shaohua.li@...el.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Dave Hansen <dave@...ux.vnet.ibm.com>
Subject: Re: Free memory never fully used, swapping

On Thu, Nov 25, 2010 at 01:03:28AM -0800, Simon Kirby wrote:
> > > <SNIP>
> > >
> > > This x86_64 box has 4 GB of RAM; zones are set up as follows:
> > > 
> > > [    0.000000] Zone PFN ranges:
> > > [    0.000000]   DMA      0x00000001 -> 0x00001000
> > > [    0.000000]   DMA32    0x00001000 -> 0x00100000
> > > [    0.000000]   Normal   0x00100000 -> 0x00130000
> > > ...
> > > [    0.000000] On node 0 totalpages: 1047279
> > > [    0.000000]   DMA zone: 56 pages used for memmap
> > > [    0.000000]   DMA zone: 0 pages reserved
> > > [    0.000000]   DMA zone: 3943 pages, LIFO batch:0
> > > [    0.000000]   DMA32 zone: 14280 pages used for memmap
> > > [    0.000000]   DMA32 zone: 832392 pages, LIFO batch:31
> > > [    0.000000]   Normal zone: 2688 pages used for memmap
> > > [    0.000000]   Normal zone: 193920 pages, LIFO batch:31
> > > 
> > > So, "Normal" is relatively small, and DMA32 contains most of the RAM.

Ok. A consequence of this is that kswapd balancing a node will still try
to balance Normal even if DMA32 has enough memory. This could account
for some of kswapd being mean.

> > > Watermarks from /proc/zoneinfo are:
> > > 
> > > Node 0, zone      DMA
> > >         min      7
> > >         low      8
> > >         high     10
> > >         protection: (0, 3251, 4009, 4009)
> > > Node 0, zone    DMA32
> > >         min      1640
> > >         low      2050
> > >         high     2460
> > >         protection: (0, 0, 757, 757)
> > > Node 0, zone   Normal
> > >         min      382
> > >         low      477
> > >         high     573
> > >         protection: (0, 0, 0, 0)
> > > 
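[For reference, the low/high marks quoted above are derived from min as min + min/4 and min + min/2 with integer division, which is how setup_per_zone_wmarks() computes them. A quick sketch checking that against the quoted /proc/zoneinfo numbers:]

```python
# low = min + min/4, high = min + min/2 (integer division), as
# setup_per_zone_wmarks() derives them from the zone's min watermark.
# The figures below reproduce the /proc/zoneinfo output quoted above.

def watermarks(min_pages):
    return min_pages + min_pages // 4, min_pages + min_pages // 2

for zone, min_pages in (("DMA", 7), ("DMA32", 1640), ("Normal", 382)):
    low, high = watermarks(min_pages)
    print(f"{zone:7s} min={min_pages:5d} low={low:5d} high={high:5d}")
```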
> > > This box has a couple bnx2 NICs, which do about 60 Mbps each.  Jumbo
> > > frames were disabled for now (to try to stop big order allocations), but
> > > this did not stop atomic allocations of order 3 coming in, as found with:
> > > 
> > > perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph -c 1 -a sleep 10
> > > perf report
> > > 
> > > __alloc_pages_nodemask
> > > alloc_pages_current
> > > new_slab
> > > __slab_alloc
> > > __kmalloc_node_track_caller
> > > __alloc_skb
> > > __netdev_alloc_skb
> > > bnx2_poll_work
> > > 
> > > From my reading of this, it seems like __alloc_skb uses kmalloc(), and
> > > kmalloc uses the kmalloc slab unless (unlikely(size > SLUB_MAX_SIZE)),
> > > where SLUB_MAX_SIZE is 2 * PAGE_SIZE, in which case kmalloc_large is
> > > called which allocates pages directly.  This means that reception of
> > > jumbo frames probably actually results in (consistent) smaller order
> > > allocations!  Anyway, these GFP_ATOMIC allocations don't seem to be
> > > failing, BUT...
> > > 

It's possible to reduce the maximum order that SLUB uses, but let's not
resort to that as a workaround just yet. In case it needs to be
eliminated as a source of problems later, the relevant kernel parameter
is slub_max_order=.

> > > Right after kswapd goes to sleep, we're left with DMA32 with 421k or so
> > > free pages, and Normal with 20k or so free pages (about 1.8 GB free).
> > > 
> > > Immediately, zone Normal starts being used until it reaches about 468
> > > pages free in order 0, nothing else free.  kswapd is not woken here,
> > > but allocations just start coming from zone DMA32 instead. 

kswapd is not woken up because we stay in the allocator fastpath once
that much memory has been freed.

> > > While this
> > > happens, the occasional order=3 allocations coming in via the slab from
> > > __alloc_skb seem to be picking away at the available order=3 chunks.
> > > /proc/buddyinfo shows that there are 10k or so when it starts, so this
> > > succeeds easily.
> > > 
> > > After a minute or so, available order-3 start reaching a lower number,
> > > like 20 or so.  order-4 then starts dropping as it is split into order-3,
> > > until it reaches 20 or so as well.  Then, order-3 hits 0, and kswapd is
> > > woken. 

Allocator slowpath.
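[The order-4-into-order-3 splitting described above can be sketched as follows; this is a simplification of __rmqueue_smallest()/expand(), with migratetypes and the real page structures omitted:]

```python
# Minimal model of buddy splitting: to satisfy an order-k request, take a
# block from the smallest populated order >= k and return the unused
# halves to the free lists below it.

def rmqueue(free_counts, order):
    for o in range(order, len(free_counts)):
        if free_counts[o] > 0:
            free_counts[o] -= 1
            while o > order:          # split down, freeing one buddy per level
                o -= 1
                free_counts[o] += 1
            return True               # got a block of the requested order
    return False                      # nothing to split: allocation fails

# order-3 empty but order-4 populated: the request still succeeds by
# splitting, so no allocation failure is logged at this point.
counts = [455, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]
rmqueue(counts, 3)
```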

> > > When this occurs, there are still a few order-5, order-6, etc.,
> > > available. 

Watermarks are probably not met though.

> > > I presume the GFP_ATOMIC allocation can still split buddies
> > > here, still making order-3 available without sleeping, because there is
> > > no allocation failure message that I can see.
> > > 

Technically it could, but watermark maintenance is important.
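[A rough model of the zone_watermark_ok() check of this era may make that concrete; lowmem_reserve and the alloc-flag adjustments are omitted. With a Normal-like state, order-0 passes the low mark while order-3 fails it, even though order-5/6 blocks still exist:]

```python
# zone_watermark_ok(), simplified: for an order-k request, memory held at
# orders below k does not count towards the mark, and the required mark
# halves at each order step.

def zone_watermark_ok(free_area, order, mark):
    free_pages = sum(n << o for o, n in enumerate(free_area))
    if free_pages <= mark:
        return False
    for o in range(order):
        free_pages -= free_area[o] << o   # too small for this request
        mark >>= 1
        if free_pages <= mark:
            return False
    return True

# Plenty of order-0, a few order-5/6 blocks, low mark 477 (Normal above):
# order 0 is fine, order 3 is not, so kswapd gets woken for order-3 even
# though higher-order blocks could still be split.
state = [455, 0, 0, 0, 0, 3, 1, 0, 0, 0, 0]
print(zone_watermark_ok(state, 0, 477), zone_watermark_ok(state, 3, 477))
```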

> > > Here is a "while true; do sleep 1; grep -v 'DMA ' /proc/buddyinfo; done"
> > > ("DMA" zone is totally untouched, always, so excluded; white space
> > > crushed to avoid wrapping), while it happens:
> > > 
> > > Node 0, zone      DMA      2      1      1      2      1     1 1 0 1 1 3
> > > Node 0, zone    DMA32  25770  29441  14512  10426   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > ...
> > > Node 0, zone    DMA32  23343  29405   6062   6478   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  23187  29358   6047   5960   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  23000  29372   6047   5411   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  22714  29391   6076   4225   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  22354  29459   6059   3178   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  22202  29388   6035   2395   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  21971  29411   6036   1032   1901   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  21514  29388   6019    433   1796   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  21334  29387   6019    240   1464   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  21237  29421   6052    216   1336   123 4 0 0 0 0
> > > Node 0, zone   Normal    455      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  20968  29378   6020    244    751   123 4 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  20741  29383   6022    134    272   123 4 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  20476  29370   6024    117     48   116 4 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  20343  29369   6020    110     23    10 2 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  21592  30477   4856     22     10     4 2 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  24388  33261   1985      6     10     4 2 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  25358  34080   1068      0      4     4 2 0 0 0 0
> > > Node 0, zone   Normal    453      1      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  75985  68954   5345     87      1     4 2 0 0 0 0
> > > Node 0, zone   Normal  18249      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81117  71630  19261    429      3     4 2 0 0 0 0
> > > Node 0, zone   Normal  17908      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81226  71299  21038    569     19     4 2 0 0 0 0
> > > Node 0, zone   Normal  18559      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81347  71278  21068    640     19     4 2 0 0 0 0
> > > Node 0, zone   Normal  17928     21      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81370  71237  21241   1073     29     4 2 0 0 0 0
> > > Node 0, zone   Normal  18187      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81401  71237  21314   1139     29     4 2 0 0 0 0
> > > Node 0, zone   Normal  16978      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81410  71239  21314   1145     29     4 2 0 0 0 0
> > > Node 0, zone   Normal  18156      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81419  71232  21317   1160     30     4 2 0 0 0 0
> > > Node 0, zone   Normal  17536      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81347  71144  21443   1160     31     4 2 0 0 0 0
> > > Node 0, zone   Normal  18483      7      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81300  71059  21556   1178     38     4 2 0 0 0 0
> > > Node 0, zone   Normal  18528      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81315  71042  21577   1180     39     4 2 0 0 0 0
> > > Node 0, zone   Normal  18431      2      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81301  71002  21702   1202     39     4 2 0 0 0 0
> > > Node 0, zone   Normal  18487      5      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81301  70998  21702   1202     39     4 2 0 0 0 0
> > > Node 0, zone   Normal  18311      0      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81296  71025  21711   1208     45     4 2 0 0 0 0
> > > Node 0, zone   Normal  17092      5      0      0      0     0 0 0 0 0 0
> > > Node 0, zone    DMA32  81299  71023  21716   1226     45     4 2 0 0 0 0
> > > Node 0, zone   Normal  18225     12      0      0      0     0 0 0 0 0 0
> > > 
> > > Running a perf record on the kswapd wakeup right when it happens shows:
> > > perf record --event vmscan:mm_vmscan_wakeup_kswapd -a --call-graph -c 1 -a sleep 10
> > > perf trace
> > >          swapper-0     [002] 1323136.979119: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > >          swapper-0     [002] 1323136.979131: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > >             lmtp-20593 [003] 1323136.984066: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > >             lmtp-20593 [003] 1323136.984079: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > >          swapper-0     [001] 1323136.985511: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > >          swapper-0     [001] 1323136.985515: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > >             lmtp-20593 [003] 1323136.985673: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > >             lmtp-20593 [003] 1323136.985675: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > 
> > > This causes kswapd to throw out a bunch of stuff from Normal and from
> > > DMA32, to try to get zone_watermark_ok() to be happy for order=3.

Yep.

> > > However, we have a heavy read load from all of the email stored on SSDs
> > > on this box, and kswapd ends up fighting to try to keep reclaiming the
> > > allocations (mostly order-0).  During the whole day, it never wins -- the
> > > allocations are faster.  At night, it wins after a minute or two.  The
> > > fighting is happening in all of the lines after it awakes above.
> > > 

It's probably fighting to keep *all* zones happy even though it's not strictly
necessary. I suspect it's fighting the most for Normal.

> > > slabs_scanned, kswapd_steal, kswapd_inodesteal (slowly),
> > > kswapd_skip_congestion_wait, and pageoutrun go up in vmstat while kswapd
> > > is running.  With the box up for 15 days, you can see it struggling on
> > > pgscan_kswapd_normal (from /proc/vmstat):
> > > 
> > > pgfree 3329793080
> > > pgactivate 643476431
> > > pgdeactivate 155182710
> > > pgfault 2649106647
> > > pgmajfault 58157157
> > > pgrefill_dma 0
> > > pgrefill_dma32 19688032
> > > pgrefill_normal 7600864
> > > pgrefill_movable 0
> > > pgsteal_dma 0
> > > pgsteal_dma32 465191578
> > > pgsteal_normal 651178518
> > > pgsteal_movable 0
> > > pgscan_kswapd_dma 0
> > > pgscan_kswapd_dma32 768300403
> > > pgscan_kswapd_normal 34614572907
> > > pgscan_kswapd_movable 0
> > > pgscan_direct_dma 0
> > > pgscan_direct_dma32 2853983
> > > pgscan_direct_normal 885799
> > > pgscan_direct_movable 0
> > > pginodesteal 191895
> > > pgrotated 27290463
> > > 
> > > So, here are my questions.
> > > 
> > > Why do we care about order > 0 watermarks at all in the Normal zone?
> > > Wouldn't it make a lot more sense to just make the DMA32 zone the only
> > > one we care about for larger-order allocations?  Or is this required for
> > > the hugepage stuff?
> > > 

It's not required. The logic for kswapd is "balance all zones" and
Normal is one of the zones. Even though you know that DMA32 is just
fine, kswapd doesn't.

> > > The fact that so much stuff is evicted just because order-3 hits 0 is
> > > crazy, especially when larger order pages are still free.  It seems like
> > > we're trying to keep large orders free here.  Why? 

Watermarks. The steady stream of order-3 allocations is telling the
allocator and kswapd that pages of this size must be kept available. It
doesn't know that slub can happily fall back to smaller pages because that
information is lost. Even removing __GFP_WAIT won't help because kswapd
still gets woken up for atomic allocation requests.

> > > Maybe things would be
> > > better if kswapd does not reclaim at all unless the requested order is
> > > empty _and_ all orders above are empty.  This would require hugepage
> > > users to use CONFIG_COMPACT, and have _compaction_ occur the way the
> > > watermark checks work now, but people without CONFIG_HUGETLB_PAGE could
> > > just actually use the memory.  Would this work?
> > > 
> > > There is logic at the end of balance_pgdat() to give up balancing order>0
> > > and just try another loop with order = 0 if sc.nr_reclaimed is <
> > > SWAP_CLUSTER_MAX.  However, when this order=0 pass returns, the caller of
> > > balance_pgdat(), kswapd(), gets true from sleeping_prematurely() and just
> > > calls right back to balance_pgdat() again.  I think this is why this
> > > logic doesn't seem to work here.
> > > 

Ok, this is true. kswapd in balance_pgdat() has given up on the order,
but that information is lost by the time sleeping_prematurely() is called,
so it constantly loops. That is a mistake. balance_pgdat() could return
the order so that sleeping_prematurely() doesn't do the wrong thing.
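[Schematically, with stand-ins for the real kernel functions, the loop looks like this; the broken variant carries a loop guard purely so the sketch terminates:]

```python
# balance_pgdat and sleeping_prematurely here are hypothetical stand-ins
# for the kernel functions of the same names.

def kswapd_broken(balance_pgdat, sleeping_prematurely, order, max_loops=5):
    # kswapd keeps testing with the caller's order even after
    # balance_pgdat() has internally fallen back to order-0.
    loops = 0
    while sleeping_prematurely(order) and loops < max_loops:  # guard: sketch only
        balance_pgdat(order)    # may give up and reclaim at order 0...
        loops += 1              # ...but 'order' never changes, so this spins
    return loops

def kswapd_fixed(balance_pgdat, sleeping_prematurely, order):
    # the fix: propagate the order balance_pgdat() actually reclaimed at
    while sleeping_prematurely(order):
        order = balance_pgdat(order)
```

With a balance_pgdat() that gives up and returns order 0, the fixed loop exits after a single pass; the broken one spins until the guard trips.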

> > > Is my assumption about GFP_ATOMIC order=3 working even when order 3 is
> > > empty, but order>3 is not?  Regardless, shouldn't kswapd be woken before
> > > order 3 is 0 since it may have nothing above order 3 to split from, thus
> > > actually causing an allocation failure?  Does something else do this?
> > 
> > Even if kswapd is woken after order>3 is empty, the issue will still
> > occur, since the order>3 pages will soon be used and kswapd still needs
> > to reclaim some pages. So the issue is that there is a high-order page
> > allocation and lumpy reclaim wrongly reclaims some pages. Maybe you
> > should use slab instead of slub to avoid high-order allocations.
> 
> There are actually a few problems here.  I think they are worth looking
> at them separately, unless "don't use order 3 allocations" is a valid
> statement, in which case we should fix slub.
> 

SLUB can be forced to use smaller orders but I don't think that's the
right fix here.

> The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
> with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
> lower order allocation automatically if it fails.  This is why there is
> no allocation failure warning when this happens.  However, it is too late
> -- kswapd is woken and it tries to bring order 3 up to the watermark. 
> If we hacked __alloc_pages_slowpath() to not wake kswapd when
> __GFP_NOWARN is set, we would never see this problem and the slub
> optimization might still mostly work. 

Yes, but we'd see more high-order atomic allocation (e.g. jumbo frames)
failures as a result so that fix would cause other regressions.
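[The fallback Simon describes works roughly like this; try_alloc_pages is a hypothetical stand-in for alloc_pages() with the slab's gfp flags:]

```python
# allocate_slab()'s silent fallback: try the preferred high order with
# __GFP_NOWARN | __GFP_NORETRY, and on failure retry at the slab's
# minimum order. No warning is printed either way, but by the time the
# fallback runs, the first attempt has already woken kswapd.

def allocate_slab(try_alloc_pages, pref_order, min_order):
    page = try_alloc_pages(pref_order)      # flags | NOWARN | NORETRY
    if page is None and min_order < pref_order:
        page = try_alloc_pages(min_order)   # smaller slab, no failure message
    return page

# An allocator with nothing above order 0 still satisfies the request.
attempts = []
def try_alloc(order):
    attempts.append(order)
    return "page" if order == 0 else None

result = allocate_slab(try_alloc, 3, 0)
```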

> Either way, we should "fix" slub
> or "fix" order-3 allocations, so that other people who are using slub
> don't hit the same problem.
> 
> kswapd is throwing out many times what is needed for the order 3
> watermark to be met.  It seems to be not as bad now, but look at these
> pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> followed by nr_pages_free calculation and final order-3 watermark test,
> kswapd woken after the second sample):
> 
>   Zone order:0      1     2     3    4   5  6 7 8 9 A nr_free or3-low-chk
> 
>  DMA32   20374  35116   975     1    2   5  1 0 0 0 0   94770 257 <= 256
>  DMA32   20480  35211   870     1    1   5  1 0 0 0 0   94630 241 <= 256
> (kswapd wakes, gobble gobble)
>  DMA32   24387  37009  2910   297  100   5  1 0 0 0 0  114245 4193 <= 256
>  DMA32   36169  37787  4676   637  110   5  1 0 0 0 0  137527 7073 <= 256
>  DMA32   63443  40620  5716   982  144   5  1 0 0 0 0  177931 10377 <= 256
>  DMA32   65866  57006  6462  1180  158   5  1 0 0 0 0  217918 12185 <= 256
>  DMA32   67188  66779  9328  1893  208   5  1 0 0 0 0  256754 18689 <= 256
>  DMA32   67909  67356 18307  2268  235   5  1 0 0 0 0  297977 22121 <= 256
>  DMA32   68333  67419 20786  4192  298   7  1 0 0 0 0  324907 38585 <= 256
>  DMA32   69872  68096 21580  5141  326   7  1 0 0 0 0  339016 46625 <= 256
>  DMA32   69959  67970 22339  5657  371  10  1 0 0 0 0  346831 51569 <= 256
>  DMA32   70017  67946 22363  6078  417  11  1 0 0 0 0  351073 55705 <= 256
>  DMA32   70023  67949 22376  6204  439  12  1 0 0 0 0  352529 57097 <= 256
>  DMA32   70045  67937 22380  6262  451  12  1 0 0 0 0  353199 57753 <= 256
>  DMA32   70062  67939 22378  6298  456  12  1 0 0 0 0  353580 58121 <= 256
>  DMA32   70079  67959 22388  6370  458  12  1 0 0 0 0  354285 58729 <= 256
>  DMA32   70079  67959 22388  6387  460  12  1 0 0 0 0  354453 58897 <= 256
>  DMA32   70076  67954 22387  6393  460  12  1 0 0 0 0  354484 58945 <= 256
>  DMA32   70105  67975 22385  6466  468  12  1 0 0 0 0  355259 59657 <= 256
>  DMA32   70110  67972 22387  6466  470  12  1 0 0 0 0  355298 59689 <= 256
>  DMA32   70152  67989 22393  6476  470  12  1 0 0 0 0  355478 59769 <= 256
>  DMA32   70175  67991 22401  6493  471  12  1 0 0 0 0  355689 59921 <= 256
>  DMA32   70175  67991 22401  6493  471  12  1 0 0 0 0  355689 59921 <= 256
>  DMA32   70175  67991 22401  6493  471  12  1 0 0 0 0  355689 59921 <= 256
>  DMA32   70192  67990 22401  6495  471  12  1 0 0 0 0  355720 59937 <= 256
>  DMA32   70192  67988 22401  6496  471  12  1 0 0 0 0  355724 59945 <= 256
>  DMA32   70099  68061 22467  6602  477  12  1 0 0 0 0  356985 60889 <= 256
>  DMA32   70099  68062 22467  6602  477  12  1 0 0 0 0  356987 60889 <= 256
>  DMA32   70099  68062 22467  6602  477  12  1 0 0 0 0  356987 60889 <= 256
>  DMA32   70099  68062 22467  6603  477  12  1 0 0 0 0  356995 60897 <= 256
> (kswapd sleeps)
> 
> Normal zone at the same time (shown separately for clarity):
> 
> Normal     452      1     0     0    0   0  0 0 0 0 0     454 -5 <= 238
> Normal     452      1     0     0    0   0  0 0 0 0 0     454 -5 <= 238
> (kswapd wakes)
> Normal    7618     76     0     0    0   0  0 0 0 0 0    7770 145 <= 238
> Normal    8860     73     1     0    0   0  0 0 0 0 0    9010 143 <= 238
> Normal    8929     25     0     0    0   0  0 0 0 0 0    8979 43 <= 238
> Normal    8917      0     0     0    0   0  0 0 0 0 0    8917 -7 <= 238
> Normal    8978     16     0     0    0   0  0 0 0 0 0    9010 25 <= 238
> Normal    9064      4     0     0    0   0  0 0 0 0 0    9072 1 <= 238
> Normal    9068      2     0     0    0   0  0 0 0 0 0    9072 -3 <= 238
> Normal    8992      9     0     0    0   0  0 0 0 0 0    9010 11 <= 238
> Normal    9060      6     0     0    0   0  0 0 0 0 0    9072 5 <= 238
> Normal    9010      0     0     0    0   0  0 0 0 0 0    9010 -7 <= 238
> Normal    8907      5     0     0    0   0  0 0 0 0 0    8917 3 <= 238
> Normal    8576      0     0     0    0   0  0 0 0 0 0    8576 -7 <= 238
> Normal    8018      0     0     0    0   0  0 0 0 0 0    8018 -7 <= 238
> Normal    6778      0     0     0    0   0  0 0 0 0 0    6778 -7 <= 238
> Normal    6189      0     0     0    0   0  0 0 0 0 0    6189 -7 <= 238
> Normal    6220      0     0     0    0   0  0 0 0 0 0    6220 -7 <= 238
> Normal    6096      0     0     0    0   0  0 0 0 0 0    6096 -7 <= 238
> Normal    6251      0     0     0    0   0  0 0 0 0 0    6251 -7 <= 238
> Normal    6127      0     0     0    0   0  0 0 0 0 0    6127 -7 <= 238
> Normal    6218      1     0     0    0   0  0 0 0 0 0    6220 -5 <= 238
> Normal    6034      0     0     0    0   0  0 0 0 0 0    6034 -7 <= 238
> Normal    6065      0     0     0    0   0  0 0 0 0 0    6065 -7 <= 238
> Normal    6189      0     0     0    0   0  0 0 0 0 0    6189 -7 <= 238
> Normal    6189      0     0     0    0   0  0 0 0 0 0    6189 -7 <= 238
> Normal    6096      0     0     0    0   0  0 0 0 0 0    6096 -7 <= 238
> Normal    6127      0     0     0    0   0  0 0 0 0 0    6127 -7 <= 238
> Normal    6158      0     0     0    0   0  0 0 0 0 0    6158 -7 <= 238
> Normal    6127      0     0     0    0   0  0 0 0 0 0    6127 -7 <= 238
> (kswapd sleeps -- maybe too much turkey)
> 
> DMA32 get so much reclaimed that the watermark test succeeded long ago.
> Meanwhile, Normal is being reclaimed as well, but because it's fighting
> with allocations, it tries for a while and eventually succeeds (I think),
> but the 200ms samples didn't catch it.
> 

So, the key here is kswapd didn't need to balance all zones, any one of
them would have been fine.

> KOSAKI Motohiro, I'm interested in your commit 73ce02e9.  This seems
> to be similar to this problem, but your change is not working here. 

It's not because sleeping_prematurely() interferes with it.

> We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> increasing all the time.  This means that balance_pgdat() keeps being
> called, but sleeping_prematurely() is returning true, so kswapd() just
> keeps re-calling balance_pgdat().  If your approach is correct to stop
> kswapd here, the problem seems to be that balance_pgdat's copy of order
> and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> kswapd never really sleeps.  How is this supposed to work?
> 

It doesn't.

> Our allocation load here is mostly file pages, some anon pages, and
> relatively little slab and anything else.
> 

I think there are at least two fixes required here.

1. sleeping_prematurely() must be aware that balance_pgdat() has dropped
   the order.
2. kswapd is trying to balance all zones for higher orders even though
   it doesn't really have to.

This patch has potential fixes for both of these problems. I have a split-out
series but I'm posting it as a single patch to see if it allows kswapd to
go to sleep as expected for you and whether it stops hammering the Normal
zone unnecessarily. I tested it locally here (albeit with compaction
enabled) and it did reduce the amount of time kswapd spent awake.

==== CUT HERE ====
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..25fe08d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+	enum zone_type high_zoneidx;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {
 
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..344b597 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 	struct zone *zone;
 
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+		wakeup_kswapd(zone, order, high_zoneidx);
 }
 
 static inline int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..00529a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 #endif
 
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
 	int i;
+	bool all_zones_ok = true;
+	bool any_zone_ok = false;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
-	/* If after HZ/10, a zone is below the high mark, it's premature */
+	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
-			return 1;
+			all_zones_ok = false;
+		else
+			any_zone_ok = true;
 	}
 
-	return 0;
+	/*
+	 * For high-order requests, any zone meeting the watermark is enough
+	 *   to allow kswapd to go back to sleep
+	 * For order-0, all zones must be balanced
+	 */
+	if (order)
+		return !any_zone_ok;
+	else
+		return !all_zones_ok;
 }
 
 /*
@@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
+	int any_zone_ok;
 	int priority;
 	int i;
 	unsigned long total_scanned;
@@ -2201,6 +2214,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
+		any_zone_ok = 0;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2310,10 +2324,12 @@ loop_again:
 				 * spectulatively avoid congestion waits
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
+				if (i <= pgdat->high_zoneidx)
+					any_zone_ok = 1;
 			}
 
 		}
-		if (all_zones_ok)
+		if (all_zones_ok || (order && any_zone_ok))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2336,7 +2352,7 @@ loop_again:
 			break;
 	}
 out:
-	if (!all_zones_ok) {
+	if (!(all_zones_ok || (order && any_zone_ok))) {
 		cond_resched();
 
 		try_to_freeze();
@@ -2361,7 +2377,13 @@ out:
 		goto loop_again;
 	}
 
-	return sc.nr_reclaimed;
+	/*
+	 * Return the order we were reclaiming at so sleeping_prematurely()
+	 * makes a decision on the order we were last reclaiming at. However,
+	 * if another caller entered the allocator slow path while kswapd
+	 * was awake, order will remain at the higher level
+	 */
+	return order;
 }
 
 /*
@@ -2417,6 +2439,7 @@ static int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
 		pgdat->kswapd_max_order = 0;
+		pgdat->high_zoneidx = MAX_NR_ZONES - 1;
 		if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
@@ -2464,7 +2487,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order);
+			order = balance_pgdat(pgdat, order);
 		}
 	}
 	return 0;
@@ -2473,7 +2496,7 @@ static int kswapd(void *p)
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
 {
 	pg_data_t *pgdat;
 
@@ -2483,8 +2506,10 @@ void wakeup_kswapd(struct zone *zone, int order)
 	pgdat = zone->zone_pgdat;
 	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
-	if (pgdat->kswapd_max_order < order)
+	if (pgdat->kswapd_max_order < order) {
 		pgdat->kswapd_max_order = order;
+		pgdat->high_zoneidx = min(pgdat->high_zoneidx, high_zoneidx);
+	}
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
