Date:	Mon, 2 May 2011 18:29:45 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Minchan Kim <minchan.kim@...il.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mel@...ux.vnet.ibm.com>,
	Dave Young <hidave.darkstar@...il.com>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Christoph Lameter <cl@...ux.com>,
	Dave Chinner <david@...morbit.com>,
	David Rientjes <rientjes@...gle.com>,
	Li Shaohua <shaohua.li@...el.com>,
	Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation
 failures

Hi Minchan,

On Mon, May 02, 2011 at 12:35:42AM +0800, Minchan Kim wrote:
> Hi Wu,
> 
> On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
> > On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > > > Test results:
> > > >
> > > > - the failure rate is pretty sensitive to the page reclaim size,
> > > >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> > > >
> > > > - the IPIs are reduced by over 100 times
> > >
> > > It's reduced by 500 times indeed.
> > >
> > > CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
> > > CAL:         93        463        410        540        298        282        272        306   Function call interrupts
> > >
> > > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > > > -------------------------------------------------------------------------------
> > > > nr_alloc_fail 10496
> > > > allocstall 1576602
> > >
> > > > patched (WMARK_MIN)
> > > > -------------------
> > > > nr_alloc_fail 704
> > > > allocstall 105551
> > >
> > > > patched (WMARK_HIGH)
> > > > --------------------
> > > > nr_alloc_fail 282
> > > > allocstall 53860
> > >
> > > > this patch (WMARK_HIGH, limited scan)
> > > > -------------------------------------
> > > > nr_alloc_fail 276
> > > > allocstall 54034
> > >
> > > There is a bad side effect though: the much reduced "allocstall" means
> > > each direct reclaim will take much more time to complete. A simple solution
> > > is to terminate direct reclaim after 10ms. I noticed that a 100ms
> > > time threshold can reduce the reclaim latency from 621ms to 358ms.
> > > Further lowering the time threshold to 20ms does not help reduce the
> > > real latencies, though.
> >
> > Experiments going on...
> >
> > I tried a more reasonable termination condition: stop direct reclaim
> > when the preferred zone is above its high watermark (see the chunk below).
> >
> > This helps reduce the average reclaim latency to under 100ms in the
> > 1000-dd case.
> >
> > However, nr_alloc_fail is around 5000, which is not ideal. The interesting
> > thing is, even when the zone watermark is high, the task may still fail to
> > get a free page.
> >
> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> >                         }
> >                 }
> >                 total_scanned += sc->nr_scanned;
> > -               if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > -                       goto out;
> > +               if (sc->nr_reclaimed >= min_reclaim) {
> > +                       if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > +                               goto out;
> > +                       if (total_scanned > 2 * sc->nr_to_reclaim)
> > +                               goto out;
> > +                       if (preferred_zone &&
> > +                           zone_watermark_ok_safe(preferred_zone, sc->order,
> > +                                       high_wmark_pages(preferred_zone),
> > +                                       zone_idx(preferred_zone), 0))
> > +                               goto out;
> > +               }
> >
> >                 /*
> >                  * Try to write back as many pages as we just scanned.  This
> >
> > Thanks,
> > Fengguang
> > ---
> > Subject: mm: cut down __GFP_NORETRY page allocation failures
> > Date: Thu Apr 28 13:46:39 CST 2011
> >
> > Concurrent page allocations are suffering from high failure rates.
> >
> > On an 8p, 3GB RAM test box, when reading 1000 sparse files of size 1GB,
> > the page allocation failures are
> >
> > nr_alloc_fail 733     # interleaved reads by 1 single task
> > nr_alloc_fail 11799   # concurrent reads by 1000 tasks
> >
> > The concurrent read test script is:
> >
> >       for i in `seq 1000`
> >       do
> >               truncate -s 1G /fs/sparse-$i
> >               dd if=/fs/sparse-$i of=/dev/null &
> >       done
> >
> > In order for get_page_from_freelist() to get a free page,
> >
> > (1) try_to_free_pages() should use a much higher .nr_to_reclaim than the
> >     current SWAP_CLUSTER_MAX=32, in order to pull the zone out of a
> >     possible low-watermark state as well as fill the pcp list with enough
> >     free pages to overflow its high watermark.
> >
> > (2) the get_page_from_freelist() _after_ direct reclaim should use a lower
> >     watermark than its normal invocations, so that it can reasonably
> >     "reserve" some free pages for itself and prevent other concurrent
> >     page allocators from stealing all its reclaimed pages.
> 
> Did you see my old patch? The patch wasn't complete, but it's not bad for showing the idea.
> http://marc.info/?l=linux-mm&m=129187231129887&w=4
> The idea is to keep at least one page for the direct-reclaiming process.
> Could it mitigate your problem, or could you enhance the idea?
> I think it's a very simple and fair solution.

No, it doesn't help my problem; nr_alloc_fail and CAL are still high:

root@fat /home/wfg# ./test-dd-sparse.sh
start time: 246
total time: 531
nr_alloc_fail 14097
allocstall 1578332
LOC:     542698     538947     536986     567118     552114     539605     541201     537623   Local timer interrupts
RES:       3368       1908       1474       1476       2809       1602       1500       1509   Rescheduling interrupts
CAL:     223844     224198     224268     224436     223952     224056     223700     223743   Function call interrupts
TLB:        381         27         22         19         96        404        111         67   TLB shootdowns

root@fat /home/wfg# getdelays -dip `pidof dd`
print delayacct stats ON
printing IO accounting
PID     5202


CPU             count     real total  virtual total    delay total
                 1132     3635447328     3627947550   276722091605
IO              count    delay total  delay average
                    2      187809974             62ms
SWAP            count    delay total  delay average
                    0              0              0ms
RECLAIM         count    delay total  delay average
                 1334    35304580824             26ms
dd: read=278528, write=0, cancelled_write=0

I guess your patch mainly fixes high-order allocations, while my
workload is mainly order-0 readahead page allocations. There are
1000 forks; however, the "start time: 246" seems to indicate that the
order-1 reclaim latency is not improved.

I'll try modifying your patch and see how it works out. The obvious
change is to apply it to the order-0 case. Hopefully this won't create
many more isolated pages.

Attached is your patch rebased to 2.6.39-rc3, after resolving some
merge conflicts and fixing a trivial NULL pointer bug.

> >
> > Some notes:
> >
> > - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
> >   reclaim allocation fails") has the same goal, however it is obviously
> >   costly and less effective. It seems cleaner to just remove the
> >   retry and drain code than to retain it.
> 
> Tend to agree.
> My old patch can solve it, I think.

Sadly nope. See above.

> >
> > - it's a bit hacky to reclaim more than the requested pages inside
> >   do_try_to_free_pages(), and it won't help cgroups for now
> >
> > - it only aims to reduce failures when there are plenty of reclaimable
> >   pages, so it stops the opportunistic reclaim once it has scanned twice
> >   the requested number of pages
> >
> > Test results:
> >
> > - the failure rate is pretty sensitive to the page reclaim size,
> >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> >
> > - the IPIs are reduced by over 100 times
> >
> > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > -------------------------------------------------------------------------------
> > nr_alloc_fail 10496
> > allocstall 1576602
> >
> > slabs_scanned 21632
> > kswapd_steal 4393382
> > kswapd_inodesteal 124
> > kswapd_low_wmark_hit_quickly 885
> > kswapd_high_wmark_hit_quickly 2321
> > kswapd_skip_congestion_wait 0
> > pageoutrun 29426
> >
> > CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
> >
> > LOC:     536274     532529     531734     536801     536510     533676     534853     532038   Local timer interrupts
> > RES:       3032       2128       1792       1765       2184       1703       1754       1865   Rescheduling interrupts
> > TLB:        189         15         13         17         64        294         97         63   TLB shootdowns
> >
> > patched (WMARK_MIN)
> > -------------------
> > nr_alloc_fail 704
> > allocstall 105551
> >
> > slabs_scanned 33280
> > kswapd_steal 4525537
> > kswapd_inodesteal 187
> > kswapd_low_wmark_hit_quickly 4980
> > kswapd_high_wmark_hit_quickly 2573
> > kswapd_skip_congestion_wait 0
> > pageoutrun 35429
> >
> > CAL:         93        286        396        754        272        297        275        281   Function call interrupts
> >
> > LOC:     520550     517751     517043     522016     520302     518479     519329     517179   Local timer interrupts
> > RES:       2131       1371       1376       1269       1390       1181       1409       1280   Rescheduling interrupts
> > TLB:        280         26         27         30         65        305        134         75   TLB shootdowns
> >
> > patched (WMARK_HIGH)
> > --------------------
> > nr_alloc_fail 282
> > allocstall 53860
> >
> > slabs_scanned 23936
> > kswapd_steal 4561178
> > kswapd_inodesteal 0
> > kswapd_low_wmark_hit_quickly 2760
> > kswapd_high_wmark_hit_quickly 1748
> > kswapd_skip_congestion_wait 0
> > pageoutrun 32639
> >
> > CAL:         93        463        410        540        298        282        272        306   Function call interrupts
> >
> > LOC:     513956     510749     509890     514897     514300     512392     512825     510574   Local timer interrupts
> > RES:       1174       2081       1411       1320       1742       2683       1380       1230   Rescheduling interrupts
> > TLB:        274         21         19         22         57        317        131         61   TLB shootdowns
> >
> > patched (WMARK_HIGH, limited scan)
> > ----------------------------------
> > nr_alloc_fail 276
> > allocstall 54034
> >
> > slabs_scanned 24320
> > kswapd_steal 4507482
> > kswapd_inodesteal 262
> > kswapd_low_wmark_hit_quickly 2638
> > kswapd_high_wmark_hit_quickly 1710
> > kswapd_skip_congestion_wait 0
> > pageoutrun 32182
> >
> > CAL:         69        443        421        567        273        279        269        334   Function call interrupts
> 
> Looks amazing.

Yeah, I have strong feelings against drain_all_pages() in the direct
reclaim path. The intuition is that once drain_all_pages() is called,
later direct reclaims have less chance to refill the drained per-cpu
lists and are therefore forced into drain_all_pages() again and again.
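
For reference, this is roughly the retry/drain pattern that commit
9ee493ce added to __alloc_pages_direct_reclaim() (paraphrased and
simplified from 2.6.39, not the exact code):

	/* direct reclaim has just run */
	*did_some_progress = try_to_free_pages(zonelist, order,
					       gfp_mask, nodemask);
	if (unlikely(!*did_some_progress))
		return NULL;
retry:
	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
				      high_zoneidx, alloc_flags,
				      preferred_zone, migratetype);
	if (!page && !drained) {
		/*
		 * The reclaimed pages may be sitting on other CPUs' pcp
		 * lists, so IPI every CPU to flush them and retry once.
		 */
		drain_all_pages();
		drained = true;
		goto retry;
	}

Under 1000 concurrent allocators the retried get_page_from_freelist()
can still lose the race, so many tasks end up issuing drain_all_pages()
IPIs at about the same time, which is presumably why CAL stays so high.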

drain_all_pages() is probably overkill for preventing OOM.
Generally speaking, it's questionable to "squeeze out the last page
before OOM".

A typical desktop enters thrashing storms before OOM; as Hugh pointed
out, this may well not be what end users want. I agree with him and
personally prefer some applications to be OOM killed rather than having
the whole system become unusable, thrashing like mad.

> > LOC:     514736     511698     510993     514069     514185     512986     513838     511229   Local timer interrupts
> > RES:       2153       1556       1126       1351       3047       1554       1131       1560   Rescheduling interrupts
> > TLB:        209         26         20         15         71        315        117         71   TLB shootdowns
> >
> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
> > ----------------------------------------------------------------
> >
> > start time: 3
> > total time: 50
> > nr_alloc_fail 162
> > allocstall 45523
> >
> > CPU             count     real total  virtual total    delay total
> >                   921     3024540200     3009244668    37123129525
> > IO              count    delay total  delay average
> >                     0              0              0ms
> > SWAP            count    delay total  delay average
> >                     0              0              0ms
> > RECLAIM         count    delay total  delay average
> >                   357     4891766796             13ms
> > dd: read=0, write=0, cancelled_write=0
> >
> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
> > -----------------------------------------------------------------
> >
> > start time: 272
> > total time: 509
> > nr_alloc_fail 3913
> > allocstall 541789
> >
> > CPU             count     real total  virtual total    delay total
> >                  1044     3445476208     3437200482   229919915202
> > IO              count    delay total  delay average
> >                     0              0              0ms
> > SWAP            count    delay total  delay average
> >                     0              0              0ms
> > RECLAIM         count    delay total  delay average
> >                   452    34691441605             76ms
> > dd: read=0, write=0, cancelled_write=0
> >
> > patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
> > --------------------------------------------------------------------------------
> >
> > start time: 278
> > total time: 513
> > nr_alloc_fail 4737
> > allocstall 436392
> >
> >
> > CPU             count     real total  virtual total    delay total
> >                  1024     3371487456     3359441487   225088210977
> > IO              count    delay total  delay average
> >                     1      160631171            160ms
> > SWAP            count    delay total  delay average
> >                     0              0              0ms
> > RECLAIM         count    delay total  delay average
> >                   367    30809994722             83ms
> > dd: read=20480, write=0, cancelled_write=0
> >
> >
> > no cond_resched():
> 
> What's this?

I tried a modified patch that also removes the cond_resched() call in
__alloc_pages_direct_reclaim(), between try_to_free_pages() and
get_page_from_freelist(). It does not seem to help noticeably.

It looks safe to remove that cond_resched(), as we already have such
calls in shrink_page_list().
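
For context, the call in question sits here (again paraphrased from
2.6.39's __alloc_pages_direct_reclaim(), abridged):

	current->flags |= PF_MEMALLOC;
	*did_some_progress = try_to_free_pages(zonelist, order,
					       gfp_mask, nodemask);
	current->flags &= ~PF_MEMALLOC;

	cond_resched();		/* <-- the call removed in this experiment */

	if (unlikely(!*did_some_progress))
		return NULL;

	page = get_page_from_freelist(...);

The concern is that rescheduling right here gives concurrent tasks a
window to steal the freshly reclaimed pages before our own
get_page_from_freelist() runs; since shrink_page_list() already calls
cond_resched() while scanning, removing this one should not hurt latency.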

> >
> > start time: 263
> > total time: 516
> > nr_alloc_fail 5144
> > allocstall 436787
> >
> > CPU             count     real total  virtual total    delay total
> >                  1018     3305497488     3283831119   241982934044
> > IO              count    delay total  delay average
> >                     0              0              0ms
> > SWAP            count    delay total  delay average
> >                     0              0              0ms
> > RECLAIM         count    delay total  delay average
> >                   328    31398481378             95ms
> > dd: read=0, write=0, cancelled_write=0
> >
> > zone_watermark_ok_safe():
> >
> > start time: 266
> > total time: 513
> > nr_alloc_fail 4526
> > allocstall 440246
> >
> > CPU             count     real total  virtual total    delay total
> >                  1119     3640446568     3619184439   240945024724
> > IO              count    delay total  delay average
> >                     3      303620082            101ms
> > SWAP            count    delay total  delay average
> >                     0              0              0ms
> > RECLAIM         count    delay total  delay average
> >                   372    27320731898             73ms
> > dd: read=77824, write=0, cancelled_write=0
> >

> > start time: 275
> 
> What's the meaning of "start time"?

It's the time taken to start 1000 dd's.
 
> > total time: 517
> 
> Is "total time" the elapsed time of your experiment?

Yeah. They are generated with this script.

$ cat ~/bin/test-dd-sparse.sh
 
#!/bin/sh

mount /dev/sda7 /fs

tic=$(date +'%s')

for i in `seq 1000`
do
	truncate -s 1G /fs/sparse-$i
	dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
done

tac=$(date +'%s')
echo start time: $((tac-tic))

wait

tac=$(date +'%s')
echo total time: $((tac-tic))

egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
egrep '(CAL|RES|LOC|TLB)' /proc/interrupts

> > nr_alloc_fail 4694
> > allocstall 431021
> >
> >
> > CPU             count     real total  virtual total    delay total
> >                  1073     3534462680     3512544928   234056498221
> 
> What's the meaning of the CPU fields?

It's "waiting for a CPU (while being runnable)" as described in
Documentation/accounting/delay-accounting.txt.
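
For the other rows, the "delay average" column is just delay total /
count; e.g. in the getdelays output above, 35304580824 ns over 1334
RECLAIM events is about 26ms of direct reclaim delay per event.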

> > IO              count    delay total  delay average
> >                     0              0              0ms
> > SWAP            count    delay total  delay average
> >                     0              0              0ms
> > RECLAIM         count    delay total  delay average
> >                   386    34751778363             89ms
> > dd: read=0, write=0, cancelled_write=0
> >
> 
> Where is the vanilla data for comparing latency?
> Personally, it's hard to parse your data.

Sorry, it's somewhat too much data and too many kernel revisions. The
base kernel's average reclaim latency is 29ms:

base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
-------------------------------------------------------------------------------

CPU             count     real total  virtual total    delay total
                 1122     3676441096     3656793547   274182127286
IO              count    delay total  delay average
                    3      291765493             97ms
SWAP            count    delay total  delay average
                    0              0              0ms
RECLAIM         count    delay total  delay average
                 1350    39229752193             29ms
dd: read=45056, write=0, cancelled_write=0

start time: 245
total time: 526
nr_alloc_fail 14586
allocstall 1578343
LOC:     533981     529210     528283     532346     533392     531314     531705     528983   Local timer interrupts
RES:       3123       2177       1676       1580       2157       1974       1606       1696   Rescheduling interrupts
CAL:     218392     218631     219167     219217     218840     218985     218429     218440   Function call interrupts
TLB:        175         13         21         18         62        309        119         42   TLB shootdowns

> 
> > CC: Mel Gorman <mel@...ux.vnet.ibm.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@...el.com>
> > ---
> >  fs/buffer.c          |    4 ++--
> >  include/linux/swap.h |    3 ++-
> >  mm/page_alloc.c      |   20 +++++---------------
> >  mm/vmscan.c          |   31 +++++++++++++++++++++++--------
> >  4 files changed, 32 insertions(+), 26 deletions(-)
> > --- linux-next.orig/mm/vmscan.c       2011-04-29 10:42:14.000000000 +0800
> > +++ linux-next/mm/vmscan.c    2011-04-30 21:59:33.000000000 +0800
> > @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
> >   * returns:  0, if no pages reclaimed
> >   *           else, the number of pages reclaimed
> >   */
> > -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> > -                                     struct scan_control *sc)
> > +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
> > +                                       struct zonelist *zonelist,
> > +                                       struct scan_control *sc)
> >  {
> >       int priority;
> >       unsigned long total_scanned = 0;
> > @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
> >       struct zoneref *z;
> >       struct zone *zone;
> >       unsigned long writeback_threshold;
> > +     unsigned long min_reclaim = sc->nr_to_reclaim;
> 
> Hmm,
> 
> >
> >       get_mems_allowed();
> >       delayacct_freepages_start();
> > @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
> >       if (scanning_global_lru(sc))
> >               count_vm_event(ALLOCSTALL);
> >
> > +     if (preferred_zone)
> > +             sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
> > +
> 
> Hmm, I don't like this idea.
> The goal of the direct reclaim path is to reclaim pages asap, I believe.
> Many things should be achieved by background kswapd.
> If the admin changes min_free_kbytes, it can affect the latency of direct reclaim.
> That doesn't make sense to me.

Yeah, it does increase delays: in the 1000-dd case, roughly from 30ms
to 90ms. This is a major drawback.

> >       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> >               sc->nr_scanned = 0;
> >               if (!priority)
> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
> >                       }
> >               }
> >               total_scanned += sc->nr_scanned;
> > -             if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > -                     goto out;
> > +             if (sc->nr_reclaimed >= min_reclaim) {
> > +                     if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> > +                             goto out;
> 
> I can't understand the logic.
> If nr_reclaimed is bigger than min_reclaim, it's always greater than
> nr_to_reclaim. What's the meaning of min_reclaim?

In direct reclaim, min_reclaim will be the legacy SWAP_CLUSTER_MAX,
while sc->nr_to_reclaim is raised by the preferred zone's high watermark
and acts as a kind of "max to reclaim".

> 
> > +                     if (total_scanned > 2 * sc->nr_to_reclaim)
> > +                             goto out;
> 
> What if there are lots of dirty pages in the LRU?
> What if there are lots of unevictable pages in the LRU?
> What if there are lots of mapped pages in the LRU but may_unmap = 0?
> I mean, it's a rather risky early conclusion.

That test is meant to avoid scanning too much on __GFP_NORETRY direct
reclaims. My assumption for __GFP_NORETRY is that it should fail fast
when the LRU pages seem hard to reclaim. And the problem in the 1000-dd
case is that the LRU pages are all easy to reclaim, yet __GFP_NORETRY
still fails from time to time, with lots of IPIs that may hurt large
machines a lot.
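
For readers not familiar with the flag: the fail-fast behaviour comes
from the allocation slowpath refusing to loop for __GFP_NORETRY, roughly
(a sketch, not the exact 2.6.39 code):

	page = __alloc_pages_direct_reclaim(gfp_mask, order, ...);
	if (page)
		goto got_pg;
	...
	/* should_alloc_retry() returns 0 for __GFP_NORETRY */
	if (gfp_mask & __GFP_NORETRY)
		goto nopage;	/* fail instead of retrying reclaim */

so a single unlucky direct reclaim pass is enough to produce an
allocation failure even though plenty of reclaimable pages remain.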

> 
> > +                     if (preferred_zone &&
> > +                         zone_watermark_ok_safe(preferred_zone, sc->order,
> > +                                     high_wmark_pages(preferred_zone),
> > +                                     zone_idx(preferred_zone), 0))
> > +                             goto out;
> > +             }
> 
> As I said, I think the direct reclaim path should be fast if possible,
> and it should not be a function of min_free_kbytes.

Right.

> Of course, there are lots of things to tackle to keep the direct reclaim
> path's latency consistent, but at least I don't want to add another source.

OK.

Thanks,
Fengguang

[Attachment: mm-keep-freed-pages-in-direct-reclaim.patch (text/x-diff, 6105 bytes)]
