Message-Id: <20100916152804.1b4155fd.akpm@linux-foundation.org>
Date: Thu, 16 Sep 2010 15:28:04 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Mel Gorman <mel@....ul.ie>
Cc: linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
Linux Kernel List <linux-kernel@...r.kernel.org>,
Johannes Weiner <hannes@...xchg.org>,
Minchan Kim <minchan.kim@...il.com>,
Wu Fengguang <fengguang.wu@...el.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Subject: Re: [PATCH 0/8] Reduce latencies and improve overall reclaim
efficiency v2
On Wed, 15 Sep 2010 13:27:43 +0100
Mel Gorman <mel@....ul.ie> wrote:
> This is v2 of a series to reduce some of the latencies seen in page reclaim
> and to improve the efficiency a bit.
epic changelog!
>
> ...
>
> The tests run were as follows
>
> kernbench
> compile-based benchmark. Smoke test performance
>
> sysbench
> OLTP read-only benchmark. Will be re-run in the future as read-write
>
> micro-mapped-file-stream
> This is a micro-benchmark from Johannes Weiner that accesses a
> large sparse file through mmap(). It was configured to run in only
> single-CPU mode but can be indicative of how well page reclaim
> identifies suitable pages (a rough sketch of this kind of workload
> follows the test list).
>
> stress-highalloc
> Tries to allocate huge pages under heavy load.
>
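To give an idea of the mmap workload mentioned above, a rough, hypothetical
sketch (this is not Johannes' actual benchmark; the file name and size below
are made up) would be along these lines:

  #!/usr/bin/env python3
  # Stream sequentially through a large sparse file via mmap(), touching one
  # byte per page so the page cache fills with clean, easily-reclaimable pages.
  import mmap
  import os

  FILE = "sparse.dat"                # hypothetical file name
  SIZE = 8 * 1024 * 1024 * 1024      # hypothetical size; the point is "bigger than RAM"
  PAGE = os.sysconf("SC_PAGE_SIZE")

  with open(FILE, "wb") as f:
      f.truncate(SIZE)               # sparse: no disk blocks allocated yet

  with open(FILE, "rb") as f:
      m = mmap.mmap(f.fileno(), SIZE, prot=mmap.PROT_READ)
      checksum = 0
      for offset in range(0, SIZE, PAGE):
          checksum += m[offset]      # fault in each page in order
      m.close()

  os.unlink(FILE)
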
> kernbench, iozone and sysbench did not report any performance regression
> on any machine. sysbench did pressure the system lightly and there was reclaim
> activity but there were no differences of major interest between the kernels.
>
> X86-64 micro-mapped-file-stream
>
> traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
> pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
> pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
> pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
> pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
> pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
> pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
> pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
> pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
> allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
>
> These are based on the raw figures taken from /proc/vmstat. It's a rough
> measure of reclaim activity. Note that allocstall counts are higher because
> we are entering direct reclaim more often as a result of not sleeping in
> congestion. In itself, it's not necessarily a bad thing. It's easier to
> get a view of what happened from the vmscan tracepoint report.
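As an aside, the raw figures above come straight from /proc/vmstat, so a
before/after snapshot is enough to reproduce this kind of summary. A minimal
sketch (the counter names are real /proc/vmstat fields; the zone suffix and
the choice of fields are illustrative):

  #!/usr/bin/env python3
  # Diff two /proc/vmstat snapshots and print the reclaim counters used above,
  # plus a rough reclaimed/scanned ("reclaim efficiency") percentage.

  def read_vmstat(path="/proc/vmstat"):
      with open(path) as f:
          return {name: int(value) for name, value in (line.split() for line in f)}

  def report(before, after):
      fields = ["pgsteal_dma32", "pgscan_kswapd_dma32",
                "pgscan_direct_dma32", "pageoutrun", "allocstall"]
      delta = {name: after[name] - before[name] for name in fields}
      for name, value in delta.items():
          print("%-22s %12d" % (name, value))
      scanned = delta["pgscan_kswapd_dma32"] + delta["pgscan_direct_dma32"]
      if scanned:
          print("reclaimed/scanned      %11.2f%%" %
                (100.0 * delta["pgsteal_dma32"] / scanned))

  # usage: before = read_vmstat(); <run the test>; report(before, read_vmstat())
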
>
> FTrace Reclaim Statistics: vmscan
>
> traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims 443 273 513 1568
> Direct reclaim pages scanned 305968 280402 600825 957933
> Direct reclaim pages reclaimed 43503 19005 30327 117191
> Direct reclaim write file async I/O 0 0 0 0
> Direct reclaim write anon async I/O 0 3 4 12
> Direct reclaim write file sync I/O 0 0 0 0
> Direct reclaim write anon sync I/O 0 0 0 0
> Wake kswapd requests 187649 132338 191695 267701
> Kswapd wakeups 3 1 4 1
> Kswapd pages scanned 4599269 4454162 4296815 3891906
> Kswapd pages reclaimed 2295947 2428434 2399818 2319706
> Kswapd reclaim write file async I/O 1 0 1 1
> Kswapd reclaim write anon async I/O 59 187 41 222
> Kswapd reclaim write file sync I/O 0 0 0 0
> Kswapd reclaim write anon sync I/O 0 0 0 0
> Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
> Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
>
> Total pages scanned 4905237 4734564 4897640 4849839
> Total pages reclaimed 2339450 2447439 2430145 2436897
> %age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
> %age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
> %age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
> Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
> Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
>
> What is interesting here for nocongest in particular is that while direct
> reclaim scans more pages, the overall number of pages scanned remains roughly
> the same and the ratio of pages scanned to pages reclaimed is more or less
> unchanged. In other words, while we are sleeping less, reclaim is not doing
> more work, and as direct reclaim stalls for less time and kswapd is awake for
> less time, it would appear to be doing less work overall.
Yes, I think the reclaimed/scanned ratio (what I call "reclaim
efficiency") is a key metric.
50% is low! What's the testcase here? micro-mapped-file-stream?
It's strange that the "total pages reclaimed" increased a little. Just
a measurement glitch?
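For what it's worth, those figures come straight from the totals above:
2339450 reclaimed / 4905237 scanned ~= 47.7% for the vanilla kernel versus
2436897 / 4849839 ~= 50.2% for the full series, i.e. roughly two pages
scanned per page reclaimed either way, and the increase in total pages
reclaimed is only about 4% (2339450 -> 2436897).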
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited 87 196 64 0
> Direct time congest waited 4604ms 4732ms 5420ms 0ms
> Direct full congest waited 72 145 53 0
> Direct number conditional waited 0 0 324 1315
> Direct time conditional waited 0ms 0ms 0ms 0ms
> Direct full conditional waited 0 0 0 0
> KSwapd number congest waited 20 10 15 7
> KSwapd time congest waited 1264ms 536ms 884ms 284ms
> KSwapd full congest waited 10 4 6 2
> KSwapd number conditional waited 0 0 0 0
> KSwapd time conditional waited 0ms 0ms 0ms 0ms
> KSwapd full conditional waited 0 0 0 0
>
> The vanilla kernel spent 8 seconds asleep in direct reclaim; with the patches
> applied, no time at all was spent asleep.
>
> MMTests Statistics: duration
>               traceonly-v2r2  lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
> User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
> Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Is that user time plus system time? If so, why didn't user+sys equal
elapsed in the we-never-slept-in-congestion-wait() case? Because the
test's CPU got stolen by kswapd perhaps?
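For concreteness, and assuming the User/Sys row is the test's own CPU time,
the gaps in the table are about 3.7 seconds for the vanilla kernel (14.19
elapsed vs 10.51 user+sys) and still about 1.1 seconds for the full series
(12.76 vs 11.66).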
> Overall, the tests completed faster. It is interesting to note that backing
> off further when a zone is congested, and not just a BDI, was more efficient
> overall.
>
> PPC64 micro-mapped-file-stream
>               traceonly-v2r2  lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
> pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
> pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
> pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
> pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
> pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
> allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
>
> ...
>
>
> X86-64 STRESS-HIGHALLOC
> traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
> Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
> At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
>
> Success figures across the board are broadly similar.
>
> traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims 1045 944 886 887
> Direct reclaim pages scanned 135091 119604 109382 101019
> Direct reclaim pages reclaimed 88599 47535 47863 46671
> Direct reclaim write file async I/O 494 283 465 280
> Direct reclaim write anon async I/O 29357 13710 16656 13462
> Direct reclaim write file sync I/O 154 2 2 3
> Direct reclaim write anon sync I/O 14594 571 509 561
> Wake kswapd requests 7491 933 872 892
> Kswapd wakeups 814 778 731 780
> Kswapd pages scanned 7290822 15341158 11916436 13703442
> Kswapd pages reclaimed 3587336 3142496 3094392 3187151
> Kswapd reclaim write file async I/O 91975 32317 28022 29628
> Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
> Kswapd reclaim write file sync I/O 0 0 0 0
> Kswapd reclaim write anon sync I/O 0 0 0 0
> Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
> Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
>
> Total pages scanned 7425913 15460762 12025818 13804461
> Total pages reclaimed 3675935 3190031 3142255 3233822
> %age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
> %age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
> %age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
> Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
> Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
>
> Scanned/reclaimed ratios again look good with big improvements in
> efficiency. The scanned/written ratios also look much improved. With a
> better scanned/written ratio, there is an expectation that IO would be more
> efficient and indeed, the time spent in direct reclaim is much reduced by
> the full series and kswapd spends a little less time awake.
Wait. The reclaim efficiency got *worse*, didn't it? To reclaim
3,xxx,xxx pages, the number of pages we had to scan went from 7,xxx,xxx
up to 13,xxx,xxx?
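(Reading the table above both ways: total pages scanned went from 7,425,913
to 13,804,461 while pages reclaimed fell from 3,675,935 to 3,233,822, so the
reclaimed/scanned figure did drop, from 49.50% to 23.43%; the scanned/written
ratio moved the other way, from 28.66% down to 6.47%, and time stalled in
direct reclaim fell from 4588.93s to 2547.07s.)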
>
> ...
>
> I think this series is ready for much wider testing. The lowlumpy patches in
> particular should be relatively uncontroversial. While their largest impact
> can be seen in the high-order stress tests, they would also have an impact
> if SLUB was configured (these tests are based on slab), and stalls in lumpy
> reclaim could be partially responsible for some desktop stalling reports.
slub sucks :(
Is this patchset likely to have any impact on the "hey my net driver
couldn't do an order 3 allocation" reports? I guess not.
>
> ...
>