Message-ID: <CALWz4iyt94KdRXTwr07+s5TPYtcwBX7xScQcqUvwVCnDMLH_TA@mail.gmail.com>
Date:	Wed, 11 Apr 2012 16:37:00 -0700
From:	Ying Han <yinghan@...gle.com>
To:	Mel Gorman <mgorman@...e.de>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Konstantin Khlebnikov <khlebnikov@...nvz.org>,
	Hugh Dickins <hughd@...gle.com>, Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/3] Removal of lumpy reclaim V2

On Wed, Apr 11, 2012 at 9:38 AM, Mel Gorman <mgorman@...e.de> wrote:
> Andrew, these three patches should replace the two lumpy reclaim patches
> you already have. When applied, there is no functional difference (slight
> changes in layout), but the changelogs are better.
>
> Changelog since V1
> o Ying pointed out that compaction was waiting on page writeback and the
>  description of the patches in V1 was broken. This version is the same
>  except that it is structured differently to explain that waiting on
>  page writeback is removed.
> o Rebased to v3.4-rc2
>
> This series removes lumpy reclaim and some stalling logic that was
> unintentionally being used by memory compaction. The end result
> is that stalling on dirty pages during page reclaim now depends on
> wait_iff_congested().
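>
> As a rough sketch of the idea (paraphrased from memory, not the literal
> mm/vmscan.c code), the remaining throttle for dirty/writeback pages at the
> end of shrink_inactive_list() is along these lines, instead of waiting on
> individual pages under writeback:
>
>       /*
>        * Paraphrased: if a large share of the pages isolated by the scan
>        * were already under writeback, back off briefly, but only if the
>        * zone's backing devices are actually congested.
>        */
>       if (nr_writeback && nr_writeback >=
>                       (nr_taken >> (DEF_PRIORITY - priority)))
>               wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);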
>
> Four kernels were compared
>
> 3.3.0     vanilla
> 3.4.0-rc2 vanilla
> 3.4.0-rc2 lumpyremove-v2 is patch one from this series
> 3.4.0-rc2 nosync-v2r3 is the full series
>
> Removing lumpy reclaim saves almost 900 bytes of text, whereas the full
> series removes about 1200 bytes.
>
>   text    data     bss     dec     hex filename
> 6740375 1927944 2260992 10929311         a6c49f vmlinux-3.4.0-rc2-vanilla
> 6739479 1927944 2260992 10928415         a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2
> 6739159 1927944 2260992 10928095         a6bfdf vmlinux-3.4.0-rc2-nosync-v2
>
> There are behaviour changes in the series, so tests were run with
> monitoring of ftrace events. This monitoring distorts the performance
> results, but it makes the new behaviour clearer.
>
> fs-mark running in a threaded configuration showed little of interest as
> it did not push reclaim aggressively.
>
> FS-Mark Multi Threaded
>                        3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
> Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
> Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
> Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
> Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
> Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
> Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
> Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
> Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
> MMTests Statistics: duration
> Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
> User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
> Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
>
> MMTests Statistics: vmstat
> Page Ins                                       80532       82212       81420       79480
> Page Outs                                  111434984   111456240   111437376   111582628
> Swap Ins                                           0           0           0           0
> Swap Outs                                          0           0           0           0
> Direct pages scanned                           44881       27889       27453       34843
> Kswapd pages scanned                        25841428    25860774    25861233    25843212
> Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
> Direct pages reclaimed                         44881       27889       27453       34843
> Kswapd efficiency                                99%         99%         99%         99%
> Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
> Direct efficiency                               100%        100%        100%        100%
> Direct velocity                               37.783      23.375      23.031      29.188
> Percentage direct scans                           0%          0%          0%          0%
>
> ftrace showed that there was no stalling on writeback or pages submitted
> for IO from reclaim context.
>
>
> postmark was similar; while it was a little more interesting, it also did
> not push reclaim heavily.
>
> POSTMARK
>                                     3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
> Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
> Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
> Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
> Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
> Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
> Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
> Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
>
> MMTests Statistics: duration
> Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
> User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
> Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
>
> MMTests Statistics: vmstat
> Page Ins                                    13710192    13729032    13727944    13760136
> Page Outs                                   43071140    42987228    42733684    42931624
> Swap Ins                                           0           0           0           0
> Swap Outs                                          0           0           0           0
> Direct pages scanned                               0           0           0           0
> Kswapd pages scanned                         9941613     9937443     9939085     9929154
> Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
> Direct pages reclaimed                             0           0           0           0
> Kswapd efficiency                                99%         99%         99%         99%
> Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
> Direct efficiency                               100%        100%        100%        100%
> Direct velocity                                0.000       0.000       0.000       0.000
>
> It looks here like the full series regresses performance, but as ftrace
> showed no usage of wait_iff_congested() or sync reclaim, I am assuming the
> difference is disruption due to monitoring. Other data such as memory
> usage, page IO and swap IO all looked similar.
>
> Running a benchmark with a plain DD showed nothing very interesting. The
> full series stalled in wait_iff_congested() slightly less but stall times
> on vanilla kernels were marginal.
>
> Running a benchmark that hammered on file-backed mappings showed stalls
> due to congestion but no stalls in sync writeback.
>
> MICRO
>                                     3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
> MMTests Statistics: duration
> Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
> User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
> Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
>
> MMTests Statistics: vmstat
> Page Ins                                      108712      120708       97224      110344
> Page Outs                                  155514576   156017404   155813676   156193256
> Swap Ins                                           0           0           0           0
> Swap Outs                                          0           0           0           0
> Direct pages scanned                         2599253     1550480     2512822     2414760
> Kswapd pages scanned                        69742364    71150694    68839041    69692533
> Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
> Direct pages reclaimed                         53693       94750       61792       75205
> Kswapd efficiency                                49%         48%         50%         49%
> Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
> Direct efficiency                                 2%          6%          2%          3%
> Direct velocity                             1432.174     845.464    1379.807    1317.446
> Percentage direct scans                           3%          2%          3%          3%
> Page writes by reclaim                             0           0           0           0
> Page writes file                                   0           0           0           0
> Page writes anon                                   0           0           0           0
> Page reclaim immediate                             0           0           0        1218
> Page rescued immediate                             0           0           0           0
> Slabs scanned                                  15360       16384       13312       16384
> Direct inode steals                                0           0           0           0
> Kswapd inode steals                             4340        4327        1630        4323
>
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                 0          0          0          0
> Direct time   congest     waited               0ms        0ms        0ms        0ms
> Direct full   congest     waited                 0          0          0          0
> Direct number conditional waited               900        870        754        789
> Direct time   conditional waited               0ms        0ms        0ms       20ms
> Direct full   conditional waited                 0          0          0          0
> KSwapd number congest     waited              2106       2308       2116       1915
> KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
> KSwapd full   congest     waited              1346       1530       1202       1278
> KSwapd number conditional waited             12922      16320      10943      14670
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
>
>
> Reclaim statistics are not radically changed. The stall times in kswapd
> are massive, but it is clear they are due to calls to congestion_wait(),
> almost certainly the call in balance_pgdat(). Otherwise, stalls due to
> dirty pages are non-existent.
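>
> For reference, the balance_pgdat() call I mean is along these lines
> (paraphrased from memory; the exact condition may differ):
>
>       /*
>        * kswapd is making little progress at a low priority: take a
>        * short nap before another pass across the zones.
>        */
>       if (total_scanned && priority < DEF_PRIORITY - 2)
>               congestion_wait(BLK_RW_ASYNC, HZ/10);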
>
> I ran a benchmark that stressed high-order allocation. This is a very
> artificial load but it has been used in the past to evaluate lumpy reclaim
> and compaction. Generally I look at allocation success rates and latency
> figures.
>
> STRESS-HIGHALLOC
>                 3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
> Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
> Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
> while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
>
> MMTests Statistics: duration
> Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
> User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
> Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
>
> MMTests Statistics: vmstat
> Page Ins                                     4486020     2807256     2855944     2876244
> Page Outs                                    7261600     7973688     7975320     7986120
> Swap Ins                                       31694           0           0           0
> Swap Outs                                      98179           0           0           0
> Direct pages scanned                           53494       57731       34406      113015
> Kswapd pages scanned                         6271173     1287481     1278174     1219095
> Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
> Direct pages reclaimed                          1468       14564       16649       92456
> Kswapd efficiency                                32%         99%         98%         98%
> Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
> Direct efficiency                                 2%         25%         48%         81%
> Direct velocity                               46.047      50.092      29.672      97.306
> Percentage direct scans                           0%          4%          2%          8%
> Page writes by reclaim                       1616049           0           0           0
> Page writes file                             1517870           0           0           0
> Page writes anon                               98179           0           0           0
> Page reclaim immediate                        103778       27339        9796       17831
> Page rescued immediate                             0           0           0           0
> Slabs scanned                                1096704      986112      980992      998400
> Direct inode steals                              223      215040      216736      247881
> Kswapd inode steals                           175331       61548       68444       63066
> Kswapd skipped wait                            21991           0           1           0
> THP fault alloc                                    1         135         125         134
> THP collapse alloc                               393         311         228         236
> THP splits                                        25          13           7           8
> THP fault fallback                                 0           0           0           0
> THP collapse fail                                  3           5           7           7
> Compaction stalls                                865        1270        1422        1518
> Compaction success                               370         401         353         383
> Compaction failures                              495         869        1069        1135
> Compaction pages moved                        870155     3828868     4036106     4423626
> Compaction move failure                        26429       23865       29742       27514
>
> Success rates are completely hosed for 3.4-rc2, which is almost certainly
> due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. I
> expected this would happen for kswapd and impair allocation success rates
> (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much of
> a difference: 80% less scanning and 37% less reclaim by kswapd.
>
> In comparison, reclaim/compaction is not aggressive and gives up easily,
> which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be
> much more aggressive about reclaim/compaction than THP allocations are. The
> stress test above allocates like neither THP nor hugetlbfs, but is much
> closer to THP.
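>
> To make that concrete, the gfp flags differ roughly as follows (flag
> lists paraphrased; see gfp.h and mm/hugetlb.c for the real definitions):
>
>       /* hugetlbfs pool pages: __GFP_REPEAT keeps retrying reclaim/compaction */
>       page = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_COMP |
>                               __GFP_REPEAT | __GFP_NOWARN,
>                               huge_page_order(h));
>
>       /* THP faults: GFP_TRANSHUGE includes __GFP_NORETRY, so one attempt */
>       page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);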
>
> Mainline is now impaired in terms of high order allocation under heavy load
> although I do not know to what degree as I did not test with __GFP_REPEAT.
> Keep this in mind for bugs related to hugepage pool resizing, THP allocation
> and high order atomic allocation failures from network devices.
>
> In terms of congestion throttling, I see the following for this test
>
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                 3          0          0          0
> Direct time   congest     waited               0ms        0ms        0ms        0ms
> Direct full   congest     waited                 0          0          0          0
> Direct number conditional waited               957        512       1081       1075
> Direct time   conditional waited               0ms        0ms        0ms        0ms
> Direct full   conditional waited                 0          0          0          0
> KSwapd number congest     waited                36          4          3          5
> KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
> KSwapd full   congest     waited                30          4          3          5
> KSwapd number conditional waited             88514        197        332        542
> KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
> KSwapd full   conditional waited                49          0          0          0
>
> The "conditional waited" times are the most interesting as this is directly
> impacted by the number of dirty pages encountered during scan. As lumpy
> reclaim is no longer scanning contiguous ranges, it is finding fewer dirty
> pages. This brings wait times from about 5 seconds to 0. kswapd itself is
> still calling congestion_wait() so it'll still stall but it's a lot less.
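>
> (For anyone reading the table, the "conditional waited" lines count
> calls to wait_iff_congested(), which only sleeps when the zone has been
> flagged as congested; otherwise it just yields. Simplified, it behaves
> roughly like this:
>
>       /* simplified paraphrase of wait_iff_congested() */
>       if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
>           !zone_is_reclaim_congested(zone)) {
>               cond_resched();
>               return 0;               /* counted, but no time slept */
>       }
>       return congestion_wait(sync, timeout);
>
> That is why the call counts can be large while the reported wait time
> stays at 0ms.)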
>
> In terms of the type of IO we were doing, I see this
>
> FTrace Reclaim Statistics: mm_vmscan_writepage
> Direct writes anon  sync                         0          0          0          0
> Direct writes anon  async                        0          0          0          0
> Direct writes file  sync                         0          0          0          0
> Direct writes file  async                        0          0          0          0
> Direct writes mixed sync                         0          0          0          0
> Direct writes mixed async                        0          0          0          0
> KSwapd writes anon  sync                         0          0          0          0
> KSwapd writes anon  async                    91682          0          0          0
> KSwapd writes file  sync                         0          0          0          0
> KSwapd writes file  async                   822629          0          0          0
> KSwapd writes mixed sync                         0          0          0          0
> KSwapd writes mixed async                        0          0          0          0
>
> In 3.3, kswapd was doing a bunch of async writes of pages, but
> reclaim/compaction never reached a point where it was doing sync
> IO. This does not guarantee that reclaim/compaction was not calling
> wait_on_page_writeback(), but I would consider it unlikely. It indicates
> that merging patches 2 and 3 to stop reclaim/compaction calling
> wait_on_page_writeback() should be safe.
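>
> For context, the call being removed is the sync-reclaim path in
> shrink_page_list(), which looked roughly like this (paraphrased):
>
>       if (PageWriteback(page)) {
>               /*
>                * Sync reclaim could not queue new writeback itself, but
>                * on finding a page already under writeback it waited for
>                * the IO to complete before continuing.
>                */
>               if (may_enter_fs && (sc->reclaim_mode & RECLAIM_MODE_SYNC))
>                       wait_on_page_writeback(page);
>               else {
>                       unlock_page(page);
>                       goto keep_lumpy;
>               }
>       }
>
> With the series applied, that branch is gone and reclaim/compaction can
> no longer block on writeback of an individual page.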
>
>  include/trace/events/vmscan.h |   40 ++-----
>  mm/vmscan.c                   |  263 ++++-------------------------------------
>  2 files changed, 37 insertions(+), 266 deletions(-)
>
> --
> 1.7.9.2
>

This might be a naive question, but what do we do for users who have the
following in their .config file?

# CONFIG_COMPACTION is not set

--Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
