Date:	Mon, 4 Jul 2016 10:37:03 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Mel Gorman <mgorman@...hsingularity.net>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linux-MM <linux-mm@...ck.org>, Rik van Riel <riel@...riel.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Johannes Weiner <hannes@...xchg.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

On Fri, Jul 01, 2016 at 09:01:08PM +0100, Mel Gorman wrote:
> (Sorry for the resend, I accidentally sent the branch that still had the
> Signed-off-by's from mmotm applied, which is incorrect.)
> 
> Previous releases double accounted LRU stats on the zone and the node
> because it was required by should_reclaim_retry. The last patch in the
> series removes the double accounting. It's not integrated with the series
> as reviewers may not like the solution. If not, it can be safely dropped
> without a major impact to the results.
> 
> Changelog since v7
> o Rebase onto current mmots
> o Avoid double accounting of stats in node and zone
> o Kswapd will avoid more reclaim if an eligible zone is available
> o Remove some duplications of sc->reclaim_idx and classzone_idx
> o Print per-node stats in zoneinfo
> 
> Changelog since v6
> o Correct reclaim_idx when direct reclaiming for memcg
> o Also account LRU pages per zone for compaction/reclaim
> o Add page_pgdat helper with more efficient lookup
> o Init pgdat LRU lock only once
> o Slight optimisation to wake_all_kswapds
> o Always wake kcompactd when kswapd is going to sleep
> o Rebase to mmotm as of June 15th, 2016
> 
> Changelog since v5
> o Rebase and adjust to changes
> 
> Changelog since v4
> o Rebase on top of v3 of page allocator optimisation series
> 
> Changelog since v3
> o Rebase on top of the page allocator optimisation series
> o Remove RFC tag
> 
> This is the latest version of a series that moves LRUs from the zones to
> the node that is based upon 4.7-rc4 with Andrew's tree applied. While this
> is a current rebase, the test results were based on mmotm as of June 23rd.
> Conceptually, this series is simple but there are a lot of details. Some
> of the broad motivations for this are;
> 
> 1. The residency of a page partially depends on what zone the page was
>    allocated from.  This is partially combatted by the fair zone allocation
>    policy but that is a partial solution that introduces overhead in the
>    page allocator paths.
> 
> 2. Currently, reclaim on node 0 behaves slightly differently to node 1. For
>    example, direct reclaim scans in zonelist order and reclaims even if
>    the zone is over the high watermark, regardless of the age of pages
>    in that LRU. Kswapd, on the other hand, starts reclaim on the highest
>    unbalanced zone. A difference in the distribution of file/anon pages,
>    due to when they were allocated, can result in a difference in page
>    aging. While the fair zone allocation policy mitigates some of the
>    problems here, the page reclaim results on a multi-zone node will
>    always be different to those on a single-zone node, so a process's
>    reclaim behaviour partly depends on which node it was scheduled on.
> 
> 3. kswapd and the page allocator scan zones in the opposite order to
>    avoid interfering with each other. In the ideal case this prevents the
>    page allocator from using pages that were allocated very recently, but
>    it's sensitive to timing. When kswapd is reclaiming from lower zones
>    it works well, but during the rebalancing of the highest zone the page
>    allocator and kswapd interfere with each other. It's worse if the
>    highest zone is small and difficult to balance.
> 
> 4. slab shrinkers are node-based which makes it harder to identify the exact
>    relationship between slab reclaim and LRU reclaim.
> 
> The reason we have zone-based reclaim is that we used to have
> large highmem zones in common configurations and it was necessary
> to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> less of a concern as machines with lots of memory will (or should) use
> 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> rare. Machines that do use highmem should have relatively lower highmem:lowmem
> ratios than those we worried about in the past.

Hello Mel,

I absolutely agree with the direction. However, I have a concern about
highmem systems, as you already mentioned.

Embedded products still use highmem:lowmem ratios of about 2:1 ~ 3:1.
On such systems, the LRU churn caused by frequently skipping pages from
other zones might be significant for performance.

How big a highmem:lowmem ratio do you think becomes a problem?

> 
> Conceptually, moving to node LRUs should be easier to understand. The
> page allocator plays fewer tricks to game reclaim and reclaim behaves
> similarly on all nodes. 
> 
> The series has been tested on a 16 core UMA machine and a 2-socket 48
> core NUMA machine. The UMA results are presented in most cases as the NUMA
> machine behaved similarly.

I guess you have already run the tests below on various highmem systems
(e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing the
results?

> 
> pagealloc
> ---------
> 
> This is a microbenchmark that shows the benefit of removing the fair zone
> allocation policy. It was tested up to order-4 but only orders 0 and 1 are
> shown as the other orders were comparable.
> 
>                                            4.7.0-rc4                  4.7.0-rc4
>                                       mmotm-20160623                 nodelru-v8
> Min      total-odr0-1               490.00 (  0.00%)           463.00 (  5.51%)
> Min      total-odr0-2               349.00 (  0.00%)           325.00 (  6.88%)
> Min      total-odr0-4               288.00 (  0.00%)           272.00 (  5.56%)
> Min      total-odr0-8               250.00 (  0.00%)           235.00 (  6.00%)
> Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
> Min      total-odr0-32              223.00 (  0.00%)           205.00 (  8.07%)
> Min      total-odr0-64              217.00 (  0.00%)           202.00 (  6.91%)
> Min      total-odr0-128             214.00 (  0.00%)           207.00 (  3.27%)
> Min      total-odr0-256             242.00 (  0.00%)           242.00 (  0.00%)
> Min      total-odr0-512             272.00 (  0.00%)           265.00 (  2.57%)
> Min      total-odr0-1024            290.00 (  0.00%)           283.00 (  2.41%)
> Min      total-odr0-2048            302.00 (  0.00%)           296.00 (  1.99%)
> Min      total-odr0-4096            311.00 (  0.00%)           306.00 (  1.61%)
> Min      total-odr0-8192            314.00 (  0.00%)           309.00 (  1.59%)
> Min      total-odr0-16384           315.00 (  0.00%)           309.00 (  1.90%)
> Min      total-odr1-1               741.00 (  0.00%)           716.00 (  3.37%)
> Min      total-odr1-2               565.00 (  0.00%)           524.00 (  7.26%)
> Min      total-odr1-4               457.00 (  0.00%)           427.00 (  6.56%)
> Min      total-odr1-8               408.00 (  0.00%)           371.00 (  9.07%)
> Min      total-odr1-16              383.00 (  0.00%)           344.00 ( 10.18%)
> Min      total-odr1-32              378.00 (  0.00%)           334.00 ( 11.64%)
> Min      total-odr1-64              383.00 (  0.00%)           334.00 ( 12.79%)
> Min      total-odr1-128             376.00 (  0.00%)           342.00 (  9.04%)
> Min      total-odr1-256             381.00 (  0.00%)           343.00 (  9.97%)
> Min      total-odr1-512             388.00 (  0.00%)           349.00 ( 10.05%)
> Min      total-odr1-1024            386.00 (  0.00%)           356.00 (  7.77%)
> Min      total-odr1-2048            389.00 (  0.00%)           362.00 (  6.94%)
> Min      total-odr1-4096            389.00 (  0.00%)           362.00 (  6.94%)
> Min      total-odr1-8192            389.00 (  0.00%)           362.00 (  6.94%)
> 
> This shows a steady improvement throughout. The primary benefit is from
> reduced system CPU usage which is obvious from the overall times;
> 
>            4.7.0-rc4   4.7.0-rc4
>         mmotm-20160623  nodelru-v8
> User          191.39      191.61
> System       2651.24     2504.48
> Elapsed      2904.40     2757.01
> 
> The vmstats also showed that the fair zone allocation policy was definitely
> removed as can be seen here;
> 
> 
>                              4.7.0-rc4   4.7.0-rc4
>                           mmotm-20160623  nodelru-v8
> DMA32 allocs               28794771816           0
> Normal allocs              48432582848 77227356392
> Movable allocs                       0           0
> 
> tiobench on ext4
> ----------------
> 
> tiobench is a benchmark that artificially benefits if old pages remain resident
> while new pages get reclaimed. The fair zone allocation policy mitigates this
> problem so pages age fairly. While the benchmark has problems, it is important
> that tiobench performance remains constant as it implies that page aging
> problems that the fair zone allocation policy fixes are not re-introduced.
> 
>                                          4.7.0-rc4             4.7.0-rc4
>                                     mmotm-20160623            nodelru-v8
> Min      PotentialReadSpeed        89.65 (  0.00%)       90.34 (  0.77%)
> Min      SeqRead-MB/sec-1          82.68 (  0.00%)       83.13 (  0.54%)
> Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.15 ( -0.84%)
> Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.23 ( -1.20%)
> Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.25 (  0.52%)
> Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.76 (  0.84%)
> Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.95 (  7.95%)
> Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.94 ( -1.05%)
> Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.46 (  2.10%)
> Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.58 ( -1.86%)
> Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.93 (  7.22%)
> Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       78.84 (  3.18%)
> Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.35 ( -1.03%)
> Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       78.69 ( -1.70%)
> Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       71.38 ( -2.06%)
> Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       75.81 ( -0.13%)
> Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.12 ( -5.08%)
> Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.02 (  0.00%)
> Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.99 ( -5.71%)
> Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
> Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.89 ( -3.26%)
> 
> This shows that the series has little or no impact on tiobench, which is
> desirable. It indicates that the fair zone allocation policy was removed
> in a manner that didn't reintroduce one class of page aging bug. There
> were only minor differences in overall reclaim activity
> 
>                              4.7.0-rc4   4.7.0-rc4
>                           mmotm-20160623  nodelru-v8
> Minor Faults                    645838      644036
> Major Faults                       573         593
> Swap Ins                             0           0
> Swap Outs                            0           0
> Allocation stalls                   24           0
> DMA allocs                           0           0
> DMA32 allocs                  46041453    44154171
> Normal allocs                 78053072    79865782
> Movable allocs                       0           0
> Direct pages scanned             10969       54504
> Kswapd pages scanned          93375144    93250583
> Kswapd pages reclaimed        93372243    93247714
> Direct pages reclaimed           10969       54504
> Kswapd efficiency                  99%         99%
> Kswapd velocity              13741.015   13711.950
> Direct efficiency                 100%        100%
> Direct velocity                  1.614       8.014
> Percentage direct scans             0%          0%
> Zone normal velocity          8641.875   13719.964
> Zone dma32 velocity           5100.754       0.000
> Zone dma velocity                0.000       0.000
> Page writes by reclaim           0.000       0.000
> Page writes file                     0           0
> Page writes anon                     0           0
> Page reclaim immediate              37          54
> 
> kswapd activity was roughly comparable. There were differences in direct
> reclaim activity but negligible in the context of the overall workload
> (velocity of 8 pages per second with the patches applied, 1.6 pages per
> second in the baseline kernel).

Hmm, nodelru's allocation stall count above is zero, so how does the
direct page scanning/reclaiming happen?

Above, the DMA32 allocs for nodelru are almost the same, but the zone
dma32 velocity is zero. What does that mean?

> 
> pgbench read-only large configuration on ext4
> ---------------------------------------------
> 
> pgbench is a database benchmark that can be sensitive to page reclaim
> decisions. This also checks if removing the fair zone allocation policy
> is safe.
> 
> pgbench Transactions
>                         4.7.0-rc4             4.7.0-rc4
>                    mmotm-20160623            nodelru-v8
> Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
> Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
> Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
> Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
> Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
> Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
> 
> Negligible differences again. As with tiobench, overall reclaim activity
> was comparable.
> 
> bonnie++ on ext4
> ----------------
> 
> No interesting performance difference, negligible differences on reclaim
> stats.
> 
> paralleldd on ext4
> ------------------
> 
> This workload uses varying numbers of dd instances to read large amounts of
> data from disk.
> 
>                                4.7.0-rc3             4.7.0-rc3
>                           mmotm-20160615         nodelru-v7r17
> Amean    Elapsd-1       181.57 (  0.00%)      179.63 (  1.07%)
> Amean    Elapsd-3       188.29 (  0.00%)      183.68 (  2.45%)
> Amean    Elapsd-5       188.02 (  0.00%)      181.73 (  3.35%)
> Amean    Elapsd-7       186.07 (  0.00%)      184.11 (  1.05%)
> Amean    Elapsd-12      188.16 (  0.00%)      183.51 (  2.47%)
> Amean    Elapsd-16      189.03 (  0.00%)      181.27 (  4.10%)
> 
>            4.7.0-rc3   4.7.0-rc3
>         mmotm-20160615  nodelru-v7r17
> User         1439.23     1433.37
> System       8332.31     8216.01
> Elapsed      3619.80     3532.69
> 
> There is a slight gain in performance, some of which is from the reduced
> system CPU usage. There are minor differences in reclaim activity but
> nothing significant
> 
>                              4.7.0-rc3   4.7.0-rc3
>                           mmotm-20160615  nodelru-v7r17
> Minor Faults                    362486      358215
> Major Faults                      1143        1113
> Swap Ins                            26           0
> Swap Outs                         2920         482
> DMA allocs                           0           0
> DMA32 allocs                  31568814    28598887
> Normal allocs                 46539922    49514444
> Movable allocs                       0           0
> Allocation stalls                    0           0
> Direct pages scanned                 0           0
> Kswapd pages scanned          40886878    40849710
> Kswapd pages reclaimed        40869923    40835207
> Direct pages reclaimed               0           0
> Kswapd efficiency                  99%         99%
> Kswapd velocity              11295.342   11563.344
> Direct efficiency                 100%        100%
> Direct velocity                  0.000       0.000
> Slabs scanned                   131673      126099
> Direct inode steals                 57          60
> Kswapd inode steals                762          18
> 
> It basically shows that kswapd was active at roughly the same rate in
> both kernels. There was also comparable slab scanning activity and direct
> reclaim was avoided in both cases. There appears to be a large difference
> in numbers of inodes reclaimed but the workload has few active inodes and
> is likely a timing artifact. It's interesting to note that the node-lru
> did not swap in any pages but given the low swap activity, it's unlikely
> to be significant.
> 
> stutter
> -------
> 
> stutter simulates a simple workload. One part uses a lot of anonymous
> memory, a second measures mmap latency and a third copies a large file.
> The primary metric is checking for mmap latency.
> 
> stutter
>                              4.7.0-rc4             4.7.0-rc4
>                         mmotm-20160623            nodelru-v8
> Min         mmap     16.6283 (  0.00%)     16.1394 (  2.94%)
> 1st-qrtle   mmap     54.7570 (  0.00%)     55.2975 ( -0.99%)
> 2nd-qrtle   mmap     57.3163 (  0.00%)     57.5230 ( -0.36%)
> 3rd-qrtle   mmap     58.9976 (  0.00%)     58.0537 (  1.60%)
> Max-90%     mmap     59.7433 (  0.00%)     58.3910 (  2.26%)
> Max-93%     mmap     60.1298 (  0.00%)     58.4801 (  2.74%)
> Max-95%     mmap     73.4112 (  0.00%)     58.5537 ( 20.24%)
> Max-99%     mmap     92.8542 (  0.00%)     58.9673 ( 36.49%)
> Max         mmap   1440.6569 (  0.00%)    137.6875 ( 90.44%)
> Mean        mmap     59.3493 (  0.00%)     55.5153 (  6.46%)
> Best99%Mean mmap     57.2121 (  0.00%)     55.4194 (  3.13%)
> Best95%Mean mmap     55.9113 (  0.00%)     55.2813 (  1.13%)
> Best90%Mean mmap     55.6199 (  0.00%)     55.1044 (  0.93%)
> Best50%Mean mmap     53.2183 (  0.00%)     52.8330 (  0.72%)
> Best10%Mean mmap     45.9842 (  0.00%)     42.3740 (  7.85%)
> Best5%Mean  mmap     43.2256 (  0.00%)     38.8660 ( 10.09%)
> Best1%Mean  mmap     32.9388 (  0.00%)     27.7577 ( 15.73%)
> 
> This shows a number of improvements with the worst-case outlier greatly
> improved.
> 
> Some of the vmstats are interesting
> 
>                              4.7.0-rc4   4.7.0-rc4
>                           mmotm-20160623  nodelru-v8
> Swap Ins                           163         239
> Swap Outs                            0           0
> Allocation stalls                 2603           0
> DMA allocs                           0           0
> DMA32 allocs                 618719206  1303037965
> Normal allocs                891235743   229914091
> Movable allocs                       0           0
> Direct pages scanned            216787        3173
> Kswapd pages scanned          50719775    41732250
> Kswapd pages reclaimed        41541765    41731168
> Direct pages reclaimed          209159        3173
> Kswapd efficiency                  81%         99%
> Kswapd velocity              16859.554   14231.043
> Direct efficiency                  96%        100%
> Direct velocity                 72.061       1.082
> Percentage direct scans             0%          0%
> Zone normal velocity          8431.777   14232.125
> Zone dma32 velocity           8499.838       0.000
> Zone dma velocity                0.000       0.000
> Page writes by reclaim     6215049.000       0.000
> Page writes file               6215049           0
> Page writes anon                     0           0
> Page reclaim immediate           70673         143
> Sector Reads                  81940800    81489388
> Sector Writes                100158984    99161860
> Page rescued immediate               0           0
> Slabs scanned                  1366954       21196
> 
> While this is not guaranteed in all cases, this particular test showed
> a large reduction in direct reclaim activity. It's also worth noting
> that no page writes were issued from reclaim context.
> 
> This series is not without its hazards. There are at least three areas
> that I'm concerned with even though I could not reproduce any problems in
> that area.
> 
> 1. Reclaim/compaction is going to be affected because the amount of reclaim is
>    no longer targeted at a specific zone. Compaction works on a per-zone basis
>    so there is no guarantee that reclaiming a few THPs' worth of pages will
>    have a positive impact on compaction success rates.
> 
> 2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
>    are called is now different. This may or may not be a problem but if it
>    is, it'll be because shrinkers are not called enough and some balancing
>    is required.
> 
> 3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
>    distributed between zones and the fair zone allocation policy used to do
>    something very similar for anon. The distribution is now different but not
>    necessarily in any way that matters but it's still worth bearing in mind.
