Date: Thu, 11 Jan 2024 18:45:05 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Kairui Song <ryncsn@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Charan Teja Kalla <quic_charante@...cinc.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, stable@...r.kernel.org
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Jan 11, 2024 at 11:24 AM Kairui Song <ryncsn@...il.com> wrote:
>
> Yu Zhao <yuzhao@...gle.com> wrote on Thu, Jan 11, 2024 at 15:02:
> > Could you try the attached patch on the mainline v6.7 and see how it
> > compares with the results above? Thanks.
>
> Hi Yu,
>
> Thanks for the patch. It helped to some degree, but it was not as
> effective. On that exclusive bare-metal machine, I set everything up
> again, rebased onto mainline 6.7, and reran the test:
>
> Refault distance series:
> ==================================================================
> Execution Results after 901 seconds
> ------------------------------------------------------------------
>                   Executed        Time (µs)       Rate
>   STOCK_LEVEL     4224            27030724835.9   0.16 txn/s
> ------------------------------------------------------------------
>   TOTAL           4224            27030724835.9   0.16 txn/s
>
> workingset_nodes 111349
> workingset_refault_anon 261331
> workingset_refault_file 42862224
> workingset_activate_anon 0
> workingset_activate_file 13803763
> workingset_restore_anon 250743
> workingset_restore_file 599031
> workingset_nodereclaim 23708
>
> memcg    67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
>  node     0
>          10     347163     518379      207449
>                      0          0r          2e          0p      33017r    1726749e          0p
>                      1          0r          0e          0p       7278r     496268e          0p
>                      2          0r          0e          0p      19789r      55418e          0p
>                      3          0r          0e          0p          0r          0e    4747801p
>                                 0           0           0           0           0           0
>          11     283279     154400     4791558
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>          12     158723     431513       37647
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>          13      44775     104986       27258
>                      0        576R        982T          0     2488768R    5769505T          0
>                      1          0R          0T          0     2335910R    3357277T          0
>                      2          0R          0T          0      647398R     753021T          0
>                      3          0R         20T          0       52725R    4740516T          0
>                           2819476L      31196O    2551928Y       8298N       5549F       5329A
>
> Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> dm-0             12.81       546.32        39.04         0.00     520178      37171          0
> dm-1              0.05         1.10         0.00         0.00       1044          0          0
> nvme0n1          13.17       561.99        41.19         0.00     535103      39219          0
> nvme1n1        5220.39    227385.96      1028.17         0.00  216505545     978976          0
> zram0          2440.61      2856.32      6907.13         0.00    2719644    6576628          0
>
>                total        used        free      shared  buff/cache   available
> Mem:           31830       11251         332           0       20246       20144
> Swap:          31829        3761       28068
>
> Your attachment:
> ==================================================================
> Execution Results after 905 seconds
> ------------------------------------------------------------------
>                   Executed        Time (µs)       Rate
>   STOCK_LEVEL     4070            27170023578.4   0.15 txn/s
> ------------------------------------------------------------------
>   TOTAL           4070            27170023578.4   0.15 txn/s
>
> workingset_nodes 121864
> workingset_refault_anon 430917
> workingset_refault_file 42915675
> workingset_activate_anon 100194
> workingset_activate_file 21619480
> workingset_restore_anon 100194
> workingset_restore_file 165054
> workingset_nodereclaim 26851
>
> memcg    65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
>  node     0
>           8     418689      55033      106862
>                      0         16r         17e          0p    2789768r    6034831e          0p
>                      1          0r          0e          0p     239664r     490278e          0p
>                      2          0r          0e          0p      79145r     126408e          0p
>                      3         23r         23e          0p      23404r      27107e    4736933p
>                                 0           0           0           0           0           0
>           9     322798     237713     4759110
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>          10     182729     942701        5348
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>          11     120287        560         375
>                      0      25187R      29324T          0     1679308R    4256147T          0
>                      1          0R          0T          0      153592R     364122T          0
>                      2          0R          0T          0       51825R      98646T          0
>                      3        101R       2944T          0       13985R    4743515T          0
>                           7702245L     865749O    6514831Y      16843N      15088F      14167A
>
> Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
> dm-0             11.49       489.97        41.80         0.00     488006      41633          0
> dm-1              0.05         1.05         0.00         0.00       1044          0          0
> nvme0n1          11.83       504.95        43.86         0.00     502932      43682          0
> nvme0n1        5145.44    218803.29       984.46         0.00  217928081     980520          0
> zram0          3164.11      4399.55      8257.84         0.00    4381952    8224812          0
>
>                total        used        free      shared  buff/cache   available
> Mem:           31830       11583         310           1       19935       19809
> Swap:          31829        3710       28119
>
> The refault distance series still has better performance and lower total IO.
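
The reported rates can be cross-checked from the two result tables quoted above. The following is a minimal sketch, assuming that the "Rate" column is simply "Executed" divided by the "Time (µs)" column converted to seconds; the function name `txn_rate` and the `runs` dictionary are illustrative and not part of the benchmark tool.

```python
# Cross-check of the STOCK_LEVEL rates quoted in the two bare-metal runs.
# Assumption: Rate = Executed / (Time in microseconds / 1e6).

def txn_rate(executed: int, total_time_us: float) -> float:
    """Return transactions per second for a given count and total time in µs."""
    return executed / (total_time_us / 1_000_000)

# (Executed, Time in µs) copied from the tables in this message.
runs = {
    "refault distance series": (4224, 27030724835.9),
    "attached patch": (4070, 27170023578.4),
}

for name, (executed, time_us) in runs.items():
    print(f"{name}: {txn_rate(executed, time_us):.2f} txn/s")

# Relative difference in executed transactions between the two runs.
gain = runs["refault distance series"][0] / runs["attached patch"][0] - 1
print(f"refault distance series executed {gain:.1%} more transactions")
```

Under that assumption the computed rates round to the 0.16 and 0.15 txn/s shown in the tables, and the difference in executed transactions on this machine is a few percent.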
>
> Similar result on that VM:
> ==================================================================
> Execution Results after 907 seconds
> ------------------------------------------------------------------
>                   Executed        Time (µs)       Rate
>   STOCK_LEVEL     1667            27151581934.5   0.06 txn/s
> ------------------------------------------------------------------
>   TOTAL           1667            27151581934.5   0.06 txn/s
>
> The refault distance series had about 2500 - 2600 txns, while mainline
> 6.7 had about 800 - 900 txns.
>
> Loop test so far:
> Using the refault distance series (previous result; it doesn't change much anyway):
>   STOCK_LEVEL     2605            27120667462.8   0.10 txn/s
>   STOCK_LEVEL     3000            27106854857.2   0.11 txn/s
>   STOCK_LEVEL     2925            27066601064.4   0.11 txn/s
>   STOCK_LEVEL     2757            27035248005.2   0.10 txn/s
>   STOCK_LEVEL     1325            28053716046.8   0.05 txn/s
>   STOCK_LEVEL     717             27455091366.3   0.03 txn/s
>   STOCK_LEVEL     967             27404085208.2   0.04 txn/s
> Refault stat here:
> workingset_refault_anon 109337
> workingset_refault_file 191249716
>
> Using the attached patch:
> STOCK_LEVEL     1667            27151581934.5   0.06 txn/s
> STOCK_LEVEL     2999            27085125092.3   0.11 txn/s
> STOCK_LEVEL     2874            27120635371.2   0.11 txn/s
> STOCK_LEVEL     2658            27139142413.9   0.10 txn/s
> STOCK_LEVEL     1254            27526009063.7   0.05 txn/s
> STOCK_LEVEL     993             28065506801.8   0.04 txn/s
> STOCK_LEVEL     954             27226012906.3   0.04 txn/s
> Refault stat here:
> workingset_refault_anon 383579
> workingset_refault_file 205493832
>
> The peak performance is almost equal, but it still starts slow, and
> refaults are higher too. The file refault count might be skewed by some
> IO layer issue, but the anon refault count is always accurate.
>
> I see the improvements you made in the attached patch, and I think
> they actually do not conflict with the refault distance series.
> Maybe they can be combined for an even better result.
>
> Refault distance (which was originally used by the active/inactive LRU)
> is used here to give evicted pages priorities based on eviction distance
> and to add extra feedback to PID and gen, while the PID info recorded in
> page flags/shadow represents a page's access pattern before eviction,
> and all the checks and logic around it can also be improved.
>
> One critical effect of the refault distance series that boosts the
> MongoDB startup (and I haven't seen any negative effect of it on other
> tests / workloads / benchmarks yet, except the overhead of the memcg
> statistics themselves) is that it prevents overprotecting of tier 0
> pages: that is, a tier 0 page that is evicted but refaulted very quickly
> (refault distance < LRU / MAX_NR_GEN; this value may be worth some more
> adjustment, but with LRU / MAX_NR_GEN it can be imagined as having a
> small shadow gen holding these page shadows...) will be categorised as
> tier 1 and get protected. Otherwise, if I got everything right, when
> most pages are stuck in tier 0 and keep refaulting, tier 0 will have a
> very high refault rate, and no pages will be protected until randomness
> causes quick repeated reads of some pages, so that they get promoted to
> tier 3 and get protected.
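
The tier adjustment described in the paragraph above can be illustrated with a small model. This is a minimal sketch based only on that description, not the code from the series: the function name `tier_on_refault`, its parameters, and the value 4 for MAX_NR_GEN are assumptions made for illustration, and "LRU" is taken to mean the LRU size in pages.

```python
# Illustrative model of the tier 0 -> tier 1 adjustment on refault.
# Assumptions (not taken from the patch itself): names, signature, and
# MAX_NR_GEN = 4; lru_size stands for "LRU" in the description above.

MAX_NR_GEN = 4

def tier_on_refault(recorded_tier: int, refault_distance: int, lru_size: int) -> int:
    """Return the tier a refaulted page is accounted to.

    A page that was in tier 0 when evicted but came back within
    lru_size / MAX_NR_GEN evictions is treated as tier 1, so that it can
    be protected; every other page keeps the tier recorded in its shadow.
    """
    threshold = lru_size // MAX_NR_GEN
    if recorded_tier == 0 and refault_distance < threshold:
        return 1
    return recorded_tier

# A tier 0 page refaulting after 100 evictions on a 1000-page LRU is
# bumped to tier 1; one refaulting after 600 evictions stays in tier 0.
print(tier_on_refault(0, 100, 1000))
print(tier_on_refault(0, 600, 1000))
```

The threshold is the part the message flags as possibly needing more tuning; in this sketch it is a single integer division, so experimenting with a different divisor only changes one line.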
>
> Now min_seq contains lower tier pages, and new pages will be added to
> min_seq too, so min_seq will stay for a long time, while min_seq + 1
> holds protected full-ref tier 3 pages, and they stay long enough to get
> promoted to tier 3 again, so they will always be kept in memory.
> Now MongoDB will perform well even without the refault distance series,
> but reaching this state may take a long time (~15 min for the MongoDB
> test on a SATA SSD, which is based on a real workload), long enough to
> cause real issues.
>
> And this also means PID won't react to workload change fast enough.
>
> Also, the refs value of refaulted anon pages is adjusted by refault
> distance in the series: it tries to split the whole LRU into at least
> two gens for refaulted pages (only pages with refault distance < LRU /
> MIN_NR_GEN will have full refs set; others will have refs - 1 set as a
> penalty for pages that were evicted and unused for a long time, which
> is consistent with LRU's nature). This actually seems to have decreased
> refaults of anon pages.
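
The refs penalty for anon refaults described above can be sketched the same way. Again this is only a model of the description, under stated assumptions: the function name `refs_on_anon_refault`, the `full_refs` parameter, and the value 2 for MIN_NR_GEN are hypothetical, and "LRU" is taken to mean the LRU size in pages.

```python
# Illustrative model of the refs adjustment for refaulted anon pages.
# Assumptions (not taken from the patch itself): names, signature, and
# MIN_NR_GEN = 2; lru_size stands for "LRU" in the description above.

MIN_NR_GEN = 2

def refs_on_anon_refault(full_refs: int, refault_distance: int, lru_size: int) -> int:
    """Return the refs value to set on a refaulted anon page.

    Pages that come back within lru_size / MIN_NR_GEN evictions get the
    full refs value; pages that were evicted and unused for longer get
    one fewer as a penalty (never below zero).
    """
    if refault_distance < lru_size // MIN_NR_GEN:
        return full_refs
    return max(full_refs - 1, 0)

# With a 1000-page LRU, a refault distance of 300 keeps full refs,
# while a distance of 800 is penalised by one.
print(refs_on_anon_refault(3, 300, 1000))
print(refs_on_anon_refault(3, 800, 1000))
```

This effectively splits refaulted pages into a "recent" half and an "old" half of the LRU, which matches the stated goal of at least two gens for refaulted pages.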
>
> There are some other issues that the refault distance series is trying
> to solve too, e.g. if a user agent forces MGLRU to age periodically for
> proactive memory reclaim, or MGLRU simply ages fast, min_seq will grow
> periodically and PID won't catch enough feedback using the previous
> logic.

Thanks. So far I've been taking shots in the dark since I haven't been
able to reproduce your results on bare metal or VMs. So, either the
benchmark itself is not reliable, which according to your results is
unlikely, or I've been using different hardware configurations. Do you
think you can share some off-the-shelf hardware configuration that I
can buy and use to reliably reproduce your results? Ideally we would use
the exact same model from, for example, Dell, HP or Lenovo.
