linux-kernel - Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Message-ID: <CAMgjq7CWRH0sJLCTU80c44gG9z=ff75u_S104iot0wNv_ModdQ@mail.gmail.com>
Date: Fri, 12 Jan 2024 02:24:28 +0800
From: Kairui Song <ryncsn@...il.com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Charan Teja Kalla <quic_charante@...cinc.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, stable@...r.kernel.org
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

On Thu, Jan 11, 2024 at 15:02, Yu Zhao <yuzhao@...gle.com> wrote:
> Could you try the attached patch on the mainline v6.7 and see how it
> compares with the results above? Thanks.

Hi Yu,

Thanks for the patch. It helped to some degree, but it was not as effective.
On that exclusive bare-metal machine, I redid the setup, rebased onto 6.7
mainline, and reran the test:

Refault distance series:
==================================================================
Execution Results after 901 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     4224            27030724835.9   0.16 txn/s
------------------------------------------------------------------
  TOTAL           4224            27030724835.9   0.16 txn/s

workingset_nodes 111349
workingset_refault_anon 261331
workingset_refault_file 42862224
workingset_activate_anon 0
workingset_activate_file 13803763
workingset_restore_anon 250743
workingset_restore_file 599031
workingset_nodereclaim 23708

memcg    67 /machine.slice/libpod-edbf5a3cb2574c60180c1fb5ddb2fb160df00bcee3758b7649f2b31baa97ed78.scope/container
 node     0
         10     347163     518379      207449
                     0          0r          2e          0p      33017r    1726749e          0p
                     1          0r          0e          0p       7278r     496268e          0p
                     2          0r          0e          0p      19789r      55418e          0p
                     3          0r          0e          0p          0r          0e    4747801p
                                0           0           0           0           0           0
         11     283279     154400     4791558
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         12     158723     431513       37647
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         13      44775     104986       27258
                     0        576R        982T          0     2488768R    5769505T          0
                     1          0R          0T          0     2335910R    3357277T          0
                     2          0R          0T          0      647398R     753021T          0
                     3          0R         20T          0       52725R    4740516T          0
                          2819476L      31196O    2551928Y       8298N       5549F       5329A

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             12.81       546.32        39.04         0.00     520178      37171          0
dm-1              0.05         1.10         0.00         0.00       1044          0          0
nvme0n1          13.17       561.99        41.19         0.00     535103      39219          0
nvme1n1        5220.39    227385.96      1028.17         0.00  216505545     978976          0
zram0          2440.61      2856.32      6907.13         0.00    2719644    6576628          0

               total        used        free      shared  buff/cache   available
Mem:           31830       11251         332           0       20246       20144
Swap:          31829        3761       28068

Your attachment:
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     4070            27170023578.4   0.15 txn/s
------------------------------------------------------------------
  TOTAL           4070            27170023578.4   0.15 txn/s

workingset_nodes 121864
workingset_refault_anon 430917
workingset_refault_file 42915675
workingset_activate_anon 100194
workingset_activate_file 21619480
workingset_restore_anon 100194
workingset_restore_file 165054
workingset_nodereclaim 26851

memcg    65 /machine.slice/libpod-c6d8c5fedb9b390ec7f1db7d0d7c57d6a284a94e74a3923d93ea0ce4e4ffdf28.scope/container
 node     0
          8     418689      55033      106862
                     0         16r         17e          0p    2789768r    6034831e          0p
                     1          0r          0e          0p     239664r     490278e          0p
                     2          0r          0e          0p      79145r     126408e          0p
                     3         23r         23e          0p      23404r      27107e    4736933p
                                0           0           0           0           0           0
          9     322798     237713     4759110
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         10     182729     942701        5348
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         11     120287        560         375
                     0      25187R      29324T          0     1679308R    4256147T          0
                     1          0R          0T          0      153592R     364122T          0
                     2          0R          0T          0       51825R      98646T          0
                     3        101R       2944T          0       13985R    4743515T          0
                          7702245L     865749O    6514831Y      16843N      15088F      14167A

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0             11.49       489.97        41.80         0.00     488006      41633          0
dm-1              0.05         1.05         0.00         0.00       1044          0          0
nvme0n1          11.83       504.95        43.86         0.00     502932      43682          0
nvme0n1        5145.44    218803.29       984.46         0.00  217928081     980520          0
zram0          3164.11      4399.55      8257.84         0.00    4381952    8224812          0

               total        used        free      shared  buff/cache   available
Mem:           31830       11583         310           1       19935       19809
Swap:          31829        3710       28119

The refault distance series still has better performance and lower total IO.

Similar result on that VM:
==================================================================
Execution Results after 907 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     1667            27151581934.5   0.06 txn/s
------------------------------------------------------------------
  TOTAL           1667            27151581934.5   0.06 txn/s

The refault distance series had about 2500 - 2600 txns, while mainline
6.7 had about 800 - 900 txns.

Loop test so far:
Using the refault distance series (previous result, it doesn't change much anyway):
  STOCK_LEVEL     2605            27120667462.8   0.10 txn/s
  STOCK_LEVEL     3000            27106854857.2   0.11 txn/s
  STOCK_LEVEL     2925            27066601064.4   0.11 txn/s
  STOCK_LEVEL     2757            27035248005.2   0.10 txn/s
  STOCK_LEVEL     1325            28053716046.8   0.05 txn/s
  STOCK_LEVEL     717             27455091366.3   0.03 txn/s
  STOCK_LEVEL     967             27404085208.2   0.04 txn/s
Refault stat here:
workingset_refault_anon 109337
workingset_refault_file 191249716

Using the attached patch:
STOCK_LEVEL     1667            27151581934.5   0.06 txn/s
STOCK_LEVEL     2999            27085125092.3   0.11 txn/s
STOCK_LEVEL     2874            27120635371.2   0.11 txn/s
STOCK_LEVEL     2658            27139142413.9   0.10 txn/s
STOCK_LEVEL     1254            27526009063.7   0.05 txn/s
STOCK_LEVEL     993             28065506801.8   0.04 txn/s
STOCK_LEVEL     954             27226012906.3   0.04 txn/s
Refault stat here:
workingset_refault_anon 383579
workingset_refault_file 205493832

The peak performance is almost equal, but it still starts slow, and refault
is higher too. File refault counts might be skewed by some IO layer
issue, but anon refault counts are always accurate.
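
To put the loop-test numbers above side by side, here is a small Python
sketch. It only redoes the arithmetic on the figures quoted in this mail;
the variable names are mine and nothing here is newly measured.

```python
# Executed STOCK_LEVEL transactions per loop iteration, copied from the
# two result lists above.
refault_distance_txns = [2605, 3000, 2925, 2757, 1325, 717, 967]
attached_patch_txns = [1667, 2999, 2874, 2658, 1254, 993, 954]

# workingset_refault_* counters reported after each loop test.
refault_distance_stat = {"anon": 109337, "file": 191249716}
attached_patch_stat = {"anon": 383579, "file": 205493832}

# "Starts slow": difference in the first iteration.
first_run_gap = refault_distance_txns[0] - attached_patch_txns[0]

# "Peak performance almost equal": difference between the best iterations.
peak_gap = max(refault_distance_txns) - max(attached_patch_txns)

# How many times more refaults the attached patch saw than the series.
refault_ratio = {
    kind: attached_patch_stat[kind] / refault_distance_stat[kind]
    for kind in ("anon", "file")
}

print(f"first iteration gap: {first_run_gap} txns")
print(f"peak gap: {peak_gap} txns")
for kind, value in refault_ratio.items():
    print(f"{kind} refault, patch / series: {value:.2f}x")
```

With these figures the first iteration differs by 938 transactions while
the peaks differ by 1, and the anon refault count is roughly 3.5 times
higher with the attached patch.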

I see the improvements you made in the attached patch. I think they
actually do not conflict with the refault distance series. Maybe the two
can be combined for an even better result.

Refault distance (which was originally used by the active/inactive LRU) is
used here to give evicted pages priorities based on eviction distance
and to add extra feedback to PID and gen. The PID info recorded in the
page flags/shadow, on the other hand, represents a page's access pattern
before eviction, and all the checks and logic around it can also be improved.

One critical effect of the refault distance series that boosts the
MongoDB startup (and I haven't seen any negative effect of it on other
tests / workloads / benchmarks yet, except the overhead of the memcg
statistics itself) is that it prevents overprotecting of tier 0 pages: that
is, a tier 0 page that was evicted but refaulted very quickly (refault
distance < LRU / MAX_NR_GEN; this value may be worth some more adjustment,
but with LRU / MAX_NR_GEN it can be imagined as having a small shadow gen
holding these page shadows...) will be categorised as tier 1 and get
protected. Otherwise, if I got everything right, when most pages are stuck
in tier 0 and keep refaulting, tier 0 will have a very high refault rate
and no pages will be protected until randomness causes quick repeated
reads of some pages, so they get promoted to tier 3 and get protected.
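
The rule described above can be sketched roughly as follows. This is only
an illustration of the idea as stated in this mail, not the code from the
series: `tier_after_refault`, `refault_distance` and `lru_size` are
hypothetical names, and MAX_NR_GENS = 4 is assumed from mainline.

```python
MAX_NR_GENS = 4  # assumption: mainline value; the text above writes MAX_NR_GEN


def tier_after_refault(tier_before_eviction, refault_distance, lru_size):
    """Return the tier a refaulted page is treated as (illustrative only).

    A tier 0 page that comes back within LRU / MAX_NR_GEN of its eviction
    is counted as tier 1, so it can be protected instead of only adding
    to the tier 0 refault rate.
    """
    threshold = lru_size // MAX_NR_GENS
    if tier_before_eviction == 0 and refault_distance < threshold:
        return 1
    return tier_before_eviction


# A tier 0 page refaulting after 10 evictions on a 400-page LRU is bumped.
print(tier_after_refault(0, 10, 400))
# The same page refaulting after 100 evictions stays in tier 0.
print(tier_after_refault(0, 100, 400))
```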

Now min_seq contains lower tier pages, and new pages will be added to
min_seq too, so min_seq will stay for a long time, while min_seq + 1
holds the protected full-ref tier 3 pages, which stay long enough to get
promoted to tier 3 again, so they will always be kept in memory.
Now MongoDB will perform well even without the refault distance series,
but reaching this state may take a long time (~15 min for the MongoDB test
on a SATA SSD, which is based on a real workload), long enough to cause
real issues.

And this also means PID won't react to workload change fast enough.

Also, the anon refault's refs value is adjusted by refault distance
in the series. It tries to split the whole LRU into at least two gens
for refaulted pages: only pages with refault distance < LRU /
MIN_NR_GEN will have full refs set; the others will have refs - 1 set as
a penalty for pages that were evicted and unused for a long time, which is
in line with LRU's nature. This actually seems to have decreased the
refault of anon pages.
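
As a rough sketch of that adjustment, under the assumption that
MIN_NR_GENS = 2 as in mainline: the function and parameter names are
hypothetical, and clamping at zero is my own guess rather than something
stated here.

```python
MIN_NR_GENS = 2  # assumption: mainline value; the text above writes MIN_NR_GEN


def refs_after_anon_refault(refs, refault_distance, lru_size):
    """Return the refs value restored for a refaulted anon page (sketch).

    Pages that refault within LRU / MIN_NR_GEN keep their full refs;
    pages that were evicted for longer get refs - 1 as a penalty.
    """
    if refault_distance < lru_size // MIN_NR_GENS:
        return refs
    # Clamp at zero: an assumption, not taken from the series.
    return max(refs - 1, 0)


print(refs_after_anon_refault(3, 100, 400))  # recent refault: full refs
print(refs_after_anon_refault(3, 300, 400))  # distant refault: penalised
```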

There are some other issues that the refault distance series is trying to
solve too, e.g. if there is a user agent forcing MGLRU to age
periodically for proactive memory reclaim, or MGLRU simply ages fast,
min_seq will grow periodically and PID won't catch enough feedback
using the previous logic.