linux-kernel - Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Message-ID: <CAMgjq7CRf4iEKuW2qWKzbhssMbixBo3UoLPqsSk4b+Tvw8at8A@mail.gmail.com>
Date:   Fri, 15 Dec 2023 02:37:57 +0800
From:   Kairui Song <ryncsn@...il.com>
To:     Yu Zhao <yuzhao@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org,
        Charan Teja Kalla <quic_charante@...cinc.com>,
        Kalesh Singh <kaleshsingh@...gle.com>, stable@...r.kernel.org
Subject: Re: [PATCH mm-unstable v1 1/4] mm/mglru: fix underprotected page cache

Yu Zhao <yuzhao@...gle.com> wrote on Thu, Dec 14, 2023 at 11:09:
> On Wed, Dec 13, 2023 at 12:59:14AM -0700, Yu Zhao wrote:
> > On Tue, Dec 12, 2023 at 8:03 PM Kairui Song <ryncsn@...il.com> wrote:
> > >
> > > Kairui Song <ryncsn@...il.com> wrote on Tue, Dec 12, 2023 at 14:52:
> > > >
> > > > Yu Zhao <yuzhao@...gle.com> wrote on Tue, Dec 12, 2023 at 06:07:
> > > > >
> > > > > On Fri, Dec 8, 2023 at 1:24 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > > >
> > > > > > Yu Zhao <yuzhao@...gle.com> wrote on Fri, Dec 8, 2023 at 14:14:
> > > > > > >
> > > > > > > Unmapped folios accessed through file descriptors can be
> > > > > > > underprotected. Those folios are added to the oldest generation based
> > > > > > > on:
> > > > > > > 1. The fact that they are less costly to reclaim (no need to walk the
> > > > > > >    rmap and flush the TLB) and have less impact on performance (don't
> > > > > > >    cause major PFs and can be non-blocking if needed again).
> > > > > > > 2. The observation that they are likely to be single-use. E.g., for
> > > > > > >    client use cases like Android, its apps parse configuration files
> > > > > > >    and store the data in heap (anon); for server use cases like MySQL,
> > > > > > >    it reads from InnoDB files and holds the cached data for tables in
> > > > > > >    buffer pools (anon).
> > > > > > >
> > > > > > > However, the oldest generation can be very short lived, and if so, it
> > > > > > > doesn't provide the PID controller with enough time to respond to a
> > > > > > > surge of refaults. (Note that the PID controller uses weighted
> > > > > > > refaults and those from evicted generations only take a half of the
> > > > > > > whole weight.) In other words, for a short lived generation, the
> > > > > > > moving average smooths out the spike quickly.
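
The smoothing effect described above can be sketched with a toy model. This is not the kernel's implementation: it assumes that every time a generation is evicted, the carried-over refault count is averaged with the new generation's count, so older samples lose half of their weight at each step. The function name `carried_average` and the sample counts are hypothetical.

```python
def carried_average(samples):
    """Fold per-generation refault counts into a running average.

    Toy model (an assumption, not kernel code): each step halves the
    weight of everything seen so far.
    """
    avg = 0.0
    history = []
    for refaults in samples:
        avg = (avg + refaults) / 2
        history.append(avg)
    return history


# A spike of 1000 refaults followed by three quiet generations.
print(carried_average([1000, 0, 0, 0]))
```

With short-lived generations these halving steps follow each other quickly, which is why a single burst of refaults can fade before the controller reacts to it.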
> > > > > > >
> > > > > > > To fix the problem:
> > > > > > > 1. For folios that are already on LRU, if they can be beyond the
> > > > > > >    tracking range of tiers, i.e., five accesses through file
> > > > > > >    descriptors, move them to the second oldest generation to give them
> > > > > > >    more time to age. (Note that tiers are used by the PID controller
> > > > > > >    to statistically determine whether folios accessed multiple times
> > > > > > >    through file descriptors are worth protecting.)
> > > > > > > 2. When adding unmapped folios to LRU, adjust the placement of them so
> > > > > > >    that they are not too close to the tail. The effect of this is
> > > > > > >    similar to the above.
> > > > > > >
> > > > > > > On Android, launching 55 apps sequentially:
> > > > > > >                            Before     After      Change
> > > > > > >   workingset_refault_anon  25641024   25598972   0%
> > > > > > >   workingset_refault_file  115016834  106178438  -8%
> > > > > >
> > > > > > Hi Yu,
> > > > > >
> > > > > > Thank you for your amazing work on MGLRU.
> > > > > >
> > > > > > I believe this is similar to the issue I was trying to resolve previously:
> > > > > > https://lwn.net/Articles/945266/
> > > > > > The idea is to use refault distance to decide if the page should be
> > > > > > placed in the oldest generation or some other gen, which per my test,
> > > > > > worked very well, and we have been using refault distance for MGLRU in
> > > > > > multiple workloads.
> > > > > >
> > > > > > There are a few issues left in my previous RFC series, like anon pages
> > > > > > in MGLRU shouldn't be considered. I wanted to collect feedback or test
> > > > > > cases, but unfortunately it didn't seem to get much attention
> > > > > > upstream.
> > > > > >
> > > > > > I think both this patch and my previous series aim to solve the
> > > > > > underprotected file pages issue. I did a quick test using this
> > > > > > series, and for the mongodb test, refault distance still seems the better
> > > > > > solution (I'm not saying these two optimizations are mutually exclusive
> > > > > > though, just that they have some conflicts in implementation and solve
> > > > > > a similar problem):
> > > > > >
> > > > > > Previous result:
> > > > > > ==================================================================
> > > > > > Execution Results after 905 seconds
> > > > > > ------------------------------------------------------------------
> > > > > >                   Executed        Time (µs)       Rate
> > > > > >   STOCK_LEVEL     2542            27121571486.2   0.09 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > >   TOTAL           2542            27121571486.2   0.09 txn/s
> > > > > >
> > > > > > This patch:
> > > > > > ==================================================================
> > > > > > Execution Results after 900 seconds
> > > > > > ------------------------------------------------------------------
> > > > > >                   Executed        Time (µs)       Rate
> > > > > >   STOCK_LEVEL     1594            27061522574.4   0.06 txn/s
> > > > > > ------------------------------------------------------------------
> > > > > >   TOTAL           1594            27061522574.4   0.06 txn/s
> > > > > >
> > > > > > The unpatched version is always around 500.
> > > > >
> > > > > Thanks for the test results!
> > > > >
> > > > > > I think there are a few points here:
> > > > > > - Refault distance makes use of page shadows, so it can better
> > > > > > distinguish evicted pages with different access patterns (re-access
> > > > > > distance).
> > > > > > - Throttled refault distance can help hold part of the workingset when
> > > > > > memory is too small to hold the whole workingset.
> > > > > >
> > > > > > So maybe part of this patch and bits of the previous series can be
> > > > > > combined to work better on this issue. What do you think?
> > > > >
> > > > > I'll try to find some time this week to look at your RFC. It'd be a
> > >
> > > Hi Yu,
> > >
> > > I'm working on V4 of the RFC now, which just updates some comments and
> > > skips anon page re-activation in the refault path for MGLRU, which was not
> > > very helpful; only some tiny adjustments.
> > > And I found it easier to test with fio, using the following test script:
> > >
> > > #!/bin/bash
> > > swapoff -a
> > >
> > > modprobe brd rd_nr=1 rd_size=16777216
> > > mkfs.ext4 /dev/ram0
> > > mount /dev/ram0 /mnt
> > >
> > > mkdir -p /sys/fs/cgroup/benchmark
> > > cd /sys/fs/cgroup/benchmark
> > >
> > > echo 4G > memory.max
> > > echo $$ > cgroup.procs
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > fio -name=mglru --numjobs=12 --directory=/mnt --size=1024m \
> > >           --buffered=1 --ioengine=io_uring --iodepth=128 \
> > >           --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
> > >           --rw=randread --random_distribution=zipf:0.5 --norandommap \
> > >           --time_based --ramp_time=5m --runtime=5m --group_reporting
> > >
> > > zipf:0.5 is used here to simulate a cached read with slight bias
> > > towards certain pages.
> > > Unpatched 6.7-rc4:
> > > Run status group 0 (all jobs):
> > >    READ: bw=6548MiB/s (6866MB/s), 6548MiB/s-6548MiB/s
> > > (6866MB/s-6866MB/s), io=1918GiB (2060GB), run=300001-300001msec
> > >
> > > Patched with RFC v4:
> > > Run status group 0 (all jobs):
> > >    READ: bw=7270MiB/s (7623MB/s), 7270MiB/s-7270MiB/s
> > > (7623MB/s-7623MB/s), io=2130GiB (2287GB), run=300001-300001msec
> > >
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > >    READ: bw=7098MiB/s (7442MB/s), 7098MiB/s-7098MiB/s
> > > (7442MB/s-7442MB/s), io=2079GiB (2233GB), run=300002-300002msec
> > >
> > > MGLRU off:
> > > Run status group 0 (all jobs):
> > >    READ: bw=6525MiB/s (6842MB/s), 6525MiB/s-6525MiB/s
> > > (6842MB/s-6842MB/s), io=1912GiB (2052GB), run=300002-300002msec
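
As a worked example of the arithmetic behind these zipf results, the sketch below computes each configuration's bandwidth relative to the unpatched 6.7-rc4 run. The MiB/s figures are copied from the fio summaries above; the helper name `pct_change` is an illustrative assumption.

```python
# Aggregate READ bandwidth in MiB/s, copied from the zipf:0.5 runs above.
zipf_bw = {
    "unpatched 6.7-rc4": 6548,
    "RFC v4": 7270,
    "this series": 7098,
    "MGLRU off": 6525,
}


def pct_change(value, baseline):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100


baseline = zipf_bw["unpatched 6.7-rc4"]
for name, bw in zipf_bw.items():
    print(f"{name:>18}: {bw} MiB/s ({pct_change(bw, baseline):+.1f}%)")
```

On these numbers RFC v4 comes out roughly 11% ahead of the unpatched kernel and this series roughly 8% ahead, while turning MGLRU off lands within half a percent of the baseline.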
> > >
> > > - If I change zipf:0.5 to random:
> > > Unpatched 6.7-rc4:
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > >    READ: bw=5975MiB/s (6265MB/s), 5975MiB/s-5975MiB/s
> > > (6265MB/s-6265MB/s), io=1750GiB (1879GB), run=300002-300002msec
> > >
> > > Patched with RFC v4:
> > > Run status group 0 (all jobs):
> > >    READ: bw=5987MiB/s (6278MB/s), 5987MiB/s-5987MiB/s
> > > (6278MB/s-6278MB/s), io=1754GiB (1883GB), run=300001-300001msec
> > >
> > > Patched with this series:
> > > Run status group 0 (all jobs):
> > >    READ: bw=5839MiB/s (6123MB/s), 5839MiB/s-5839MiB/s
> > > (6123MB/s-6123MB/s), io=1711GiB (1837GB), run=300001-300001msec
> > >
> > > MGLRU off:
> > > Run status group 0 (all jobs):
> > >    READ: bw=5689MiB/s (5965MB/s), 5689MiB/s-5689MiB/s
> > > (5965MB/s-5965MB/s), io=1667GiB (1790GB), run=300003-300003msec
> > >
> > > fio uses a ramdisk, so LRU accuracy has a smaller impact. The MongoDB
> > > test I provided before uses a SATA SSD, so it has a much higher
> > > impact. I'll provide a script to set up the test case and run it. It's
> > > more complex to set up than fio since it involves setting up multiple
> > > replicas, auth, and hundreds of GB of test fixtures. I'm currently
> > > occupied by some other tasks but will try my best to send it out as
> > > soon as possible.
> >
> > Thanks! Apparently your RFC did show better IOPS with both access
> > patterns, which was a surprise to me because it had higher refaults
> > and usually higher refaults result in worse performance.
> >
> > So I'm still trying to figure out why it turned out the opposite. My
> > current guess is that:
> > 1. It had a very small but stable inactive LRU list, which was able to
> > fit into the L3 cache entirely.
> > 2. It counted few folios as workingset and therefore incurred less
> > overhead from CONFIG_PSI and/or CONFIG_TASK_DELAY_ACCT.
> >
> > Did you save workingset_refault_file when you ran the test? If so, can
> > you check the difference between this series and your RFC?
>
>
> It seems I was right about #1 above. After I scaled your test up by 20x,
> I saw my series performed ~5% faster with zipf and ~9% faster with random
> accesses.

Hi Yu,

Thank you so much for testing and sharing this result.

I'm not sure about #1: the ramdisk size and the accessed data are far larger
than L3 (16M on my CPU) even in the downscaled test, and both random and zipf
show similar results.

>
> IOW, I changed rd_size from 16GB to 320GB, memory.max from 4GB to 80GB,
> --numjobs from 12 to 60 and --size from 1GB to 4GB.
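
The "20x" scaling can be checked with a little arithmetic. The sketch below assumes that each fio job reads its own file of `--size`, so the total data set is `numjobs * size`; the dictionary keys and helper names are illustrative only.

```python
# Parameters quoted in the thread, sizes in GB (numjobs is a plain count).
original = {"rd_size": 16, "memory_max": 4, "numjobs": 12, "size": 1}
scaled = {"rd_size": 320, "memory_max": 80, "numjobs": 60, "size": 4}


def dataset_gb(cfg):
    """Total file data, assuming one file of `size` per job."""
    return cfg["numjobs"] * cfg["size"]


def scale_factors(before, after):
    """Growth factor for the ramdisk, the memcg limit and the data set."""
    return {
        "rd_size": after["rd_size"] / before["rd_size"],
        "memory_max": after["memory_max"] / before["memory_max"],
        "dataset": dataset_gb(after) / dataset_gb(before),
    }


print(scale_factors(original, scaled))
print(dataset_gb(original) / original["memory_max"],
      dataset_gb(scaled) / scaled["memory_max"])
```

All three quantities grow by the same factor, so under that assumption the data set stays at three times memory.max in both setups.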
>
> v6.7-rc5 + this series
> ======================
>
> zipf
> ----
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12155: Wed Dec 13 17:50:36 2023
>   read: IOPS=5074k, BW=19.4GiB/s (20.8GB/s)(5807GiB/300007msec)
>     slat (usec): min=36, max=109326, avg=363.67, stdev=1829.97
>     clat (nsec): min=783, max=113292k, avg=1136755.10, stdev=3162056.05
>      lat (usec): min=37, max=149232, avg=1500.43, stdev=3644.21
>     clat percentiles (usec):
>      |  1.00th=[  490],  5.00th=[  519], 10.00th=[  537], 20.00th=[  553],
>      | 30.00th=[  570], 40.00th=[  586], 50.00th=[  627], 60.00th=[  840],
>      | 70.00th=[  988], 80.00th=[ 1074], 90.00th=[ 1188], 95.00th=[ 1336],
>      | 99.00th=[ 7308], 99.50th=[31327], 99.90th=[36963], 99.95th=[45351],
>      | 99.99th=[53216]
>    bw (  MiB/s): min= 8332, max=27116, per=100.00%, avg=19846.67, stdev=58.20, samples=35903
>    iops        : min=2133165, max=6941826, avg=5080741.79, stdev=14899.13, samples=35903
>   lat (nsec)   : 1000=0.01%
>   lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
>   lat (usec)   : 250=0.01%, 500=1.76%, 750=52.94%, 1000=16.65%
>   lat (msec)   : 2=26.22%, 4=0.15%, 10=1.36%, 20=0.01%, 50=0.90%
>   lat (msec)   : 100=0.02%, 250=0.01%
>   cpu          : usr=5.42%, sys=87.59%, ctx=470315, majf=0, minf=2184
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=100.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
>      issued rwts: total=1522384845,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
>    READ: bw=19.4GiB/s (20.8GB/s), 19.4GiB/s-19.4GiB/s (20.8GB/s-20.8GB/s), io=5807GiB (6236GB), run=300007-300007msec
>
> Disk stats (read/write):
>   ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128
>
> random
> ------
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12576: Wed Dec 13 18:00:50 2023
>   read: IOPS=3853k, BW=14.7GiB/s (15.8GB/s)(4410GiB/300014msec)
>     slat (usec): min=58, max=118605, avg=486.45, stdev=2311.45
>     clat (usec): min=3, max=169810, avg=1496.60, stdev=3982.89
>      lat (usec): min=73, max=170019, avg=1983.06, stdev=4585.87
>     clat percentiles (usec):
>      |  1.00th=[  586],  5.00th=[  627], 10.00th=[  644], 20.00th=[  668],
>      | 30.00th=[  693], 40.00th=[  725], 50.00th=[  816], 60.00th=[ 1123],
>      | 70.00th=[ 1221], 80.00th=[ 1352], 90.00th=[ 1516], 95.00th=[ 1713],
>      | 99.00th=[31851], 99.50th=[34866], 99.90th=[41681], 99.95th=[54264],
>      | 99.99th=[61080]
>    bw (  MiB/s): min= 6049, max=21328, per=100.00%, avg=15070.00, stdev=45.96, samples=35940
>    iops        : min=1548543, max=5459997, avg=3857912.87, stdev=11765.30, samples=35940
>   lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
>   lat (usec)   : 500=0.01%, 750=44.64%, 1000=8.20%
>   lat (msec)   : 2=43.84%, 4=0.27%, 10=1.79%, 20=0.01%, 50=1.20%
>   lat (msec)   : 100=0.07%, 250=0.01%
>   cpu          : usr=3.19%, sys=89.87%, ctx=463840, majf=0, minf=2248
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
>      issued rwts: total=1155923744,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
>    READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-15.8GB/s), io=4410GiB (4735GB), run=300014-300014msec
>
> Disk stats (read/write):
>   ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> memcg     3 /zipf
>  node     0
>           0    1521654          0           0x
>                      0          0r          0e          0p          0           0           0
>                      1          0r          0e          0p          0           0           0
>                      2          0r          0e          0p          0           0           0
>                      3          0r          0e          0p          0           0           0
>                                 0           0           0           0           0           0
>           1    1521654          0          21
>                      0          0           0           0  1077016797r 1111542014e          0p
>                      1          0           0           0   317997853r  324814007e          0p
>                      2          0           0           0    68064253r   68866308e     124302p
>                      3          0           0           0           0r          0e   12282816p
>                                 0           0           0           0           0           0
>           2    1521654          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3    1521654          0           0
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
>  node     1
>           0    1521654          0           0
>                      0          0r          0e          0p          0r          0e          0p
>                      1          0r          0e          0p          0r          0e          0p
>                      2          0r          0e          0p          0r          0e          0p
>                      3          0r          0e          0p          0r          0e          0p
>                                 0           0           0           0           0           0
>           1    1521654          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           2    1521654          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3    1521654          0           0
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
> memcg     4 /random
>  node     0
>           0     600431          0           0x
>                      0          0r          0e          0p          0           0           0
>                      1          0r          0e          0p          0           0           0
>                      2          0r          0e          0p          0           0           0
>                      3          0r          0e          0p          0           0           0
>                                 0           0           0           0           0           0
>           1     600431          0    11169201
>                      0          0           0           0  1071724785r 1103937007e          0p
>                      1          0           0           0   376193810r  384852629e          0p
>                      2          0           0           0    77315518r   78596395e          0p
>                      3          0           0           0           0r          0e    9593442p
>                                 0           0           0           0           0           0
>           2     600431          1     9593442
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3     600431         36         754
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
>  node     1
>           0     600431          0           0
>                      0          0r          0e          0p          0r          0e          0p
>                      1          0r          0e          0p          0r          0e          0p
>                      2          0r          0e          0p          0r          0e          0p
>                      3          0r          0e          0p          0r          0e          0p
>                                 0           0           0           0           0           0
>           1     600431          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           2     600431          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3     600431          0           0
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
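
To compare refault behavior between runs, the per-tier counters in a dump like the one above can be totalled with a short script. This sketch assumes the `r`, `e` and `p` suffixes mark refaulted, evicted and protected counts for each tier; the helper name `sum_counters` is hypothetical, and the sample rows are the non-zero tier lines of node 0 in the /zipf memcg above, with whitespace trimmed.

```python
import re

# A counter is a number immediately followed by r, e or p.
TOKEN = re.compile(r"\b(\d+)([rep])\b")


def sum_counters(lines):
    """Total the r/e/p counters found in the given dump lines."""
    totals = {"r": 0, "e": 0, "p": 0}
    for line in lines:
        for value, suffix in TOKEN.findall(line):
            totals[suffix] += int(value)
    return totals


zipf_rows = [
    "> 0  0  0  0  1077016797r 1111542014e          0p",
    "> 1  0  0  0   317997853r  324814007e          0p",
    "> 2  0  0  0    68064253r   68866308e     124302p",
    "> 3  0  0  0           0r          0e   12282816p",
]

totals = sum_counters(zipf_rows)
print(totals)
print(f"refaulted/evicted = {totals['r'] / totals['e']:.3f}")
```

Running the same function over the corresponding rows of another dump gives totals that can be compared directly.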
>
> v6.7-rc5 + RFC v3
> =================
>
> zipf
> ----
>
> mglru: (groupid=0, jobs=60): err= 0: pid=11600: Wed Dec 13 18:34:31 2023
>   read: IOPS=4816k, BW=18.4GiB/s (19.7GB/s)(5512GiB/300014msec)
>     slat (usec): min=3, max=121722, avg=384.46, stdev=2066.10
>     clat (nsec): min=356, max=174717k, avg=1197513.60, stdev=3568734.58
>      lat (usec): min=3, max=174919, avg=1581.97, stdev=4112.49
>     clat percentiles (usec):
>      |  1.00th=[  486],  5.00th=[  515], 10.00th=[  529], 20.00th=[  553],
>      | 30.00th=[  570], 40.00th=[  594], 50.00th=[  652], 60.00th=[  898],
>      | 70.00th=[  988], 80.00th=[ 1139], 90.00th=[ 1254], 95.00th=[ 1369],
>      | 99.00th=[ 6915], 99.50th=[35914], 99.90th=[42206], 99.95th=[52167],
>      | 99.99th=[61604]
>    bw (  MiB/s): min= 7716, max=26325, per=100.00%, avg=18836.65, stdev=57.20, samples=35880
>    iops        : min=1975306, max=6739280, avg=4822176.85, stdev=14642.35, samples=35880
>   lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
>   lat (usec)   : 500=2.57%, 750=50.99%, 1000=17.56%
>   lat (msec)   : 2=26.41%, 4=0.16%, 10=1.41%, 20=0.01%, 50=0.84%
>   lat (msec)   : 100=0.05%, 250=0.01%
>   cpu          : usr=4.95%, sys=88.09%, ctx=457609, majf=0, minf=2184
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
>      issued rwts: total=1445015808,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
>    READ: bw=18.4GiB/s (19.7GB/s), 18.4GiB/s-18.4GiB/s (19.7GB/s-19.7GB/s), io=5512GiB (5919GB), run=300014-300014msec
>
> Disk stats (read/write):
>   ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> mglru: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=128
>
> random
> ------
>
> mglru: (groupid=0, jobs=60): err= 0: pid=12024: Wed Dec 13 18:44:45 2023
>   read: IOPS=3519k, BW=13.4GiB/s (14.4GB/s)(4027GiB/300011msec)
>     slat (usec): min=54, max=136278, avg=534.57, stdev=2738.72
>     clat (usec): min=3, max=176186, avg=1638.66, stdev=4714.55
>      lat (usec): min=78, max=176426, avg=2173.23, stdev=5426.40
>     clat percentiles (usec):
>      |  1.00th=[  627],  5.00th=[  676], 10.00th=[  693], 20.00th=[  725],
>      | 30.00th=[  766], 40.00th=[  816], 50.00th=[ 1090], 60.00th=[ 1205],
>      | 70.00th=[ 1270], 80.00th=[ 1369], 90.00th=[ 1500], 95.00th=[ 1614],
>      | 99.00th=[38536], 99.50th=[41681], 99.90th=[47973], 99.95th=[65799],
>      | 99.99th=[72877]
>    bw (  MiB/s): min= 5586, max=20476, per=100.00%, avg=13760.26, stdev=45.33, samples=35904
>    iops        : min=1430070, max=5242110, avg=3522621.15, stdev=11604.46, samples=35904
>   lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=0.01%, 250=0.01%
>   lat (usec)   : 500=0.01%, 750=26.33%, 1000=21.81%
>   lat (msec)   : 2=48.54%, 4=0.16%, 10=1.91%, 20=0.01%, 50=1.17%
>   lat (msec)   : 100=0.09%, 250=0.01%
>   cpu          : usr=2.74%, sys=90.35%, ctx=481356, majf=0, minf=2244
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
>      issued rwts: total=1055590880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
>    READ: bw=13.4GiB/s (14.4GB/s), 13.4GiB/s-13.4GiB/s (14.4GB/s-14.4GB/s), io=4027GiB (4324GB), run=300011-300011msec
>
> Disk stats (read/write):
>   ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
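
The "~5% faster with zipf and ~9% faster with random" figures can be reproduced from the fio `read:` summary lines of the four runs above. The sketch below is illustrative: `parse_iops` and `speedup_pct` are hypothetical helpers, and the handling of a `k` or `M` suffix is an assumption about how fio abbreviates IOPS.

```python
import re

IOPS = re.compile(r"IOPS=([\d.]+)([kM]?)")
SCALE = {"": 1, "k": 1_000, "M": 1_000_000}


def parse_iops(line):
    """Extract the IOPS value from a fio 'read:' summary line."""
    match = IOPS.search(line)
    if match is None:
        raise ValueError(f"no IOPS field in: {line!r}")
    return float(match.group(1)) * SCALE[match.group(2)]


def speedup_pct(new, old):
    """How much faster `new` is than `old`, in percent."""
    return (new / old - 1) * 100


# (this series, RFC v3) summary lines copied from the runs above.
runs = {
    "zipf": (
        "read: IOPS=5074k, BW=19.4GiB/s (20.8GB/s)(5807GiB/300007msec)",
        "read: IOPS=4816k, BW=18.4GiB/s (19.7GB/s)(5512GiB/300014msec)",
    ),
    "random": (
        "read: IOPS=3853k, BW=14.7GiB/s (15.8GB/s)(4410GiB/300014msec)",
        "read: IOPS=3519k, BW=13.4GiB/s (14.4GB/s)(4027GiB/300011msec)",
    ),
}

for pattern, (series_line, rfc_line) in runs.items():
    gain = speedup_pct(parse_iops(series_line), parse_iops(rfc_line))
    print(f"{pattern}: this series is {gain:.1f}% faster than RFC v3")
```

This gives roughly 5.4% for zipf and 9.5% for random, in line with the rounded figures quoted earlier in the message.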
>
> memcg     3 /zipf
>  node     0
>           0    1522519          0          22
>                      0          0r          0e          0p  996363383r 1092111170e          0p
>                      1          0r          0e          0p  274581982r  235766575e          0p
>                      2          0r          0e          0p   85176438r   71356676e      96114p
>                      3          0r          0e          0p   12470364r   11510461e     221796p
>                                 0           0           0           0           0           0
>           1    1522519          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           2    1522519          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3    1522519          0           0
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
>  node     1
>           0    1522519          0           0
>                      0          0r          0e          0p          0r          0e          0p
>                      1          0r          0e          0p          0r          0e          0p
>                      2          0r          0e          0p          0r          0e          0p
>                      3          0r          0e          0p          0r          0e          0p
>                                 0           0           0           0           0           0
>           1    1522519          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           2    1522519          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3    1522519          0           0
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
> memcg     4 /random
>  node     0
>           0     600413          0     2289676
>                      0          0r          0e          0p  875605725r  960492874e          0p
>                      1          0r          0e          0p  411230731r  383704269e          0p
>                      2          0r          0e          0p  112639317r   97774351e          0p
>                      3          0r          0e          0p    2103334r    1766407e          0p
>                                 0           0           0           0           0           0
>           1     600413          1           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           2     600413          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3     600413         35    18466878
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A
>  node     1
>           0     600413          0           0
>                      0          0r          0e          0p          0r          0e          0p
>                      1          0r          0e          0p          0r          0e          0p
>                      2          0r          0e          0p          0r          0e          0p
>                      3          0r          0e          0p          0r          0e          0p
>                                 0           0           0           0           0           0
>           1     600413          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           2     600413          0           0
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>           3     600413          0           0
>                      0          0R          0T          0           0R          0T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R          0T          0
>                                 0L          0O          0Y          0N          0F          0A

And I reran the scaled-down zipf test:

RFC:
Jobs: 12 (f=12): [r(12)][100.0%][r=7267MiB/s][r=1860k IOPS][eta 00m:00s]
mglru: (groupid=0, jobs=12): err= 0: pid=5159: Thu Dec 14 23:57:01 2023
  read: IOPS=1862k, BW=7274MiB/s (7628MB/s)(2131GiB/300001msec)
    slat (usec): min=60, max=4711, avg=195.05, stdev=138.41
    clat (usec): min=2, max=5097, avg=619.70, stdev=215.90
     lat (usec): min=112, max=5271, avg=814.78, stdev=237.75
    clat percentiles (usec):
     |  1.00th=[  388],  5.00th=[  408], 10.00th=[  424], 20.00th=[  457],
     | 30.00th=[  482], 40.00th=[  502], 50.00th=[  523], 60.00th=[  545],
     | 70.00th=[  603], 80.00th=[  889], 90.00th=[  988], 95.00th=[ 1037],
     | 99.00th=[ 1106], 99.50th=[ 1139], 99.90th=[ 1237], 99.95th=[ 1369],
     | 99.99th=[ 1483]
   bw (  MiB/s): min= 6526, max= 8474, per=100.00%, avg=7284.26, stdev=48.62, samples=7176
   iops        : min=1670753, max=2169575, avg=1864770.39, stdev=12446.01, samples=7176
  lat (usec)   : 4=0.01%, 10=0.01%, 250=0.01%, 500=38.35%, 750=33.88%
  lat (usec)   : 1000=19.46%
  lat (msec)   : 2=8.30%, 4=0.01%, 10=0.01%
  cpu          : usr=8.62%, sys=91.24%, ctx=531703, majf=0, minf=700
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=558664800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=7274MiB/s (7628MB/s), 7274MiB/s-7274MiB/s (7628MB/s-7628MB/s), io=2131GiB (2288GB), run=300001-300001msec

workingset_refault_file 628192729

memcg    73 /benchmark
 node     0
          0    1092186          0           0x
                     0          0r          0e          0p          0           0           0
                     1          0r          0e          0p          0           0           0
                     2          0r          0e          0p          0           0           0
                     3          0r          0e          0p          0           0           0
                                0           0           0           0           0           0
          1    1092186          0        4283
                     0          0           0           0   507816078r  511714221e          0p
                     1          0           0           0     4682206r    3201136e          0p
                     2          0           0           0       64762r      43587e          0p
                     3          0           0           0           0r          0e          0p
                                0           0           0           0           0           0
          2    1092186          0           0
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
          3    1092186          0      750308
                     0          0R          0T          0    49689099R   52516254T          0
                     1          0R          0T          0     5786054R    5786054T          0
                     2          0R          0T          0     1140749R    1140749T          0
                     3          0R          0T          0           0R          0T          0
                                0L          0O          0Y          0N          0F          0A

This series:
Jobs: 12 (f=12): [r(12)][100.0%][r=6447MiB/s][r=1650k IOPS][eta 00m:00s]
mglru: (groupid=0, jobs=12): err= 0: pid=3665: Fri Dec 15 00:16:06 2023
  read: IOPS=1830k, BW=7148MiB/s (7495MB/s)(2094GiB/300001msec)
    slat (usec): min=59, max=35006, avg=198.58, stdev=201.99
    clat (nsec): min=972, max=37489k, avg=630651.61, stdev=384748.50
     lat (usec): min=108, max=39688, avg=829.26, stdev=461.06
    clat percentiles (usec):
     |  1.00th=[  355],  5.00th=[  379], 10.00th=[  392], 20.00th=[  424],
     | 30.00th=[  478], 40.00th=[  510], 50.00th=[  529], 60.00th=[  553],
     | 70.00th=[  635], 80.00th=[  898], 90.00th=[ 1012], 95.00th=[ 1090],
     | 99.00th=[ 1221], 99.50th=[ 1401], 99.90th=[ 2606], 99.95th=[ 3654],
     | 99.99th=[18220]
   bw (  MiB/s): min= 4870, max= 9145, per=100.00%, avg=7157.39, stdev=81.13, samples=7176
   iops        : min=1246811, max=2341342, avg=1832289.80, stdev=20768.76, samples=7176
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 4=0.01%, 10=0.01%, 250=0.01%, 500=36.53%, 750=36.20%
  lat (usec)   : 1000=15.90%
  lat (msec)   : 2=11.18%, 4=0.15%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=8.59%, sys=91.27%, ctx=512635, majf=0, minf=711
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=548956313,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=7148MiB/s (7495MB/s), 7148MiB/s-7148MiB/s (7495MB/s-7495MB/s), io=2094GiB (2249GB), run=300001-300001msec

workingset_refault_file 596790506

memcg    68 /benchmark
 node     0
        122     160248          0           0x
                     0          0r          0e          0p          0           0           0
                     1          0r          0e          0p          0           0           0
                     2          0r          0e          0p          0           0           0
                     3          0r          0e          0p          0           0           0
                                0           0           0           0           0           0
        123     155360          0      239405
                     0          0           0           0      301462r    1186271e          0p
                     1          0           0           0       80013r     218961e          0p
                     2          0           0           0           0r          0e     516139p
                     3          0           0           0           0r          0e          0p
                                0           0           0           0           0           0
        124     150495          0      516188
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
        125     145582          0        1345
                     0          0R          0T          0     2577270R    4518284T          0
                     1          0R          0T          0      290933R     369324T          0
                     2          0R          0T          0           0R     752170T          0
                     3          0R          0T          0           0R          0T          0
                           388483L      17226O      18419Y      95408N       1314F        578A

I think the problem might be that this series ages faster and so has
higher overhead in some cases. Your test is large scale, so MGLRU just
keeps reclaiming the last gen with no aging, and my RFC brings extra
overhead from workingset checking and memcg flushing (the memcg
flushing patch in the unstable tree may help?). Also, the current
refault distance checking model, which is simply glued onto MGLRU, does
not perform well enough. It has some known issues: the most obvious
one is that the refault distance check can't prevent file pages from
being underprotected at all when active is low or empty, and using
active/inactive is not accurate enough for MGLRU.
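
To illustrate the "active is low or empty" case, here is a minimal
sketch of the classic active/inactive refault distance test. It is not
code from the RFC or from this series. It assumes, as a simplification,
that the workingset size is just the number of active file pages (the
kernel's real check adds other terms depending on swap availability);
the function and parameter names are made up for the example.

```python
def refault_is_workingset(refault_distance: int, active_pages: int) -> bool:
    """Simplified classic test: treat a refaulting page as part of the
    workingset only if its refault distance fits within the active list.

    ASSUMPTION: workingset size == active file pages (simplified model).
    """
    return refault_distance <= active_pages


# With a large active list, a page refaulting at distance 5000 is protected.
print(refault_is_workingset(5000, active_pages=100_000))  # True

# With an empty active list, the same refault is treated as cold, so the
# check gives the page no protection at all.
print(refault_is_workingset(5000, active_pages=0))  # False
```

In this model an empty active list rejects every refault with a nonzero
distance, which is one way to read the underprotection concern.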

And for the MongoDB test, I still haven't had enough time to tidy up
the setup scripts and the modified repo, sorry about this. In the past
few days I only had time to check this issue late at night... but a
quick test shows interesting readings too:

RFC:
==================================================================
Execution Results after 902 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2544            27114484261.0   0.09 txn/s
------------------------------------------------------------------
  TOTAL           2544            27114484261.0   0.09 txn/s

workingset_refault_anon 10512
workingset_refault_file 22751782

memcg    44 /system.slice/docker-1313de5323016713a0efa95d3b3f1aeafc9f43df80051bd013f3d29f1e13fa58.scope
 node     0
         12     190714      41736      640699
                     0          0r          2e          0p          0r    1293703e          0p
                     1          0r          0e          0p          0r          0e     463477p
                     2          0r          0e          0p          0r          0e    5029378p
                     3          0r          0e          0p          0r          0e          0p
                                0           0           0           0           0           0
         13     139686     462351     5483828
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         14      86529     692892        3795
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         15      41548      47767         366
                     0         12R       1113T          0        3497R    1857252T          0
                     1          0R          0T          0     1000193R    1692818T          0
                     2          0R          0T          0           0R    5422505T          0
                     3          0R          0T          0           0R          0T          0
                          3889671L      42917O    3674613Y      11910N       7609F       7547A

This series:
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     1668            27108414456.6   0.06 txn/s
------------------------------------------------------------------
  TOTAL           1668            27108414456.6   0.06 txn/s

workingset_refault_anon 35277
workingset_refault_file 20335355

memcg    77 /system.slice/docker-731f3d33dca1dbea9d763a7a9519bb92c4ca1bbdb06c6a23d5203f8baad97f6e.scope
 node     0
         14     218191          0x          0x
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         15     170722       1923     6172558
                     0          0r          0e          0p          9r      29052e          0p
                     1          0r          0e          0p          0r      10643e          0p
                     2          0r          0e          0p          0r          0e       5714p
                     3          0r          0e          0p          0r          0e          0p
                                0           0           0           0           0           0
         16     127628    1223689       10249
                     0          0           0           0           0           0           0
                     1          0           0           0           0           0           0
                     2          0           0           0           0           0           0
                     3          0           0           0           0           0           0
                                0           0           0           0           0           0
         17      79949      40444         408
                     0       1413R       5628T          0      352479R    1259370T          0
                     1          0R          0T          0      252950R     439843T          0
                     2          0R          1T          0           0R    5083446T          0
                     3          0R          0T          0           0R          0T          0
                         18667726L     229222O   17641112Y      40116N      36473F      35963A
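
For a quick side-by-side of the two runs, the headline numbers from the
results above can be turned into relative changes. This is only a
convenience sketch: the values are copied from the fio and MongoDB
outputs in this mail, and the helper name and dictionary labels are
made up for the example.

```python
def rel_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (RFC, this series) pairs copied from the results above.
results = {
    "zipf read IOPS (k)": (1862, 1830),
    "zipf workingset_refault_file": (628192729, 596790506),
    "MongoDB STOCK_LEVEL txns executed": (2544, 1668),
    "MongoDB workingset_refault_file": (22751782, 20335355),
    "MongoDB workingset_refault_anon": (10512, 35277),
}

for name, (rfc, series) in results.items():
    # Negative means "this series" is lower than the RFC.
    print(f"{name}: {rel_change(rfc, series):+.1f}%")
```

Each result comes from a single run per configuration, so small
differences such as the zipf IOPS gap may be within noise.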

I've turned off all unrelated features (psi, delayacct) for the above
tests. When PSI is on, the MongoDB test shows 70 - 100 PSI SOME; it's
not using a very high performance disk.
I think this could suggest that sometimes evicting file pages is not
that costly. And the page shadow can store fine-grained data about a
page's access distance, so maybe I can tune the refault distance
checking model for MGLRU and combine it with this series, which may
help make the protection policy more balanced (not too fast, and still
accurate)?