linux-kernel - Re: [PATCH V2] mm/gup: Clear the LRU flag of a page before adding to LRU batch

Message-ID: <CAMgjq7DLGczt=_yWNe-CY=U8rW+RBrx+9VVi4AJU3HYr-BdLnQ@mail.gmail.com>
Date: Sun, 4 Aug 2024 20:21:42 +0800
From: Kairui Song <ryncsn@...il.com>
To: Ge Yang <yangge1116@....com>, Yu Zhao <yuzhao@...gle.com>, Chris Li <chrisl@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm <linux-mm@...ck.org>, 
	LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org, 
	Barry Song <21cnbao@...il.com>, David Hildenbrand <david@...hat.com>, baolin.wang@...ux.alibaba.com, 
	liuzixing@...on.cn, Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH V2] mm/gup: Clear the LRU flag of a page before adding to
 LRU batch

On Sun, Aug 4, 2024 at 4:03 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Sun, Aug 4, 2024 at 1:09 AM Yu Zhao <yuzhao@...gle.com> wrote:
> > On Sat, Aug 3, 2024 at 2:31 AM Ge Yang <yangge1116@....com> wrote:
> > > On 2024/8/3 4:18, Chris Li wrote:
> > > > On Thu, Aug 1, 2024 at 6:56 PM Ge Yang <yangge1116@....com> wrote:
> > > >>
> > > >>
> > > >>
> > > >>>> I can't reproduce this problem, using tmpfs to compile linux.
> > > >>>> Seems you limit the memory size used to compile linux, which leads to
> > > >>>> OOM. May I ask why the memory size is limited to 481280kB? Do I also
> > > >>>> need to limit the memory size to 481280kB to test?
> > > >>>
> > > >>> Yes, you need to limit the cgroup memory size to force the swap
> > > >>> action. I am using memory.max = 470M.
> > > >>>
> > > > >>> I believe other values e.g. 800M can trigger it as well. The reason to
> > > > >>> limit the memory is to cause the swap action.
> > > >>> The goal is to intentionally overwhelm the memory load and let the
> > > >>> swap system do its job. The 470M is chosen to cause a lot of swap
> > > >>> action but not too high to cause OOM kills in normal kernels.
> > > > >>> In other words, high enough swap pressure but not too high to bust
> > > >>> into OOM kill. e.g. I verify that, with your patch reverted, the
> > > >>> mm-stable kernel can sustain this level of swap pressure (470M)
> > > >>> without OOM kill.
> > > >>>
> > > > >>> I borrowed the 470M magic value from Hugh and verified it works with
> > > > >>> my test system. Hugh has a similar swap test setup which is more
> > > > >>> complicated than mine. It is the inspiration for my swap stress test
> > > > >>> setup.
> > > >>>
> > > >>> FYI, I am using "make -j32" on a machine with 12 cores (24
> > > >>> hyperthreading). My typical swap usage is about 3-5G. I set my
> > > >>> swapfile size to about 20G.
> > > >>> I am using zram or ssd as the swap backend.  Hope that helps you
> > > >>> reproduce the problem.
> > > >>>
> > > >> Hi Chris,
> > > >>
> > > > >> I tried to construct the experiment according to your suggestions above.
> > > >
> > > > Hi Ge,
> > > >
> > > > Sorry to hear that you were not able to reproduce it.
> > > >
> > > >> High swap pressure can be triggered, but OOM can't be reproduced. The
> > > >> specific steps are as follows:
> > > >> root@...ntu-server-2204:/home/yangge# cp workspace/linux/ /dev/shm/ -rf
> > > >
> > > > > I use a slightly different way to set up the tmpfs:
> > > > >
> > > > > Here is a section of my script:
> > > >
> > > >          if ! [ -d $tmpdir ]; then
> > > >                  sudo mkdir -p $tmpdir
> > > >                  sudo mount -t tmpfs -o size=100% nodev $tmpdir
> > > >          fi
> > > >
> > > >          sudo mkdir -p $cgroup
> > > >          sudo sh -c "echo $mem > $cgroup/memory.max" || echo setup
> > > > memory.max error
> > > >          sudo sh -c "echo 1 > $cgroup/memory.oom.group" || echo setup
> > > > oom.group error
> > > >
> > > > Per run:
> > > >
> > > >         # $workdir is under $tmpdir
> > > >          sudo rm -rf $workdir
> > > >          mkdir -p $workdir
> > > >          cd $workdir
> > > >          echo "Extracting linux tree"
> > > >          XZ_OPT='-T0 -9 –memory=75%' tar xJf $linux_src || die "xz
> > > > extract failed"
> > > >
> > > >          sudo sh -c "echo $BASHPID > $cgroup/cgroup.procs"
> > > >          echo "Cleaning linux tree, setup defconfig"
> > > >          cd $workdir/linux
> > > >          make -j$NR_TASK clean
> > > >          make defconfig > /dev/null
> > > >          echo Kernel compile run $i
> > > >          /usr/bin/time -a -o $log make --silent -j$NR_TASK  || die "make failed"
> > > > >
> > >
> > > Thanks.
> > >
> > > >> root@...ntu-server-2204:/home/yangge# sync
> > > >> root@...ntu-server-2204:/home/yangge# echo 3 > /proc/sys/vm/drop_caches
> > > >> root@...ntu-server-2204:/home/yangge# cd /sys/fs/cgroup/
> > > >> root@...ntu-server-2204:/sys/fs/cgroup/# mkdir kernel-build
> > > >> root@...ntu-server-2204:/sys/fs/cgroup/# cd kernel-build
> > > >> root@...ntu-server-2204:/sys/fs/cgroup/kernel-build# echo 470M > memory.max
> > > >> root@...ntu-server-2204:/sys/fs/cgroup/kernel-build# echo $$ > cgroup.procs
> > > >> root@...ntu-server-2204:/sys/fs/cgroup/kernel-build# cd /dev/shm/linux/
> > > >> root@...ntu-server-2204:/dev/shm/linux# make clean && make -j24
> > > >
> > > > I am using make -j 32.
> > > >
> > > > Your step should work.
> > > >
> > > > Did you enable MGLRU in your .config file? Mine did. I attached my
> > > > config file here.
> > > >
> > >
> > > The above test didn't enable MGLRU.
> > >
> > > When MGLRU is enabled, I can reproduce OOM very soon. The cause of
> > > triggering OOM is being analyzed.
>
> Hi Ge,
>
> Just in case, maybe you can try to revert your patch and run the test
> again? I'm also seeing OOM with MGLRU with this test, Active/Inactive
> LRU is fine. But after reverting your patch, the OOM issue still
> exists.
>
> > > I think this is one of the potential side effects -- Hugh mentioned
> > > earlier about isolate_lru_folios():
> > https://lore.kernel.org/linux-mm/503f0df7-91e8-07c1-c4a6-124cad9e65e7@google.com/
> >
> > Try this:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cfa839284b92..778bf5b7ef97 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4320,7 +4320,7 @@ static bool sort_folio(struct lruvec *lruvec,
> > struct folio *folio, struct scan_c
> >         }
> >
> >         /* ineligible */
> > -       if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
> > +       if (!folio_test_lru(folio) || zone > sc->reclaim_idx ||
> > skip_cma(folio, sc)) {
> >                 gen = folio_inc_gen(lruvec, folio, false);
> >                 list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
> >                 return true;
>
> Hi Yu, I tested your patch. On my system (96 cores and 256G RAM), the
> OOM still exists. The test memcg is limited to 512M with 32 threads.
>
> And I found the OOM seems unrelated to either your patch or Ge's
> patch (though they may change the OOM chance slightly).
>
> After the very quick OOM (it failed to untar the linux source code),
> checking lru_gen_full:
> memcg    47 /build-kernel-tmpfs
>  node     0
>         442       1691      29405           0
>                      0          0r          0e          0p         57r        617e          0p
>                      1          0r          0e          0p          0r          4e          0p
>                      2          0r          0e          0p          0r          0e          0p
>                      3          0r          0e          0p          0r          0e          0p
>                                 0           0           0           0           0           0
>         443       1683      57748         832
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>         444       1670      30207         133
>                      0          0           0           0           0           0           0
>                      1          0           0           0           0           0           0
>                      2          0           0           0           0           0           0
>                      3          0           0           0           0           0           0
>                                 0           0           0           0           0           0
>         445       1662          0           0
>                      0          0R         34T          0          57R        238T          0
>                      1          0R          0T          0           0R          0T          0
>                      2          0R          0T          0           0R          0T          0
>                      3          0R          0T          0           0R         81T          0
>                             13807L        324O        867Y       2538N         63F         18A
>
> If I repeat the test many times, it may succeed by chance, but the
> untar process is very slow and generates about 7000 generations.
>
> But if I change the untar cmdline to:
> python -c "import sys; sys.stdout.buffer.write(open('$linux_src',
> mode='rb').read())" | tar zx
>
> Then the problem is gone: it can untar the file successfully and very fast.
>
> This might be a different issue from the one reported by Chris, I'm not sure.

After more testing, I think these are two separate problems (note that I
later changed the memcg limit to 600M so the compile test can run smoothly):

1. OOM during the untar process (can be worked around with the untar
cmdline I mentioned above).
2. OOM during the compile process (this should be the one Chris encountered).

Both 1 and 2 only exist with MGLRU.
1 can be worked around using the cmdline I mentioned above.
2 is caused by Ge's patch, and 1 is not.
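
Since both problems only show up with MGLRU, it may help to confirm the
state of the running kernel before comparing results. The following is a
minimal sketch, assuming the usual sysfs knob
/sys/kernel/mm/lru_gen/enabled, which reads back a hex bitmask such as
0x0007. The function name mglru_enabled and its optional path argument are
hypothetical, added only so the check can also be run against a plain file.

```shell
#!/bin/sh
# Hypothetical helper (not part of anyone's test script in this thread).
# Returns 0 if MGLRU looks enabled, 1 if disabled, 2 if the knob is unreadable.
mglru_enabled() {
    knob="${1:-/sys/kernel/mm/lru_gen/enabled}"
    [ -r "$knob" ] || return 2
    val=$(cat "$knob")
    # The knob reads as a hex bitmask; strip the 0x prefix if present.
    digits=${val#0x}
    # Any non-zero digit means at least one MGLRU component is enabled.
    case "$digits" in
        *[!0]*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

Usage would be something like `mglru_enabled && echo "MGLRU on"` before
starting a run, so results with and without MGLRU are not mixed up.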

I can confirm Yu's patch fixed 2 on my system, but 1 still seems to be a
problem. It's not related to this patch, so maybe it can be discussed
elsewhere.
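
For anyone who wants to try the workaround for 1, the cmdline from above
can be wrapped roughly like this. This is only a sketch: the function name
untar_via_pipe and its arguments are hypothetical, it assumes python3 and
tar are in PATH, and it passes "-f -" explicitly so tar reads the archive
from stdin. The decompression flag defaults to z as in the cmdline above;
pass J instead for an xz tarball.

```shell
#!/bin/sh
# Hypothetical wrapper around the untar workaround quoted earlier.
# $1: tarball path, $2: destination directory,
# $3: tar decompression flag (defaults to z, matching the original cmdline).
untar_via_pipe() {
    src=$1
    dest=$2
    flag=${3:-z}
    mkdir -p "$dest" || return 1
    # Read the whole tarball into memory in one go, then stream it to tar,
    # instead of letting tar read the file itself.
    python3 -c 'import sys; sys.stdout.buffer.write(open(sys.argv[1], mode="rb").read())' "$src" \
        | tar -x"$flag" -f - -C "$dest"
}
```

For example, `untar_via_pipe "$linux_src" "$workdir"` would replace the
direct tar invocation in the per-run part of the test.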