linux-kernel - Re: 6.10/regression/bisected - after f1d97e769152 I spotted increased execution time of the kswapd0 process and symptoms as if there is not enough memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL3q7H5MEyR+=_16bzq533ua-Ne3F1evZmt2NMcYoT9t=XtPig@mail.gmail.com>
Date: Thu, 4 Jul 2024 18:31:50 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Andrea Gelmini <andrea.gelmini@...il.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>, 
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>, 
	Linux regressions mailing list <regressions@...ts.linux.dev>, Btrfs BTRFS <linux-btrfs@...r.kernel.org>, 
	dsterba@...e.com, josef@...icpanda.com
Subject: Re: 6.10/regression/bisected - after f1d97e769152 I spotted increased
 execution time of the kswapd0 process and symptoms as if there is not enough memory

On Thu, Jul 4, 2024 at 6:25 PM Filipe Manana <fdmanana@...nel.org> wrote:
>
> On Thu, Jul 4, 2024 at 3:48 PM Andrea Gelmini <andrea.gelmini@...il.com> wrote:
> >
> > Il giorno gio 4 lug 2024 alle ore 15:47 Andrea Gelmini
> > <andrea.gelmini@...il.com> ha scritto:
> > > I send you everything when I collect enough data.
> >
> > Here we are.
> >
> > Kernel rc6+branch:
> >     Output of bfptrace:
> >     https://pastebin.com/P9RFp5mg
>
> So a couple interesting things here, which we didn't get in the short
> capture from Mikhail:
>
> 1) There's apparently multiple tasks entering the shrinker at the same time:
>      kswapd0, Chrome_ChildIOT, Chrome_IOThread, chrome, Xorg.
>
> 2) In some cases we get very large negative numbers for the number of
> extent maps to scan.
>     This shouldn't happen and either our own btrfs counter might have
> overflowed or some other bug,
>     or the super block's shrinker is being called with sc->nr_to_scan
> negative, and outside btrfs' control,
>     and it seems outside of control of the VFS's shrinker callback
> (see fs/super.c:super_cache_scan()).
>
> >
> >     Recording of tar session: (summary: start fast, then flipping super slow)
> >     https://asciinema.org/a/BxYI83TkrlOhEe42IWXNY135D
> >
> >     Recording of htop session: (summary: PSI high and two threads at 100%)
> >     https://asciinema.org/a/ZwGSepZZ8TSpFfPssACUUXcCB
>
> Ok, so maybe I missed it, but I haven't kswapd0 in there, or nothing
> taking 100% CPU.
> Maybe it was just Mikhail running into that?
>
> I was looking at the memory PSI and I never noticed it going over 60%.
> As for cpu and IO PSI, for cpu it was always low, under 3% from what
> I've seen and for IO even lower than that, very close to 0%.
>
> So I'm surprised that you get an unresponsive desktop.
>
> >
> >
> > Kernel 6.6.36:
> >     Recording of tar session: (summary: tar always fast)
> >     https://asciinema.org/a/a6dOkbjyPFkkQ5aNTaRiFD3H8
> >
> >     Recording of htop session: (summary: no threads and PSI load)
> >     https://asciinema.org/a/mFsypWzHfSdsjrIQf8zpzNpKo
>
> Interestingly, here the memory PSI stays at 0% or very close to that,
> it never reaches anything close to the 60%.
>
> >
> > If you need to run for longer time, I can do it in the weekend.
> > If you need dump of my BTRFS fs, no problem, but I need 'btrfs image
> > -s" working (point is: scrambling filenames).
>
> Ok, so I haven been delaying my reply because I kept accumulating
> things for you (or Mikhail) to try, and avoid sending several messages
> with very little.
>
> So first thing, I tried reproducing your scenario like you described
> in a previous message using tar:
>
> On a fresh btrfs filesystem, I cloned Linus' kernel tree into /mnt/git/linux
> Compiled a kernel.
> Then copied the tree 3 times like this:
>
> cd /mnt/git
> cp --reflink=never -r linux linux2
> cp --reflink=never -r linux linux3
> cp --reflink=never -r linux linux4
>
> The total size of /mnt/git was 62G (as reported by:  du -hs /mnt/git).
>
> Than I ran:
>
> cd /mnt/git
> tar cp git/ | pv > /dev/null
>
> With htop in parallel, the bpftrace script, and since my htop version
> doesn't show PSI information (probably an older version than yours), I
> kept monitoring PSI like this:
>
> watch -d -n 3 'echo "cpu:\n"; cat /proc/pressure/cpu ; echo
> "\nmemory:\n" ; cat /proc/pressure/memory ; echo "\nio:\n" ; cat
> /proc/pressure/io'
>
> Nothing went out of the roof, the machine was always responsive, never
> seen kswapd0 anywhere near the top, and the process using most CPU was
> tar (and always under 30%).
> PSI had all values low.
>
> The shrinker was being triggered very often, for small numbers (mostly
> under 1000, and most of the time much less than that), but I never had
> those large negative numbers nor apparently different tasks entering
> into it concurrently.
> It took a few seconds at most in each run.
>
> I also tried monitoring while doing the "cp --reflink=never -r"
> commands and while PSI often peaked to 92%, 93%, the system was always
> responsive (and such IO PSI seems reasonable since we are doing a lot
> of read and write IO).
>
> So several different things to try here:
>
> 1) First let's check that the problem is really a consequence of the shrinker.
>     Try this patch:
>
>     https://gist.githubusercontent.com/fdmanana/b44abaade0000d28ba0e1e1ae3ac4fee/raw/5c9bf0beb5aa156b893be2837c9244d035962c74/gistfile1.txt
>
>     This disables the shrinker. This is just to confirm if I'm looking
> in the right direction, if your problem is the same as Mikhail's and
> double check his bisection.
>
> 2) Then drop that patch that disables the shrinker.
>      With all the previous 4 patches applied, apply this one on top of them:
>
>      https://gist.githubusercontent.com/fdmanana/9cea16ca56594f8c7e20b67dc66c6c94/raw/557bd5f6b37b65d210218f8da8987b74bfe5e515/gistfile1.txt
>
>      The goal here is to see if the extent map eviction done by the
> shrinker is making reads from other tasks too slow, and check if
> that's what0s making your system unresponsive.
>
> 3) Then drop the patch from step 2), and on top of the previous 4
> patches from my git tree, apply this one:
>
>      https://gist.githubusercontent.com/fdmanana/a7c9c2abb69c978cf5b80c2f784243d5/raw/b4cca964904d3ec15c74e36ccf111a3a2f530520/gistfile1.txt
>
>      This is just to confirm if we do have concurrent calls to the
> shrinker, as the tracing seems to suggest, and where the negative
> numbers come from.
>      It also helps to check if not allowing concurrent calls to it, by
> skipping if it's already running, helps making the problems go away.

Oh and for this one, show your 'dmesg' after your testing to see if
any stack traces or warning messages were logged (even if it happens
to solve all the problems).

Thanks!


>
> >
> > Thanks a lot,
>
> Thanks a lot to you and Mikhail, not just for the reporting but also
> to apply patches, compile a kernel, run the tests and do all those
> valuable observations which are all very time consuming.
>
> Thanks!
>
> > Gelma