linux-kernel - Re: 6.10/regression/bisected - after f1d97e769152 I spotted increased execution time of the kswapd0 process and symptoms as if there is not enough memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL3q7H4D8Sq1-pbgZb8J_0VeNO=MZqDYPM7aauXqLHDM70UmAg@mail.gmail.com>
Date: Thu, 4 Jul 2024 18:25:17 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Andrea Gelmini <andrea.gelmini@...il.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>, 
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>, 
	Linux regressions mailing list <regressions@...ts.linux.dev>, Btrfs BTRFS <linux-btrfs@...r.kernel.org>, 
	dsterba@...e.com, josef@...icpanda.com
Subject: Re: 6.10/regression/bisected - after f1d97e769152 I spotted increased
 execution time of the kswapd0 process and symptoms as if there is not enough memory

On Thu, Jul 4, 2024 at 3:48 PM Andrea Gelmini <andrea.gelmini@...il.com> wrote:
>
> Il giorno gio 4 lug 2024 alle ore 15:47 Andrea Gelmini
> <andrea.gelmini@...il.com> ha scritto:
> > I send you everything when I collect enough data.
>
> Here we are.
>
> Kernel rc6+branch:
>     Output of bfptrace:
>     https://pastebin.com/P9RFp5mg

So a couple interesting things here, which we didn't get in the short
capture from Mikhail:

1) There's apparently multiple tasks entering the shrinker at the same time:
     kswapd0, Chrome_ChildIOT, Chrome_IOThread, chrome, Xorg.

2) In some cases we get very large negative numbers for the number of
extent maps to scan.
    This shouldn't happen and either our own btrfs counter might have
overflowed or some other bug,
    or the super block's shrinker is being called with sc->nr_to_scan
negative, and outside btrfs' control,
    and it seems outside of control of the VFS's shrinker callback
(see fs/super.c:super_cache_scan()).

>
>     Recording of tar session: (summary: start fast, then flipping super slow)
>     https://asciinema.org/a/BxYI83TkrlOhEe42IWXNY135D
>
>     Recording of htop session: (summary: PSI high and two threads at 100%)
>     https://asciinema.org/a/ZwGSepZZ8TSpFfPssACUUXcCB

Ok, so maybe I missed it, but I haven't kswapd0 in there, or nothing
taking 100% CPU.
Maybe it was just Mikhail running into that?

I was looking at the memory PSI and I never noticed it going over 60%.
As for cpu and IO PSI, for cpu it was always low, under 3% from what
I've seen and for IO even lower than that, very close to 0%.

So I'm surprised that you get an unresponsive desktop.

>
>
> Kernel 6.6.36:
>     Recording of tar session: (summary: tar always fast)
>     https://asciinema.org/a/a6dOkbjyPFkkQ5aNTaRiFD3H8
>
>     Recording of htop session: (summary: no threads and PSI load)
>     https://asciinema.org/a/mFsypWzHfSdsjrIQf8zpzNpKo

Interestingly, here the memory PSI stays at 0% or very close to that,
it never reaches anything close to the 60%.

>
> If you need to run for longer time, I can do it in the weekend.
> If you need dump of my BTRFS fs, no problem, but I need 'btrfs image
> -s" working (point is: scrambling filenames).

Ok, so I haven been delaying my reply because I kept accumulating
things for you (or Mikhail) to try, and avoid sending several messages
with very little.

So first thing, I tried reproducing your scenario like you described
in a previous message using tar:

On a fresh btrfs filesystem, I cloned Linus' kernel tree into /mnt/git/linux
Compiled a kernel.
Then copied the tree 3 times like this:

cd /mnt/git
cp --reflink=never -r linux linux2
cp --reflink=never -r linux linux3
cp --reflink=never -r linux linux4

The total size of /mnt/git was 62G (as reported by:  du -hs /mnt/git).

Than I ran:

cd /mnt/git
tar cp git/ | pv > /dev/null

With htop in parallel, the bpftrace script, and since my htop version
doesn't show PSI information (probably an older version than yours), I
kept monitoring PSI like this:

watch -d -n 3 'echo "cpu:\n"; cat /proc/pressure/cpu ; echo
"\nmemory:\n" ; cat /proc/pressure/memory ; echo "\nio:\n" ; cat
/proc/pressure/io'

Nothing went out of the roof, the machine was always responsive, never
seen kswapd0 anywhere near the top, and the process using most CPU was
tar (and always under 30%).
PSI had all values low.

The shrinker was being triggered very often, for small numbers (mostly
under 1000, and most of the time much less than that), but I never had
those large negative numbers nor apparently different tasks entering
into it concurrently.
It took a few seconds at most in each run.

I also tried monitoring while doing the "cp --reflink=never -r"
commands and while PSI often peaked to 92%, 93%, the system was always
responsive (and such IO PSI seems reasonable since we are doing a lot
of read and write IO).

So several different things to try here:

1) First let's check that the problem is really a consequence of the shrinker.
    Try this patch:

    https://gist.githubusercontent.com/fdmanana/b44abaade0000d28ba0e1e1ae3ac4fee/raw/5c9bf0beb5aa156b893be2837c9244d035962c74/gistfile1.txt

    This disables the shrinker. This is just to confirm if I'm looking
in the right direction, if your problem is the same as Mikhail's and
double check his bisection.

2) Then drop that patch that disables the shrinker.
     With all the previous 4 patches applied, apply this one on top of them:

     https://gist.githubusercontent.com/fdmanana/9cea16ca56594f8c7e20b67dc66c6c94/raw/557bd5f6b37b65d210218f8da8987b74bfe5e515/gistfile1.txt

     The goal here is to see if the extent map eviction done by the
shrinker is making reads from other tasks too slow, and check if
that's what0s making your system unresponsive.

3) Then drop the patch from step 2), and on top of the previous 4
patches from my git tree, apply this one:

     https://gist.githubusercontent.com/fdmanana/a7c9c2abb69c978cf5b80c2f784243d5/raw/b4cca964904d3ec15c74e36ccf111a3a2f530520/gistfile1.txt

     This is just to confirm if we do have concurrent calls to the
shrinker, as the tracing seems to suggest, and where the negative
numbers come from.
     It also helps to check if not allowing concurrent calls to it, by
skipping if it's already running, helps making the problems go away.

>
> Thanks a lot,

Thanks a lot to you and Mikhail, not just for the reporting but also
to apply patches, compile a kernel, run the tests and do all those
valuable observations which are all very time consuming.

Thanks!

> Gelma