Message-ID: <CAL3q7H6G43cb-k-efam8=ydR0L_MdEXvFtLf4T6uqakuS1FBiw@mail.gmail.com>
Date: Fri, 5 Jul 2024 12:00:11 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Andrea Gelmini <andrea.gelmini@...il.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>, 
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>, 
	Linux regressions mailing list <regressions@...ts.linux.dev>, Btrfs BTRFS <linux-btrfs@...r.kernel.org>, 
	dsterba@...e.com, josef@...icpanda.com
Subject: Re: 6.10/regression/bisected - after f1d97e769152 I spotted increased
 execution time of the kswapd0 process and symptoms as if there is not enough memory

On Thu, Jul 4, 2024 at 11:15 PM Andrea Gelmini <andrea.gelmini@...il.com> wrote:
>
> On Thu, Jul 4, 2024 at 19:25 Filipe Manana
> <fdmanana@...nel.org> wrote:
> > 2) In some cases we get very large negative numbers for the number of
> > extent maps to scan.
> >     This shouldn't happen and either our own btrfs counter might have
> > overflowed or some other bug,
>
> Well, I was thinking about my particular setup, and I tried this:
> a) kernel 6.6.36;
> b) on a spare NVMe partition, created a shiny new btrfs;
> c) then mounted it forcing compression;
> d) multiple parallel cp of the kernel and libreoffice sources;
> e) rebooted with the same rc6+branch already used;
> f) tar of the new btrfs: no problem at all;
> g) let it finish;
> h) tar of /.snapshots: memory PSI skyrockets, with the usual slowdown reading;
> i) stop it;
> l) tar of the new btrfs again: no problem;
> m) repeat a few times.
>
> You can see the output here:
> https://asciinema.org/a/rJpGWvXYH6IDBXWYhtJckkKWo
>
> At the end you can see I kill tar and let the PSI go down to zero, if
> you are interested.
>
> > Ok, so maybe I missed it, but I haven't seen kswapd0 in there, or anything
> > taking 100% CPU.
> > Maybe it was just Mikhail running into that?
>
> To get this effect and the extremely sluggish response (I mean, click
> something and it takes more than 30 seconds to react)
> I need to work at least one day on my laptop. At that point even
> switching virtual desktops takes a long time.
>
> Thinking about how my use case differs:
> a) I always suspend. I only reboot when I change kernels. So I can work
> for weeks with the same kernel. Suspend-to-RAM, not disk, btw;
> b) months ago I let beesd run for a day.
>
> > So I'm surprised that you get an unresponsive desktop.
> Same point as before. In this case it is not so sluggish, but, e.g., if
> I click for screenlock it doesn't start immediately; it waits for a
> little bit more than one second.

Oh, I see that on my main desktop, which only uses ext4 and always has 2
qemu VMs, usually running debian and opensuse.
Sometimes, even when the VMs aren't doing anything but were previously
doing IO-heavy testing, the desktop on the host gets unresponsive:
clicking the screenlock often takes at least some 5 seconds,
changing workspaces takes a few seconds too, etc. Shouldn't happen in
theory.

>
> > Interestingly, here the memory PSI stays at 0% or very close to that,
> > it never reaches anything close to the 60%.
>
> You see the same thing in the last test with the new btrfs partition.
> New partition: ~0%
> /.snapshots/: near 60%.
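
For anyone following along without a PSI-aware htop, the memory pressure
figures being compared here can also be read straight from
/proc/pressure/memory. What follows is only a minimal sketch, not the
monitoring command used earlier in this thread; the helper names
(parse_psi, read_memory_psi) and the sample numbers are made up for
illustration.

```python
def parse_psi(text):
    """Parse PSI text (the format of /proc/pressure/memory) into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field is "key=value"; avg* are percentages, total is in microseconds.
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_memory_psi(path="/proc/pressure/memory"):
    """Read and parse the live memory PSI file (needs a kernel with PSI enabled)."""
    with open(path) as f:
        return parse_psi(f.read())


# Illustrative sample only; these numbers are invented, not taken from the tests above.
sample = (
    "some avg10=58.31 avg60=41.07 avg300=12.50 total=123456789\n"
    "full avg10=55.02 avg60=38.90 avg300=11.75 total=98765432\n"
)
psi = parse_psi(sample)
print("some avg10:", psi["some"]["avg10"])
```

Calling read_memory_psi() in a loop with a short sleep would give a rough
view of how the "some" and "full" averages move while tar is running.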

It could be due to heavy fragmentation, but that should only be too
slow if you were using a spinning disk.
I think somewhere you mentioned nvme or ssd.

Removing the extent maps could cause extra reads of metadata and be slow.
But the number of extent maps removed on every iteration is relatively
small, and round-robin, so... it's strange that it causes such huge
pressure and desktop unresponsiveness.
We will know if that's the case with the 2nd test patch.

>
>
> > With htop in parallel, the bpftrace script, and since my htop version
> > doesn't show PSI information (probably an older version than yours), I
> > kept monitoring PSI like this:
>
> Well, mine is taken from here:
> https://github.com/htop-dev/htop.git
> Compiled with:
> ./configure --enable-capabilities --enable-delayacct --enable-sensors
> --enable-werror   --enable-affinity
> And a tweaked config file. If you want, I can send it.

Thanks, I'll have to try it eventually.

>
>
> > So several different things to try here:
>
> I'll stop here for the moment. I have to sleep.
> Over the weekend I'll do the rest and reply to you!

Sure, take your time. It takes time patching and building kernels,
plus the testing, etc.
Many thanks for that!

>
> > Thanks a lot to you and Mikhail, not just for the reporting but also
> > for applying patches, compiling a kernel, running the tests and making all
> > those valuable observations, which are all very time consuming.
>
> My little contribution to free software!
>
> Ciao,
> Gelma
