Message-ID: <w5ap2zcsatkx4dmakrkjmaexwh3mnmgc5vhavb2miaj6grrzat@7kzr5vlsrmh5>
Date: Fri, 13 Jun 2025 00:24:13 +0200
From: Ryan Lahfa <ryan@...fa.xyz>
To: Antony Antony <antony.antony@...unet.com>
Cc: David Howells <dhowells@...hat.com>, 
	Antony Antony <antony@...nome.org>, Christian Brauner <brauner@...nel.org>, 
	Eric Van Hensbergen <ericvh@...nel.org>, Latchesar Ionkov <lucho@...kov.net>, 
	Dominique Martinet <asmadeus@...ewreck.org>, Christian Schoenebeck <linux_oss@...debyte.com>, 
	Sedat Dilek <sedat.dilek@...il.com>, Maximilian Bosch <maximilian@...sch.me>, 
	regressions@...ts.linux.dev, v9fs@...ts.linux.dev, netfs@...ts.linux.dev, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [REGRESSION] 9pfs issues on 6.12-rc1

Hi everyone,

On Wed, Oct 23, 2024 at 09:38:39PM +0200, Antony Antony wrote:
> On Wed, Oct 23, 2024 at 11:07:05 +0100, David Howells wrote:
> > Hi Antony,
> > 
> > I think the attached should fix it properly rather than working around it as
> > the previous patch did.  If you could give it a whirl?
> 
> Yes, this also fixes the crash.
> 
> Tested-by: Antony Antony <antony.antony@...unet.com>

I cannot confirm this fixes the crash for me. My reproducer is slightly
more complicated than Max's original one, though still on NixOS, and it
probably uses 9p more intensively than the automated NixOS testing
workload.

Here is how to reproduce it:

$ git clone https://gerrit.lix.systems/lix
$ cd lix
$ git fetch https://gerrit.lix.systems/lix refs/changes/29/3329/8 && git checkout FETCH_HEAD
$ nix-build -A hydraJobs.tests.local-releng

I suspect the reason Antony considers the crash fixed is that the
workload used to test it needs a significant amount of luck and many
retries to trigger the bug.
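
To give an idea of the flakiness, a retry loop along these lines is
what it takes to hit it (an illustrative sketch; the iteration count
is arbitrary):

  # keep re-running the test until one run crashes
  for i in $(seq 1 100); do
      nix-build -A hydraJobs.tests.local-releng || break
  done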

On our end, you can see our CI showing the symptoms:
https://buildkite.com/organizations/lix-project/pipelines/lix/builds/2357/jobs/019761e7-784e-4790-8c1b-f609270d9d19/log

We retried probably hundreds of times and saw different corruption
patterns: Python getting confused, ld.so getting confused, sometimes
systemd too. Python had a much higher chance of crashing in many of our
tests. We reproduced it on aarch64-linux (Ampere Altra Q80-30) but also
on Intel and AMD CPUs (~5 different systems).

As soon as we reverted to the Linux 6.6 series, the bug went away.
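
For anyone wanting to try the same on NixOS, pinning the kernel is a
one-line configuration change (assuming the stock nixpkgs attribute
for the 6.6 series):

  # pin the VM's kernel to the 6.6 series
  boot.kernelPackages = pkgs.linuxPackages_6_6;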

We tried bisecting, but we started to run into weirder problems: we hit
the original regression mentioned in October 2024, and for a certain
range of commits we were unable to bisect any further.

So I switched my bisection strategy to finding out when the bug was
fixed instead, which led me to commit
e65a0dc1cabe71b91ef5603e5814359451b74ca7, the proper fix mentioned
here and in this discussion.
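
For the curious, the setup looks roughly like this, using git bisect's
custom terms to search for the fix rather than the breakage (a sketch;
the endpoint revisions are placeholders):

  git bisect start --term-old=broken --term-new=fixed
  git bisect broken <known-broken-rev>
  git bisect fixed <known-fixed-rev>
  # at each step: build, boot the VM, run the reproducer, then mark
  git bisect fixed     # if the reproducer no longer crashes
  git bisect broken    # if it still crashes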

Reverting this on top of 6.12 does indeed cause a massive amount of
traces; see this gist [1] for examples.

Applying the "workaround patch" aka "[PATCH] 9p: Don't revert the I/O
iterator after reading" after reverting e65a0dc1cabe makes the problem
go away after 5 tries (5 tries were sufficient to trigger with the
proper fix).
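
Concretely, that combination was built like this (the patch filename
is hypothetical):

  # drop the proper fix from the tree
  git revert e65a0dc1cabe
  # re-apply the earlier workaround patch on top
  git am 0001-9p-Don-t-revert-the-I-O-iterator-after-reading.patch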

In case it is helpful: the nature of the test above is to copy a
significant amount of assets to an S3 implementation (Garage) running
inside the VM. Many of these assets come from the Nix store, which
sits on 9p.

Anyhow, I see three patterns:

- Kernel panic when starting /init; this is the crash Max reported
  back in October 2024 and the one we started to encounter while
  bisecting this problem in the range between v6.11 and v6.12.
- systemd crashing very quickly; this is what we see when reverting
  e65a0dc1cabe71b91ef5603e5814359451b74ca7 on top of v6.12 *OR* when
  we are around v6.12-rc5.
- Userspace programs crashing after some serious I/O exercising, which
  is what the CI above shows; this happens on top of v6.12, v6.14, and
  v6.15 (incl. stable kernels).

If you need me to test things, please let me know.

[1]: https://gist.dgnum.eu/raito/3d1fa61ebaf642218342ffe644fb6efd
-- 
Ryan Lahfa
