Message-ID: <ZuO1CtpGgwyf8Hui@casper.infradead.org>
Date: Fri, 13 Sep 2024 04:44:10 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Jens Axboe <axboe@...nel.dk>, Christian Theune <ct@...ingcircus.io>,
	linux-mm@...ck.org,
	"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	Daniel Dao <dqminh@...udflare.com>,
	Dave Chinner <david@...morbit.com>, clm@...a.com,
	regressions@...ts.linux.dev, regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On Thu, Sep 12, 2024 at 03:56:17PM -0700, Linus Torvalds wrote:
> On Thu, 12 Sept 2024 at 15:30, Jens Axboe <axboe@...nel.dk> wrote:
> >
> > It might be an iomap thing... Other file systems do use it, but to
> > various degrees, and XFS is definitely the primary user.
> 
> I have to say, I looked at the iomap code, and it's disgusting.

I'm not going to comment on this because I think it's unrelated to
the problem.

We have reports of bad entries being returned from page cache lookups.
Sometimes they're pages which have been freed, sometimes they're pages
which are very definitely in use by a different filesystem.

I think that's what the underlying problem is here (or else we have
two problems).  I'm not convinced that it's necessarily related to large
folios, but it's certainly easier to reproduce with large folios.

I've looked at a number of explanations for this.  Could it be a page
that's being freed without being removed from the xarray?  We seem to
have debugging checks that would trigger in that case, so I don't think so.

Could it be a page with a messed-up refcount?  Again, I think we'd
notice the VM_BUG_ON_PAGE() in put_page_testzero(), so I don't think
it's that either.

My current best guess is that we have an xarray node with a stray pointer
in it; that the node is freed from one xarray, allocated to a different
xarray, but not properly cleared.  But I can't reproduce the problem,
so that's pure speculation on my part.
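
To make the stray-pointer hypothesis concrete, here is a minimal userspace
sketch in C.  It is not kernel code and does not use the real xarray API:
toy_node, toy_tree and every function name below are hypothetical.  It
assumes an allocator that recycles freed nodes without zeroing them, on the
expectation that each node was emptied before it was freed.  Under that
assumption, one teardown path that forgets to clear a slot is enough for a
second tree to return an entry it never stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS 4

/* Hypothetical stand-in for a tree node: a few slots plus a free-list link. */
struct toy_node {
	void *slots[SLOTS];
	struct toy_node *next_free;
};

/* Hypothetical one-node "tree"; real trees have many levels. */
struct toy_tree {
	struct toy_node *node;
};

static struct toy_node *free_list;

/* Recycled nodes are NOT zeroed: the allocator trusts they were freed empty. */
static struct toy_node *node_alloc(void)
{
	struct toy_node *n = free_list;

	if (n) {
		free_list = n->next_free;
		return n;
	}
	return calloc(1, sizeof(*n));
}

static void node_free(struct toy_node *n)
{
	n->next_free = free_list;
	free_list = n;
}

static int tree_store(struct toy_tree *t, unsigned int idx, void *entry)
{
	assert(idx < SLOTS);
	if (!t->node) {
		t->node = node_alloc();
		if (!t->node)
			return -1;
	}
	t->node->slots[idx] = entry;
	return 0;
}

static void *tree_load(const struct toy_tree *t, unsigned int idx)
{
	if (!t->node || idx >= SLOTS)
		return NULL;
	return t->node->slots[idx];
}

/* The bug being modelled: the node is freed with a slot still populated. */
static void tree_destroy_buggy(struct toy_tree *t)
{
	node_free(t->node);
	t->node = NULL;
}

/* Returns 1 if tree b hands back an entry that only tree a ever stored. */
int demo_stray_pointer(void)
{
	static int page_a, page_b;
	struct toy_tree a = { NULL }, b = { NULL };

	if (tree_store(&a, 2, &page_a))
		return -1;
	tree_destroy_buggy(&a);		/* slot 2 still points at page_a */
	if (tree_store(&b, 0, &page_b))	/* b reuses a's old node */
		return -1;
	return tree_load(&b, 2) == &page_a;
}
```

Calling demo_stray_pointer() from a small main() shows the effect: a lookup
at index 2 in tree b returns page_a, even though b only ever stored page_b
at index 0.  This only illustrates the shape of the suspected failure; it
says nothing about where such a stale slot would come from in the kernel.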
