linux-ext4 - Re: generic/418 regression seen on 5.12-rc3

Message-ID: <20210318221620.GW3420@casper.infradead.org>
Date:   Thu, 18 Mar 2021 22:16:20 +0000
From:   Matthew Wilcox <willy@...radead.org>
To:     Eric Whitney <enwlinux@...il.com>
Cc:     linux-ext4@...r.kernel.org, tytso@....edu
Subject: Re: generic/418 regression seen on 5.12-rc3

On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> * Matthew Wilcox <willy@...radead.org>:
> > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > time when run on 5.12-rc3 and 5.12-rc1 kernels.  This first occurred when
> > > running the 1k test case using kvm-xfstests.  I was then able to bisect the
> > > failure to a patch landed in the -rc1 merge window:
> > > 
> > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > 
> > Thanks for letting me know.  This failure is new to me.
> 
> Sure - it's useful to know that it's new to you.  Ted said he's also going
> to test XFS with a large number of generic/418 trials which would be a
> useful comparison.  However, he's had no luck as yet reproducing what I've
> seen on his Google compute engine test setup running ext4.
> 
> > 
> > I don't understand it; this patch changes the behaviour of buffered reads
> > from waiting on a page with a refcount held to waiting on a page without
> > the refcount held, then starting the lookup from scratch once the page
> > is unlocked.  I find it hard to believe this introduces a /new/ failure.
> > Either it makes an existing failure easier to hit, or there's a subtle
> > bug in the retry logic that I'm not seeing.
> > 
> 
> For keeping Murphy at bay I'm rerunning the bisection from scratch just
> to make sure I come out at the same patch.  The initial bisection looked
> clean, but when dealing with a failure that occurs probabilistically it's
> easy enough to get it wrong.  Is this patch revertable in -rc1 or -rc3?
> Ordinarily I like to do that for confirmation.

Alas, not easily.  I've built a lot on top of it since then.  I could
probably come up with a moral reversion (and will have to if we can't
figure out why it's causing a problem!)

> And there's always the chance that a latent ext4 bug is being hit.

That would also be valuable information to find out.  If this
patch is exposing a latent bug, I can't think what it might be.

> I'd be very happy to run whatever debugging patches you might want, though
> you might want to wait until I've reproduced the bisection result.  The
> offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> running a file system with 4k blocks.

As I expected, but thank you for being willing to run debug patches.
I'll wait for you to confirm the bisection and then work up something
that'll help figure out what's going on.