Message-ID: <20210318221620.GW3420@casper.infradead.org>
Date:   Thu, 18 Mar 2021 22:16:20 +0000
From:   Matthew Wilcox <willy@...radead.org>
To:     Eric Whitney <enwlinux@...il.com>
Cc:     linux-ext4@...r.kernel.org, tytso@....edu
Subject: Re: generic/418 regression seen on 5.12-rc3

On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> * Matthew Wilcox <willy@...radead.org>:
> > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > time when run on 5.12-rc3 and 5.12-rc1 kernels.  This first occurred when
> > > running the 1k test case using kvm-xfstests.  I was then able to bisect the
> > > failure to a patch that landed in the -rc1 merge window:
> > > 
> > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> > 
> > Thanks for letting me know.  This failure is new to me.
> 
> Sure - it's useful to know that it's new to you.  Ted said he's also going
> to test XFS with a large number of generic/418 trials, which would be a
> useful comparison.  However, he's had no luck as yet reproducing what I've
> seen on his Google Compute Engine test setup running ext4.
> 
> > 
> > I don't understand it; this patch changes the behaviour of buffered reads
> > from waiting on a page with a refcount held to waiting on a page without
> > the refcount held, then starting the lookup from scratch once the page
> > is unlocked.  I find it hard to believe this introduces a /new/ failure.
> > Either it makes an existing failure easier to hit, or there's a subtle
> > bug in the retry logic that I'm not seeing.
> > 
> 
> To keep Murphy at bay, I'm rerunning the bisection from scratch just
> to make sure I come out at the same patch.  The initial bisection looked
> clean, but when dealing with a failure that occurs probabilistically it's
> easy enough to get it wrong.  Is this patch revertible in -rc1 or -rc3?
> Ordinarily I like to do that for confirmation.

Alas, not easily.  I've built a lot on top of it since then.  I could
probably come up with a moral reversion (and will have to if we can't
figure out why it's causing a problem!).
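
To make that concrete, here's a toy sketch of the difference described
in my quoted mail above.  It's illustrative only -- not the real
mm/filemap.c code -- and every name in it is made up for the example:

/*
 * Toy model of the behaviour change: the old buffered-read path waited
 * for the page lock with a reference held; the new path drops the
 * reference, waits for the unlock, then redoes the lookup from scratch.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_page {
	int  refcount;
	bool locked;
};

static struct toy_page cache_slot = { .refcount = 0, .locked = true };

static struct toy_page *cache_lookup_and_get(void)
{
	cache_slot.refcount++;		/* lookup takes a reference */
	return &cache_slot;
}

static void put_toy_page(struct toy_page *p)
{
	p->refcount--;
}

static void wait_for_unlock(struct toy_page *p)
{
	/* pretend someone else (readpage completion, say) unlocks it */
	p->locked = false;
}

/* Old behaviour: wait for the unlock while still holding the reference. */
static void buffered_read_old(void)
{
	struct toy_page *page = cache_lookup_and_get();

	if (page->locked)
		wait_for_unlock(page);	/* refcount stays elevated here */
	printf("old: read page, refcount=%d\n", page->refcount);
	put_toy_page(page);
}

/* New behaviour: drop the reference, wait, then restart the lookup. */
static void buffered_read_new(void)
{
	struct toy_page *page;

retry:
	page = cache_lookup_and_get();
	if (page->locked) {
		put_toy_page(page);	/* no reference held while waiting */
		wait_for_unlock(page);
		goto retry;		/* lookup starts from scratch */
	}
	printf("new: read page, refcount=%d\n", page->refcount);
	put_toy_page(page);
}

int main(void)
{
	buffered_read_old();
	cache_slot.locked = true;
	buffered_read_new();
	return 0;
}

The interesting part is the new path's retry: any state from the first
lookup is thrown away, which is where a subtle retry bug would hide if
there is one.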

> And there's always the chance that a latent ext4 bug is being hit.

That would also be valuable information to find out.  If this
patch is exposing a latent bug, I can't think what it might be.

> I'd be very happy to run whatever debugging patches you might want, though
> you might want to wait until I've reproduced the bisection result.  The
> offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> running a file system with 4k blocks.

As I expected, but thank you for being willing to run debug patches.
I'll wait for you to confirm the bisection and then work up something
that'll help figure out what's going on.
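
As an aside, since you mention how easy it is to get a probabilistic
bisection wrong: purely as a back-of-the-envelope illustration (the
per-run failure rate below is made up, and runs are assumed
independent), the number of clean runs needed before trusting a "good"
verdict at each bisection step works out roughly like this:

/*
 * Illustration only.  If a bad commit makes generic/418 fail with
 * independent probability p per run, then after n passing runs the
 * chance of wrongly calling that commit good is (1 - p)^n.
 */
#include <stdio.h>

int main(void)
{
	double p = 0.10;	/* assumed per-run failure rate */
	double target = 0.01;	/* acceptable risk of a wrong "good" */
	double miss = 1.0;
	int n = 0;

	while (miss > target) {
		miss *= 1.0 - p;
		n++;
	}
	printf("with p=%.2f, about %d clean runs give <%.0f%% risk\n",
	       p, n, target * 100.0);
	return 0;
}

With those made-up numbers that's about 44 runs per step, which is why
a single clean-looking bisection isn't conclusive on its own.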
