linux-kernel - Re: [GIT PULL] gfs2 fix

Message-ID: <CAHc6FU7i4GJaSodNX+o44VgWyo1LTPdYkBnypYS3GYa1atYAZA@mail.gmail.com>
Date:   Tue, 3 May 2022 15:30:25 +0200
From:   Andreas Gruenbacher <agruenba@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Dave Chinner <dchinner@...hat.com>,
        cluster-devel <cluster-devel@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> > >
> > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > removed, the corruption goes away.
> >
> > I looked some more at this on and off, and ended up even more confused.
> >
> > For some reason, I'd mostly looked at the read case, because I had
> > mis-read some of your emails and thought it was the buffered reads
> > that caused problems.
> >
> > Then I went back more carefully, and realized you had always said
> > gfs2_file_buffered_write() was where the issues happened, and looked
> > at that path more, and that confused me even *MORE*.
> >
> > Because that case has always done the copy from user space with page
> > faults disabled, because of the traditional deadlock with reading from
> > user space while holding the page lock on the target page cache page.
> >
> > So that is not really about the new deadlock with filesystem locks,
> > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > for buffered I/O").
> >
> > So now that I'm looking at the right function (maybe) I'm going "huh",
> > because it's none of the complex cases that would seem to fail, it's
> > literally just the fault_in_iov_iter_readable() that we've always done
> > in iomap_write_iter() that presumably starts failing.
> >
> > But *that* old code seems bogus too. It's doing
> >
> >                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> >                         status = -EFAULT;
> >                         break;
> >                 }
> >
> > which on the face of it is sane: it's saying "if we can't fault in any
> > bytes, then stop trying".
> >
> > And it's good, and correct, but it does leave one case open.
> >
> > Because what if the result is "we can fault things in _partially_"?
> >
> > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> >
> > Now, with a bug-free filesystem, this really shouldn't matter, since
> > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > but this does mean that one fundamental thing that that commit
> > 00bfe02f4796 changed is that it basically disabled that
> > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > and now potentially only faults in a small area.
> >
> > That, in turn, means that in practice it *used* to do "write_end()"
> > with a fully successful range, ie when it did that
> >
> >                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
> >                                                 page, fsdata);
> >
> > then "bytes" and "copied" were the same.
> >
> > > But now that commit 00bfe02f4796 added the "pagefault_disable()"
> > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > half-way instead of bringing the next page in, and then that
> > ->write_begin() to ->write_end() sequence will see the copy in the
> > middle failing half-way too, and you'll have that write_end()
> > condition with the write _partially_ succeeding.
> >
> > Which is the complex case for write_end() that you practically
> > speaking never saw before (it *could* happen with a race with swap-out
> > or similar, but it was not really something you could trigger in real
> > > life).
> >
> > > And I suspect this is what bites you with gfs2.
> >
> > To *test* that hypothesis, how about you try this attached patch? The
> > generic_perform_write() function in mm/filemap.c has the same exact
> > pattern, but as mentioned, a filesystem really needs to be able to
> > handle the partial write_end() case, so it's not a *bug* in that code,
> > > but it might be triggering a bug in gfs2.
> >
> > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > only case this attached patch changes.
> >
> > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > think it may be what then triggers the real bug in gfs2. So this patch
> > is not wrong per se, but this patch is basically a "hide the problem"
> > patch, and it would be very interesting to hear if it does indeed fix
> > your test-case.
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.
>
> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.
>
> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.
>
> We're on filesystems with block size == page size, so none of the
> struct iomap_page uptodate handling should be involved, either.

The rounding around the hole punching in gfs2_iomap_end() looks wrong.
I'm trying a fix now.
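
For readers following along, here is a minimal user-space sketch of the kind of arithmetic involved when a short write leaves freshly allocated blocks without data. It is not the gfs2_iomap_end() code, and the message does not say which rounding is wrong. The names round_up_pow2(), struct range and unwritten_blocks() are hypothetical, and the sketch assumes the end of the new allocation (map_end) is already block aligned:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round x up to a multiple of blksize (a power of two). */
static uint64_t round_up_pow2(uint64_t x, uint64_t blksize)
{
	assert(blksize != 0 && (blksize & (blksize - 1)) == 0);
	return (x + blksize - 1) & ~(blksize - 1);
}

/* Byte range [start, start + len) that could be deallocated again. */
struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * A write of blocks up to map_end was set up starting at pos, but only
 * 'written' bytes were copied.  A block that received any data must be
 * kept, so the start of the range is rounded UP to the next block
 * boundary.  map_end is assumed to be block aligned already.
 */
static struct range unwritten_blocks(uint64_t pos, uint64_t written,
				     uint64_t map_end, uint64_t blksize)
{
	struct range r = { 0, 0 };
	uint64_t start = round_up_pow2(pos + written, blksize);

	if (start < map_end) {
		r.start = start;
		r.len = map_end - start;
	}
	return r;
}
```

With a 4096-byte block size, a write at pos 0 that copies only 100 bytes of an 8192-byte allocation leaves the second block (offset 4096, length 4096) as the only candidate for deallocation. Rounding the start down instead would discard the 100 bytes that were written.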

> > Because that would pinpoint exactly what the bug is.
> >
> > I'm adding Christoph and Darrick as iomap maintainers here to the
> > participants (and Dave Chinner in case he's also the temporary
> > maintainer because Darrick is doing reviews) not because they
> > necessarily care, but just because this test-patch obviously involves
> > the iomap code.
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different. But
> > since I finally think I figured out what might be going on, I decided
> > I'd send this out sooner rather than later.
> >
> > Because this is the first thing that makes me go "Aaahh.. This might
> > explain it".
> >
> >                    Linus
>
> Thanks,
> Andreas
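
The partial fault-in case discussed in the quoted text can be modeled in user space. The sketch below is an illustration under stated assumptions, not the attached test patch (which is not shown in this message; only its WARN_ON_ONCE(!bytes) is mentioned). model_fault_in() is a hypothetical stand-in for fault_in_iov_iter_readable(), which returns the number of bytes it could NOT fault in, and 'resident' is an invented parameter for how much of the user buffer is accessible:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of fault_in_iov_iter_readable(): returns the number
 * of bytes that could NOT be faulted in.  'resident' is how many bytes
 * from the start of the user buffer are accessible.
 */
static size_t model_fault_in(size_t resident, size_t bytes)
{
	assert(bytes > 0);	/* the write loop never asks for zero bytes */
	return bytes > resident ? bytes - resident : 0;
}

/*
 * The check quoted in the thread: give up only when nothing at all could
 * be faulted in.  Returns the number of bytes the loop goes on to
 * attempt, or 0 to stand for the -EFAULT exit.
 */
static size_t attempt_unclamped(size_t resident, size_t bytes)
{
	if (model_fault_in(resident, bytes) == bytes)
		return 0;
	return bytes;	/* the whole range, even after a partial fault-in */
}

/*
 * One possible "hide the problem" variant: shrink the attempt to what
 * was actually faulted in, so that the later copy is unlikely to be short.
 */
static size_t attempt_clamped(size_t resident, size_t bytes)
{
	bytes -= model_fault_in(resident, bytes);
	return bytes;	/* 0 again stands for -EFAULT */
}
```

With 100 accessible bytes and a 4096-byte request, attempt_unclamped() still attempts all 4096 bytes, which is the case that ends in write_end() being called with copied < bytes, while attempt_clamped() attempts only 100.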