linux-kernel - Re: [GIT PULL] gfs2 fix

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHc6FU77KGn76B4ieu9Tn895deK-1yV4y=8ou4gTfUf=7C-4XQ@mail.gmail.com>
Date:   Tue, 3 May 2022 10:56:20 +0200
From:   Andreas Gruenbacher <agruenba@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Dave Chinner <dchinner@...hat.com>,
        cluster-devel <cluster-devel@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] gfs2 fix

On Mon, May 2, 2022 at 8:32 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> >
> > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > removed, the corruption goes away.
>
> I looked some more at this on and off, and ended up even more confused.
>
> For some reason, I'd mostly looked at the read case, because I had
> mis-read some of your emails and thought it was the buffered reads
> that caused problems.
>
> Then I went back more carefully, and realized you had always said
> gfs2_file_buffered_write() was where the issues happened, and looked
> at that path more, and that confused me even *MORE*.
>
> Because that case has always done the copy from user space with page
> faults disabled, because of the traditional deadlock with reading from
> user space while holding the page lock on the target page cache page.
>
> So that is not really about the new deadlock with filesystem locks,
> that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> for buffered I/O").
>
> So now that I'm looking at the right function (maybe) I'm going "huh",
> because it's none of the complex cases that would seem to fail, it's
> literally just the fault_in_iov_iter_readable() that we've always done
> in iomap_write_iter() that presumably starts failing.
>
> But *that* old code seems bogus too. It's doing
>
>                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
>                         status = -EFAULT;
>                         break;
>                 }
>
> which on the face of it is sane: it's saying "if we can't fault in any
> bytes, then stop trying".
>
> And it's good, and correct, but it does leave one case open.
>
> Because what if the result is "we can fault things in _partially_"?
>
> The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
>
> Now, with a bug-free filesystem, this really shouldn't matter, since
> the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> but this does mean that one fundamental thing that that commit
> 00bfe02f4796 changed is that it basically disabled that
> fault_in_iov_iter_readable() that *used* to fault in the whole range,
> and now potentially only faults in a small area.
>
> That, in turn, means that in practice it *used* to do "write_end()"
> with a fully successful range, ie when it did that
>
>                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
>                                                 page, fsdata);
>
> then "bytes" and "copied" were the same.
>
> But now that commit 00bfe02f4796 added the "disable_pagefault()"
> around the whole thing, fault_in_iov_iter_readable() will easily fail
> half-way instead of bringing the next page in, and then that
> ->write_begin() to ->write_end() sequence will see the copy in the
> middle failing half-way too, and you'll have that write_end()
> condition with the write _partially_ succeeding.
>
> Which is the complex case for write_end() that you practically
> speaking never saw before (it *could* happen with a race with swap-out
> or similar, but it was not really something you could trigger in real
> life.
>
> And I suspect this is what bites you with gfs2
>
> To *test* that hypothesis, how about you try this attached patch? The
> generic_perform_write() function in mm/filemap.c has the same exact
> pattern, but as mentioned, a filesystem really needs to be able to
> handle the partial write_end() case, so it's not a *bug* in that code,
> but it migth be triggering a bug in gfs2.
>
> And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> only case this attached patch changes.
>
> Again - I think the unpatched iomap_write_iter() code is fine, but I
> think it may be what then triggers the real bug in gfs2. So this patch
> is not wrong per se, but this patch is basically a "hide the problem"
> patch, and it would be very interesting to hear if it does indeed fix
> your test-case.

We still get data corruption with the patch applied. The
WARN_ON_ONCE(!bytes) doesn't trigger.

As an additional experiment, I've added code to check the iterator
position that iomap_file_buffered_write() returns, and it's all
looking good as well: an iov_iter_advance(orig_from, written) from the
original position always gets us to the same iterator.

This points at gfs2 getting things wrong after a short write, for
example, marking a page / folio uptodate that isn't. But the uptodate
handling happens at the iomap layer, so this doesn't leave me with an
immediate suspect.

We're on filesystems with block size == page size, so none of the
struct iomap_page uptodata handling should be involved, either.

> Because that would pinpoint exactly what the bug is.
>
> I'm adding Christoph and Darrick as iomap maintainers here to the
> participants (and Dave Chinner in case he's also the temporary
> maintainer because Darrick is doing reviews) not because they
> necessarily care, but just because this test-patch obviously involves
> the iomap code.
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different. But
> since I finally think I figured out what might be going on, I decided
> I'd send this out sooner rather than later.
>
> Because this is the first thing that makes me go "Aaahh.. This might
> explain it".
>
>                    Linus

Thanks,
Andreas