linux-kernel - Re: [GIT PULL] gfs2 fix

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whL74iP6v2P+OafGO0H72ag4wt42k+Kc_01boLP8aqUNQ@mail.gmail.com>
Date:   Mon, 2 May 2022 11:31:58 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Andreas Gruenbacher <agruenba@...hat.com>,
        Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Dave Chinner <dchinner@...hat.com>
Cc:     cluster-devel <cluster-devel@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] gfs2 fix

On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher
<agruenba@...hat.com> wrote:
>
> Yes, but note that it's gfs2_file_buffered_write() that fails. When
> the pagefault_disable/enable() around iomap_file_buffered_write() is
> removed, the corruption goes away.

I looked some more at this on and off, and ended up even more confused.

For some reason, I'd mostly looked at the read case, because I had
mis-read some of your emails and thought it was the buffered reads
that caused problems.

Then I went back more carefully, and realized you had always said
gfs2_file_buffered_write() was where the issues happened, and looked
at that path more, and that confused me even *MORE*.

Because that case has always done the copy from user space with page
faults disabled, because of the traditional deadlock with reading from
user space while holding the page lock on the target page cache page.

So that is not really about the new deadlock with filesystem locks,
that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
for buffered I/O").

So now that I'm looking at the right function (maybe) I'm going "huh",
because it's none of the complex cases that would seem to fail, it's
literally just the fault_in_iov_iter_readable() that we've always done
in iomap_write_iter() that presumably starts failing.

But *that* old code seems bogus too. It's doing

                if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
                        status = -EFAULT;
                        break;
                }

which on the face of it is sane: it's saying "if we can't fault in any
bytes, then stop trying".

And it's good, and correct, but it does leave one case open.

Because what if the result is "we can fault things in _partially_"?

The code blithely goes on and tries to do the whole 'bytes' range _anyway_.

Now, with a bug-free filesystem, this really shouldn't matter, since
the later copy_page_from_iter_atomic() thing should then DTRT anyway,
but this does mean that one fundamental thing that that commit
00bfe02f4796 changed is that it basically disabled that
fault_in_iov_iter_readable() that *used* to fault in the whole range,
and now potentially only faults in a small area.

That, in turn, means that in practice it *used* to do "write_end()"
with a fully successful range, ie when it did that

                status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                                page, fsdata);

then "bytes" and "copied" were the same.

But now that commit 00bfe02f4796 added the "disable_pagefault()"
around the whole thing, fault_in_iov_iter_readable() will easily fail
half-way instead of bringing the next page in, and then that
->write_begin() to ->write_end() sequence will see the copy in the
middle failing half-way too, and you'll have that write_end()
condition with the write _partially_ succeeding.

Which is the complex case for write_end() that you practically
speaking never saw before (it *could* happen with a race with swap-out
or similar, but it was not really something you could trigger in real
life.

And I suspect this is what bites you with gfs2

To *test* that hypothesis, how about you try this attached patch? The
generic_perform_write() function in mm/filemap.c has the same exact
pattern, but as mentioned, a filesystem really needs to be able to
handle the partial write_end() case, so it's not a *bug* in that code,
but it migth be triggering a bug in gfs2.

And gfs2 only uses the iomap_write_iter() case, I think. So that's the
only case this attached patch changes.

Again - I think the unpatched iomap_write_iter() code is fine, but I
think it may be what then triggers the real bug in gfs2. So this patch
is not wrong per se, but this patch is basically a "hide the problem"
patch, and it would be very interesting to hear if it does indeed fix
your test-case.

Because that would pinpoint exactly what the bug is.

I'm adding Christoph and Darrick as iomap maintainers here to the
participants (and Dave Chinner in case he's also the temporary
maintainer because Darrick is doing reviews) not because they
necessarily care, but just because this test-patch obviously involves
the iomap code.

NOTE! This patch is entirely untested. I also didn't actually yet go
look at what gfs2 does when 'bytes' and 'copied' are different. But
since I finally think I figured out what might be going on, I decided
I'd send this out sooner rather than later.

Because this is the first thing that makes me go "Aaahh.. This might
explain it".

                   Linus

View attachment "patch.diff" of type "text/x-patch" (906 bytes)