Message-Id: <20220503213524.3273690-1-agruenba@redhat.com>
Date:   Tue,  3 May 2022 23:35:24 +0200
From:   Andreas Gruenbacher <agruenba@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andreas Gruenbacher <agruenba@...hat.com>,
        Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Dave Chinner <dchinner@...hat.com>,
        cluster-devel <cluster-devel@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] gfs2 fix

On Tue, May 3, 2022 at 3:30 PM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> > On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> > > >
> > > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > > removed, the corruption goes away.
> > >
> > > I looked some more at this on and off, and ended up even more confused.
> > >
> > > For some reason, I'd mostly looked at the read case, because I had
> > > mis-read some of your emails and thought it was the buffered reads
> > > that caused problems.
> > >
> > > Then I went back more carefully, and realized you had always said
> > > gfs2_file_buffered_write() was where the issues happened, and looked
> > > at that path more, and that confused me even *MORE*.
> > >
> > > Because that case has always done the copy from user space with page
> > > faults disabled, because of the traditional deadlock with reading from
> > > user space while holding the page lock on the target page cache page.
> > >
> > > So that is not really about the new deadlock with filesystem locks,
> > > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > > for buffered I/O").
> > >
> > > So now that I'm looking at the right function (maybe) I'm going "huh",
> > > because it's none of the complex cases that would seem to fail, it's
> > > literally just the fault_in_iov_iter_readable() that we've always done
> > > in iomap_write_iter() that presumably starts failing.
> > >
> > > But *that* old code seems bogus too. It's doing
> > >
> > >                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> > >                         status = -EFAULT;
> > >                         break;
> > >                 }
> > >
> > > which on the face of it is sane: it's saying "if we can't fault in any
> > > bytes, then stop trying".
> > >
> > > And it's good, and correct, but it does leave one case open.
> > >
> > > Because what if the result is "we can fault things in _partially_"?
> > >
> > > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
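
(Purely as an illustration of the pattern being described here, not the actual
iomap code: fault_in_iov_iter_readable() returns the number of bytes that
could *not* be faulted in, so the check above only catches the "nothing at
all" outcome.  A minimal sketch of the three possible results:

	size_t ret = fault_in_iov_iter_readable(i, bytes);

	if (ret == bytes) {		/* nothing could be faulted in */
		status = -EFAULT;
		break;
	}
	/*
	 * ret == 0:         the whole range was faulted in
	 * 0 < ret < bytes:  only part of the range was faulted in, yet the
	 *                   code below still attempts the full 'bytes' copy
	 *                   and relies on copy_page_from_iter_atomic() to
	 *                   handle the partial copy
	 */
)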
> > >
> > > Now, with a bug-free filesystem, this really shouldn't matter, since
> > > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > > but this does mean that one fundamental thing that that commit
> > > 00bfe02f4796 changed is that it basically disabled that
> > > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > > and now potentially only faults in a small area.
> > >
> > > That, in turn, means that in practice it *used* to do "write_end()"
> > > with a fully successful range, ie when it did that
> > >
> > >                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
> > >                                                 page, fsdata);
> > >
> > > then "bytes" and "copied" were the same.
> > >
> > > But now that commit 00bfe02f4796 added the "pagefault_disable()"
> > > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > > half-way instead of bringing the next page in, and then that
> > > ->write_begin() to ->write_end() sequence will see the copy in the
> > > middle failing half-way too, and you'll have that write_end()
> > > condition with the write _partially_ succeeding.
> > >
> > > Which is the complex case for write_end() that you practically
> > > speaking never saw before (it *could* happen with a race with swap-out
> > > or similar, but it was not really something you could trigger in real
> > > life).
> > >
> > > And I suspect this is what bites you with gfs2.
> > >
> > > To *test* that hypothesis, how about you try this attached patch? The
> > > generic_perform_write() function in mm/filemap.c has the same exact
> > > pattern, but as mentioned, a filesystem really needs to be able to
> > > handle the partial write_end() case, so it's not a *bug* in that code,
> > > but it might be triggering a bug in gfs2.
> > >
> > > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > > only case this attached patch changes.
> > >
> > > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > > think it may be what then triggers the real bug in gfs2. So this patch
> > > is not wrong per se, but this patch is basically a "hide the problem"
> > > patch, and it would be very interesting to hear if it does indeed fix
> > > your test-case.
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
> >
> > As an additional experiment, I've added code to check the position
> > of the iterator after iomap_file_buffered_write() returns, and it
> > all looks good as well: an iov_iter_advance(orig_from, written) from
> > the original position always gets us to the same iterator.
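
(Just to illustrate the kind of check meant here; this is a sketch of the
idea, not the exact debugging code, and the variable names are placeholders:

	struct iov_iter orig_from = *from;	/* snapshot of the original position */
	ssize_t written = iomap_file_buffered_write(iocb, from, ops);

	if (written > 0) {
		iov_iter_advance(&orig_from, written);
		/* the advanced snapshot should line up with the iterator
		 * that iomap_file_buffered_write() left behind */
		WARN_ON_ONCE(iov_iter_count(&orig_from) != iov_iter_count(from));
	}
)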
> >
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
> >
> > We're on filesystems with block size == page size, so none of the
> > struct iomap_page uptodate handling should be involved, either.
>
> The rounding around the hole punching in gfs2_iomap_end() looks wrong.
> I'm trying a fix now.

More testing still ongoing, but the following patch seems to fix the
data corruption.  Provided that things go well, I'll send a pull request
tomorrow.

Thanks a lot for the huge amount of help!

Andreas

-

gfs2: Short write fix

When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.

To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the nearest block
boundary.  It would need to round up instead, but the end of the
allocation is also available as iomap->offset + iomap->length, so just
use that.

In addition, use round_up() for computing the start of the unused range.

Signed-off-by: Andreas Gruenbacher <agruenba@...hat.com>

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 39080b2d6cf8..6abd044a176d 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1153,14 +1153,12 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
-		loff_t blockmask = i_blocksize(inode) - 1;
-		loff_t end = (pos + length) & ~blockmask;
+		loff_t hstart = round_up(pos + written, i_blocksize(inode));
+		loff_t hend = iomap->offset + iomap->length;
 
-		pos = (pos + written + blockmask) & ~blockmask;
-		if (pos < end) {
-			truncate_pagecache_range(inode, pos, end - 1);
-			punch_hole(ip, pos, end - pos);
-		}
+		truncate_pagecache_range(inode, hstart, hend - 1);
+		if (hstart < hend)
+			punch_hole(ip, hstart, hend - hstart);
 	}
 
 	if (unlikely(!written))
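
For illustration only (not part of the patch): assume a 4096-byte block size,
a 10000-byte write attempted at offset 100 for which blocks covering
[0, 12288) were newly allocated (so iomap->offset = 0 and iomap->length =
12288), and a short write of only 4000 bytes.  These numbers are made up for
the example, but they show how the old code can end up with an empty
deallocation range while the new code releases the unused tail of the
allocation:

	#include <stdio.h>

	#define BLKSIZE 4096ULL

	static unsigned long long round_up_blk(unsigned long long x)
	{
		return (x + BLKSIZE - 1) & ~(BLKSIZE - 1);
	}

	int main(void)
	{
		/* assumed example values, see above */
		unsigned long long pos = 100, length = 10000, written = 4000;
		unsigned long long iomap_offset = 0, iomap_length = 12288;

		/* old code: end of the attempted write, rounded down */
		unsigned long long old_start = round_up_blk(pos + written);
		unsigned long long old_end = (pos + length) & ~(BLKSIZE - 1);

		/* new code: end of the allocation, taken from the iomap */
		unsigned long long new_start = round_up_blk(pos + written);
		unsigned long long new_end = iomap_offset + iomap_length;

		printf("old: [%llu, %llu) -> %s\n", old_start, old_end,
		       old_start < old_end ? "punch" : "nothing punched");
		printf("new: [%llu, %llu) -> punch %llu bytes\n",
		       new_start, new_end, new_end - new_start);
		return 0;
	}

With these values, the old code computes start == end == 8192 and skips the
deallocation entirely, while the new code punches out the unused range
[8192, 12288).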
