lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wjMB1-xCOCBtsSMmQuFV9G+vNyCY1O_LsoqOd=0QS4yYg@mail.gmail.com>
Date:   Tue, 26 Apr 2022 16:33:02 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Andreas Gruenbacher <agruenba@...hat.com>
Cc:     cluster-devel <cluster-devel@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] gfs2 fix

On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <agruenba@...hat.com> wrote:
>
> Btrfs has a comment in that place that reads:
>
>     /* No increment (+=) because iomap returns a cumulative value. */

What a truly horrid interface. But you are triggering repressed
childhood memories:

> That's so that we can complete the tail of an asynchronous write
> asynchronously after a failed page fault and subsequent fault-in.

yeah, that makes me go "I remember something like that".

> That would be great, but applications don't seem to be able to cope
> with short direct writes, so we must turn partial failure into total
> failure here. There's at least one xfstest that checks for that as
> well.

What a complete crock. You're telling me that you did some writes, but
then you can't tell user space that writes happened because that would
confuse things.

Direct-IO is some truly hot disgusting garbage.

Happily it's only used for things like databases that nobody sane
would care about anyway.

Anyway, none of that makes any sense, since you do this:

                ret = gfs2_file_direct_write(iocb, from, &gh);
                if (ret < 0 || !iov_iter_count(from))
                        goto out_unlock;

                iocb->ki_flags |= IOCB_DSYNC;
                buffered = gfs2_file_buffered_write(iocb, from, &gh);

so you're saying that the direct write will never partially succeed,
but then in gfs2_file_write_iter() it very much looks like it's
falling back to buffered writes for that case.

Hmm. Very odd.

I assume this is all due to that

        /* Silently fall back to buffered I/O when writing beyond EOF */
        if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))

thing, so this all makes some perverse kind of sense, but it still
looks like this code just needs some serious serious commentary.

So you *can* have a partial write if you hit the end of the file, and
then you'll continue that partial write with the buffered code.

But an actual _failure_ will not do that, and instead return an error
even if the write was partially done.

But then *some* failures aren't failures at all, and without any
comments do this

        if (ret == -ENOTBLK)
                ret = 0;


And remind me again - this all is meant for applications that
supposedly care about consistency on disk?

And the xfs tests actually have a *test* for that case, to make sure
that nobody can sanely *really* know how much of the write succeeded
if it was a DIO write?

Gotcha.

> > The reason I think I'm full of sh*t is that you say that the problem
> > occurs in gfs2_file_buffered_write(), not in that
> > gfs2_file_direct_write() case.
>
> Right, we're having that issue with buffered writes.

I have to say, compared to all the crazy things I see in the DIO path,
the buffered write path actually looks almost entirely sane.

Of course, gfs2_file_read_iter() counts how many bytes it has read in
a variable called 'written', and gfs2_file_buffered_write() counts the
bytes it has written in a variable called 'read', so "entirely sane"
is all very very relative.

I'm sure there's some good reason (job security?) for all this insanity.

But I now have to go dig my eyes out with a rusty spoon.

But before I do that, I have one more question (I'm going to regret
this, aren't I?):

In gfs2_file_buffered_write(), when it has done that
fault_in_iov_iter_readable(), and it decides that that succeeded, it
does

                        if (gfs2_holder_queued(gh))
                                goto retry_under_glock;
                        if (read && !(iocb->ki_flags & IOCB_DIRECT))
                                goto out_uninit;
                        goto retry;

so if it still has that lock (if I understand correctly), it will always retry.

But if it *lost* the lock, it will retry only if was a direct write,
and return a partial success for a regular write() rather than
continue the write.

Whaa? I'm assuming that this is more of that "direct writes have to
succeed fully or not at all", but according to POSIX *regular* writes
should also succeed fully, unless some error happens.

Losing the lock doesn't sound like an error to me.

And there are a lot of applications that do assume "write to a regular
file didn't complete fully means that the disk is full" etc. Because
that's the traditional meaning.

This doesn't seem to be related to any data corruption issue, but it does smell.

             Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ