linux-kernel - Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHc6FU4Mx_=qMYOBc0VYdn-paFXKVffq=k72LCJgTWONv9chng@mail.gmail.com>
Date:   Wed, 27 Nov 2019 17:29:36 +0100
From:   Andreas Gruenbacher <agruenba@...hat.com>
To:     Steven Whitehouse <swhiteho@...hat.com>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Andreas Grünbacher <andreas.gruenbacher@...il.com>,
        Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Linux-MM <linux-mm@...ck.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Johannes Weiner <hannes@...xchg.org>,
        "cluster-devel@...hat.com" <cluster-devel@...hat.com>,
        Ronnie Sahlberg <lsahlber@...hat.com>,
        Steve French <sfrench@...ba.org>,
        Bob Peterson <rpeterso@...hat.com>
Subject: Re: [PATCH] mm/filemap: do not allocate cache pages beyond end of
 file at read

On Wed, Nov 27, 2019 at 4:42 PM Steven Whitehouse <swhiteho@...hat.com> wrote:
> Hi,
>
> On 25/11/2019 17:05, Linus Torvalds wrote:
> > On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse <swhiteho@...hat.com> wrote:
> >> Linus, is that roughly what you were thinking of?
> > So the concept looks ok, but I don't really like the new flags as they
> > seem to be gfs2-specific rather than generic.
> >
> > That said, I don't _gate_ them either, since they aren't in any
> > critical code sequence, and it's not like they do anything really odd.
> >
> > I still think the whole gfs2 approach is broken. You're magically ok
> > with using stale data from the cache just because it's cached, even if
> > another client might have truncated the file or something.
>
> If another node tries to truncate the file, that will require an
> exclusive glock, and in turn that means the all the other nodes will
> have to drop their glock(s) shared or exclusive. That process
> invalidates the page cache on those nodes, such that any further
> requests on those nodes will find the cache empty and have to call into
> the filesystem.
>
> If a page is truncated on another node, then when the local node gives
> up its glock, after any copying (e.g. for read) has completed then the
> truncate will take place. The local node will then have to reread any
> data relating to new pages or return an error in case the next page to
> be read has vanished due to the truncate. It is a pretty small window,
> and the advantage is that in cases where the page is in cache, we can
> directly use the cached page without having to call into the filesystem
> at all. So it is page atomic in that sense.
>
> The overall aim here is to avoid taking (potentially slow) cluster locks
> when at all possible, yet at the same time deliver close to local fs
> semantics whenever we can. You can think of GFS2's glock concept (at
> least as far as the inodes we are discussing here) as providing a
> combination of (page) cache control and cluster (dlm) locking.
>
> >
> > So you're ok with saying "the file used to be X bytes in size, so
> > we'll just give you this data because we trust that the X is correct".
> >
> > But you're not ok to say "oh, the file used to be X bytes in size, but
> > we don't want to give you a short read because it might not be correct
> > any more".
> >
> > See the disconnect? You trust the size in one situation, but not in another one.
>
> Well we are not trusting the size at all... the original algorithm
> worked entirely off "is this page in cache and uptodate?" and for
> exactly the reason that we know the size in the inode might be out of
> date, if we are not currently holding a glock in either shared or
> exclusive mode. We also know that if there is a page in cache and
> uptodate then we must be holding the glock too.
>
>
> >
> > I also don't really see that you *need* the new flag at all. Since
> > you're doing to do a speculative read and then a real read anyway, and
> > since the only thing that you seem to care about is the file size
> > (because the *data* you will trust if it is cached), then why don't
> > you just use the *existing* generic read, and *IFF* you get a
> > truncated return value, then you go and double-check that the file
> > hasn't changed in size?
> >
> > See what I'm saying? I think gfs2 is being very inconsistent in when
> > it trusts the file size, and I don't see that you even need the new
> > behavior that patch gives, because you might as well just use the
> > existing code (just move the i_size check earlier, and then teach gfs2
> > to double-check the "I didn't get as much as I expected" case).

We can identify short reads, but we won't get information about
readahead back from generic_file_read_iter or filemap_fault. We could
try to work around this with filesystem specific flags for ->readpage
and ->readpages, but that would break down with multiple concurrent
readers in addition to being a real mess. I'm currently out of better
ideas that avoid duplicating the generic code.

> >                   Linus
>
> I'll leave the finer details to Andreas here, since it is his patch, and
> hopefully we can figure out a good path forward. We are perhaps also a
> bit reluctant to go off and (nearly) duplicate code that is already in
> the core vfs library functions, since that often leads to things getting
> out of sync (our implementation of ->writepages is one case where that
> happened in the past) and missing important bug fixes/features in some
> cases. Hopefully though we can iterate on this a bit and come up with
> something which will resolve all the issues,
>
> Steve.

Thanks,
Andreas