linux-kernel - Re: fscache: Redesigning the on-disk cache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <517184.1615194835@warthog.procyon.org.uk>
Date:   Mon, 08 Mar 2021 09:13:55 +0000
From:   David Howells <dhowells@...hat.com>
To:     Amir Goldstein <amir73il@...il.com>
Cc:     dhowells@...hat.com, linux-cachefs@...hat.com,
        Jeff Layton <jlayton@...hat.com>,
        David Wysochanski <dwysocha@...hat.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        "J. Bruce Fields" <bfields@...ldses.org>,
        Christoph Hellwig <hch@...radead.org>,
        Dave Chinner <dchinner@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        linux-afs@...ts.infradead.org,
        Linux NFS Mailing List <linux-nfs@...r.kernel.org>,
        CIFS <linux-cifs@...r.kernel.org>,
        ceph-devel <ceph-devel@...r.kernel.org>,
        v9fs-developer@...ts.sourceforge.net,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Miklos Szeredi <miklos@...redi.hu>
Subject: Re: fscache: Redesigning the on-disk cache

Amir Goldstein <amir73il@...il.com> wrote:

> >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> >       file for every whole operation (which may combine reads and writes).
> 
> I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> to keep track of present data, it's hard to distinguish between an
> invalid cached range and a valid "cached hole".

I wasn't exactly intending to permit caching over NFS.  That leads to fun
making sure that the superblock you're caching isn't the one that has the
cache in it.

However, we will need to handle hole-punching being done on a cached netfs,
even if that's just to completely invalidate the cache for that file.

> With ->fiemap() you can at least make the distinction between a non existing
> and an UNWRITTEN extent.

I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
assertion is that the cache can't rely on the backing filesystem's metadata
because these can arbitrarily insert or remove blocks of zeros to bridge or
split extents.

> You didn't say much about crash consistency or durability requirements of the
> cache. Since cachefiles only syncs the cache on shutdown, I guess you
> rely on the hosting filesystem to provide the required ordering guarantees.

There's an xattr on each file in the cache to record the state.  I use this
mark a cache file "open".  If, when I look up a file, the file is marked open,
it is just discarded at the moment.

Now, there are two types of data stored in the cache: data that has to be
stored as a single complete blob and is replaced as such (e.g. symlinks and
AFS dirs) and data that might be randomly modified (e.g. regular files).

For the former, I have code, though in yet another branch, that writes this in
a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.

For the latter, that's harder to do as it would require copying the data to
the tmpfile before we're allowed to modify it.  However, if it's possible to
create a tmpfile that's a CoW version of a data file, I could go down that
route.

But after I've written and sync'd the data, I set the xattr to mark the file
not open.  At the moment I'm doing this too lazily, only doing it when a netfs
file gets evicted or when the cache gets withdrawn, but I really need to add a
queue of objects to be sealed as they're closed.  The balance is working out
how often to do the sealing as something like a shell script can do a lot of
consecutive open/write/close ops.

> How does this work with write through network fs cache if the client system
> crashes but the write gets to the server?

The presumption is that the coherency info on the server will change, but
won't get updated in the cache.

> Client system get restart with older cached data because disk caches were
> not flushed before crash. Correct?  Is that case handled? Are the caches
> invalidated on unclean shutdown?

The netfs provides some coherency info for the cache to store.  For AFS, for
example, this is the data version number (though it should probably include
the volume creation time too).  This is stored with the state info in the same
xattr and is only updated when the "open" state is cleared.

When the cache file is reopened, if the coherency info doesn't match what
we're expecting (presumably we queried the server), the file is discarded.

(Note that the coherency info is netfs-specific)

> Anyway, how are those ordering requirements going to be handled when entire
> indexing is in a file? You'd practically need to re-implement a filesystem

Yes, the though has occurred to me too.  I would be implementing a "simple"
filesystem - and we have lots of those:-/.  The most obvious solution is to
use the backing filesystem's metadata - except that that's not possible.

> journal or only write cache updates to a temp file that can be discarded at
> any time?

It might involve keeping a bitmap of "open" blocks.  Those blocks get
invalidated when the cache restarts.  The simplest solution would be to wipe
the entire cache in such a situation, but that goes against one of the
important features I want out of it.

Actually, a journal of open and closed blocks might be better, though all I
really need to store for each block is a 32-bit number.

It's a particular problem if I'm doing DIO to the data storage area but
buffering the changes to the metadata.  Further, the metadata and data might
be on different media, just to add to the complexity.

Another possibility is only to cull blocks when the parent file is culled.
That probably makes more sense as, as long as the file is registered culled on
disk first and I don't reuse the file slot too quickly, I can write to the
data store before updating the metadata.

> If you come up with a useful generic implementation of a "file data
> overlay", overlayfs could also use it for "partial copy up" as well as for
> implementation of address space operations, so please keep that in mind.

I'm trying to implement things so that the netfs does look-aside when reading,
and multi-destination write-back when writing - but the netfs is in the
driving seat and the cache is invisible to the user.  I really want to avoid
overlaying the cache on the netfs so that the cache is the primary access
point.

David