linux-kernel - Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

Message-ID: <584529.1615202921@warthog.procyon.org.uk>
Date:   Mon, 08 Mar 2021 11:28:41 +0000
From:   David Howells <dhowells@...hat.com>
To:     Amir Goldstein <amir73il@...il.com>
Cc:     dhowells@...hat.com, linux-cachefs@...hat.com,
        Jeff Layton <jlayton@...hat.com>,
        David Wysochanski <dwysocha@...hat.com>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        "J. Bruce Fields" <bfields@...ldses.org>,
        Christoph Hellwig <hch@...radead.org>,
        Dave Chinner <dchinner@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        linux-afs@...ts.infradead.org,
        Linux NFS Mailing List <linux-nfs@...r.kernel.org>,
        CIFS <linux-cifs@...r.kernel.org>,
        ceph-devel <ceph-devel@...r.kernel.org>,
        v9fs-developer@...ts.sourceforge.net,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Miklos Szeredi <miklos@...redi.hu>
Subject: Metadata writtenback notification? -- was Re: fscache: Redesigning the on-disk cache

Amir Goldstein <amir73il@...il.com> wrote:

> > But after I've written and sync'd the data, I set the xattr to mark the
> > file not open.  At the moment I'm doing this too lazily, only doing it
> > when a netfs file gets evicted or when the cache gets withdrawn, but I
> > really need to add a queue of objects to be sealed as they're closed.  The
> > balance is working out how often to do the sealing as something like a
> > shell script can do a lot of consecutive open/write/close ops.
> 
> You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
> For example, xfs keeps the "LSN" on each inode, so once the transaction
> with some LSN has been committed, all the relevant inodes, if not dirty, can
> be declared as synced, without having to call fsync() on any file and without
> having to force transaction commit or any IO at all.
> 
> Since fscache takes care of submitting the IO, and it shouldn't care about any
> specific time that the data/metadata hits the disk(?), you can make use of the
> existing periodic writeback and rolling transaction commit and only ever need
> to wait for that to happen before marking cache files "closed".
> 
> There was a discussion about fsyncing a range of files on LSFMM [1].
> In the last comment on the article dchinner argues why we already have that
> API (and now also with io_uring), but AFAIK, we do not have a useful
> wait_for_sync() API. And it doesn't need to be exposed to userspace at all.
> 
> [1] https://lwn.net/Articles/789024/

This sounds like an interesting idea.  Actually, what I probably want is a
notification to say that a particular object has been completely sync'd to
disk, metadata and all.

I'm not sure that io_uring is particularly usable from within the kernel,
though.
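
As a rough illustration of the notification idea, here is a toy userspace
model in Python. It is not a kernel API: the ToyJournal class and its methods
are hypothetical names, and the LSN handling only mimics the general shape of
what Amir describes for xfs. Each modification is given an LSN, and a waiter
is called back once a commit has covered that LSN, with no fsync() call on any
individual file.

```python
class ToyJournal:
    """Toy stand-in for a journalling fs; every name here is hypothetical."""

    def __init__(self):
        self.next_lsn = 1        # LSN handed to the next modification
        self.committed_lsn = 0   # highest LSN known to be durable
        self.waiters = []        # (lsn, object, callback) not yet satisfied

    def modify(self):
        """Record a change and return the LSN that covers it."""
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def notify_when_synced(self, obj, lsn, callback):
        """Call callback(obj) once everything up to lsn has been committed."""
        if lsn <= self.committed_lsn:
            callback(obj)
        else:
            self.waiters.append((lsn, obj, callback))

    def commit(self, up_to_lsn):
        """Simulate the periodic transaction commit reaching up_to_lsn."""
        self.committed_lsn = max(self.committed_lsn, up_to_lsn)
        ready = [w for w in self.waiters if w[0] <= self.committed_lsn]
        self.waiters = [w for w in self.waiters if w[0] > self.committed_lsn]
        for _, obj, callback in ready:
            callback(obj)


sealed = []                      # objects that could now be marked "closed"
journal = ToyJournal()
lsn_a = journal.modify()
lsn_b = journal.modify()
journal.notify_when_synced("cache-file-A", lsn_a, sealed.append)
journal.notify_when_synced("cache-file-B", lsn_b, sealed.append)
journal.commit(lsn_a)            # only cache-file-A is covered by this commit
```

In this model the cache never forces a commit itself; it simply learns about
commits that were going to happen anyway and seals objects at that point.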

> If I were you, I would try to avoid re-implementing a journaled filesystem or
> a database for fscache and try to make use of crash consistency guarantees
> that filesystems already provide.
> Namely, use the data dependency already provided by temp files.
> It doesn't need to be one temp file per cached file.
> 
> Always easier said than done ;-)

Yes.

There are a number of considerations I have to deal with, and they're somewhat
at odds with each other:

 (1) I need to record what data I have stored from a file.

 (2) I need to record where I stored the data.

 (3) I need to make sure that I don't see old data.

 (4) I need to make sure that I don't see data in the wrong file.

 (5) I need to make sure I lose as little as possible on a crash.

 (6) I want to be able to record what changes were made in the event we're
     disconnected from the server.

For my fscache-iter branch, (1) is done with a map in an xattr, but I only
cache up to 1G in a file at the moment; (2), (4) and, to some extent, (5) are
handled by the backing fs; (3) is handled by tagging the file and storing
coherency data in an xattr (though tmpfiles are used on full invalidation).
(6) is not yet supported.

For upstream, (1), (2), (4) and to some extent (5) are handled through the
backing fs.  (3) is handled by storing coherency data in an xattr and
truncating the file on invalidation; (6) is not yet supported.
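
A minimal sketch of how a content map like the one in (1) might work, written
in Python for brevity. The 256 KiB granule size, the bit layout and the
function names are assumptions made for illustration and are not taken from
the fscache-iter branch; only the 1G limit comes from the text above. With
those assumptions the whole map fits in a 512-byte xattr value.

```python
GRANULE = 256 * 1024                      # assumed bytes of data per map bit
MAX_CACHED = 1 << 30                      # the 1G per-file limit noted above
MAP_BYTES = MAX_CACHED // GRANULE // 8    # size of the xattr value in bytes


def mark_cached(bitmap, start, length):
    """Set the bit for every granule fully covered by [start, start+length)."""
    first = -(-start // GRANULE)          # round the start up to a granule
    last = (start + length) // GRANULE    # round the end down to a granule
    for g in range(first, last):
        bitmap[g // 8] |= 1 << (g % 8)


def is_cached(bitmap, start, length):
    """True only if every granule touched by the range is marked present."""
    first = start // GRANULE
    last = -(-(start + length) // GRANULE)
    return all(bitmap[g // 8] & (1 << (g % 8)) for g in range(first, last))


content_map = bytearray(MAP_BYTES)        # what would be stored in the xattr
mark_cached(content_map, 0, 3 * GRANULE)  # first three granules now present
```

Marking rounds inwards and checking rounds outwards, so a partially written
granule is never reported as present. Ranges beyond MAX_CACHED are not handled
in this sketch.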

However, there are some performance problems arising in my fscache-iter
branch:

 (1) It's doing a lot of synchronous metadata operations (tmpfile, truncate,
     setxattr).

 (2) It's retaining a lot of open file structs on cache files.  Cachefiles
     opens the file when it's first asked to access it and retains that till
     the cookie is relinquished or the cache withdrawn (the file* doesn't
     contribute to ENFILE/EMFILE but it still eats memory).

     I can mitigate this by closing much sooner, perhaps opening the file for
     each operation - but at the cost of having to spend time doing more opens
     and closes.  What's in upstream gets away without having to do open/close
     for reads because it calls readpage.

     Alternatively, I can have a background file closer - which requires an
     LRU queue.  This could be combined with a file "sealer".

     Deferred writeback on the netfs starting writes to the cache makes this
     more interesting as I have to retain the interest on the cache object
     beyond the netfs file being closed.

 (3) Trimming excess data from the end of the cache file.  The problem with
     using DIO to write to the cache is that the write has to be rounded up to
     a multiple of the backing fs DIO blocksize, but if the file is later
     truncated to a larger size, that excess data becomes part of the file.

     Possibly it's sufficient to just clear the excess page space before
     writing, but that doesn't necessarily stop a writable mmap from
     scribbling on it.

 (4) Committing outstanding cache metadata at cache withdrawal or netfs
     unmount.  I've previously mentioned this: it ends up with a whole slew of
     synchronous metadata changes being committed to the cache in one go
     (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
     can take quite a long time.  The cache needs to be more proactive in
     getting stuff committed as it goes along.

 (5) Attaching to an object requires a pathwalk to it (normally only two
     steps) and then reading various xattrs on it - all synchronous, but can
     be punted to a background threadpool.
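
The rounding problem in point (3) can be shown with a little arithmetic. The
sketch below is in Python; the 4096-byte blocksize is an assumption (the real
value depends on the backing fs and device) and the function names are
hypothetical.

```python
def dio_write_span(offset, length, blocksize):
    """Return (start, end) of the block-aligned span a DIO write must cover."""
    start = offset - (offset % blocksize)                   # round start down
    end = -(-(offset + length) // blocksize) * blocksize    # round end up
    return start, end


def excess_bytes(offset, length, blocksize):
    """Bytes past the real end of the data that the rounded write also covers."""
    _, end = dio_write_span(offset, length, blocksize)
    return end - (offset + length)


BLOCKSIZE = 4096                                  # assumed DIO blocksize
span = dio_write_span(0, 10000, BLOCKSIZE)        # 10000 bytes of real data
excess = excess_bytes(0, 10000, BLOCKSIZE)
```

Here span is (0, 12288) and excess is 2288: the write covers 2288 bytes that
are not file data. If the file size is later raised past 10000, whatever was
in those bytes becomes visible as file content unless it was cleared first.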

Amongst the reasons I was considering moving to an index and a single datafile
are to replace the path-lookup step for each object and the xattr reads with a
lookup in a single file, and to reduce the number of open files in the cache
at any one time to around four.
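
To make the contrast concrete, here is a toy Python model of an index plus a
single datafile. It is only a sketch under heavy assumptions: the ToyCache
class, the key format and the index entry layout are all hypothetical, an
in-memory buffer stands in for the datafile, and none of the crash-consistency
questions discussed above are addressed.

```python
import io


class ToyCache:
    """Toy index + single datafile; all names and layouts are hypothetical."""

    def __init__(self):
        self.index = {}            # object key -> (offset, length, coherency)
        self.data = io.BytesIO()   # stands in for the one data file

    def store(self, key, coherency, payload):
        """Append payload to the datafile and record where it went."""
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(payload)
        self.index[key] = (offset, len(payload), coherency)

    def lookup(self, key, coherency):
        """Return the cached bytes, or None on a miss or stale coherency data."""
        entry = self.index.get(key)
        if entry is None or entry[2] != coherency:
            return None
        offset, length, _ = entry
        self.data.seek(offset)
        return self.data.read(length)


cache = ToyCache()
cache.store("vol1/fileA", 7, b"hello")
cache.store("vol1/fileB", 1, b"world!")
```

Attaching to an object becomes one index lookup that yields both the location
and the coherency data, rather than a pathwalk followed by xattr reads, and
only the index and the datafile need to be held open.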

David