linux-kernel - [PATCH RFC 0/9] fs: multigrain timestamps (redux)

Message-Id: <20231018-mgtime-v1-0-4a7a97b1f482@kernel.org>
Date:   Wed, 18 Oct 2023 13:41:07 -0400
From:   Jeff Layton <jlayton@...nel.org>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        Chandan Babu R <chandan.babu@...cle.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        Dave Chinner <david@...morbit.com>,
        Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Amir Goldstein <amir73il@...il.com>, Jan Kara <jack@...e.de>,
        David Howells <dhowells@...hat.com>
Cc:     linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-btrfs@...r.kernel.org, linux-mm@...ck.org,
        linux-nfs@...r.kernel.org, Jeff Layton <jlayton@...nel.org>
Subject: [PATCH RFC 0/9] fs: multigrain timestamps (redux)

The VFS always uses coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this coarseness has always been an issue when we're
exporting via NFSv3, which relies on timestamps to validate caches. A
lot of changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide to invalidate the cache.

Even with NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps (e.g.,
backup applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. The idea is to use an unused bit in the ctime's
tv_nsec field to mark when the mtime or ctime has been queried via
getattr. Once that has been marked, the next m/ctime update will use a
fine-grained timestamp.
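
As a rough illustration of the flag-bit idea (a userspace sketch, not the
code in this series): valid nanosecond values are below 10^9 and so fit in
30 bits, which leaves the upper bits of a 32-bit nsec field free. The
struct, the function names and the choice of bit 30 are all assumptions
made for this example, and a real implementation would need atomic
updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000u
/* Hypothetical flag name and bit position, chosen for illustration only. */
#define CTIME_QUERIED (1u << 30)

/* Hypothetical stand-in for the inode's ctime fields. */
struct demo_inode {
	int64_t  ctime_sec;
	uint32_t ctime_nsec;	/* low 30 bits: nanoseconds; bit 30: queried */
};

/* getattr path: report the nanoseconds and remember that someone looked. */
static uint32_t demo_getattr_nsec(struct demo_inode *inode)
{
	inode->ctime_nsec |= CTIME_QUERIED;
	return inode->ctime_nsec & ~CTIME_QUERIED;
}

/* Update path: has the current timestamp been queried since it was set? */
static bool demo_needs_fine(const struct demo_inode *inode)
{
	return (inode->ctime_nsec & CTIME_QUERIED) != 0;
}

/* Storing a new timestamp overwrites the field, which clears the flag. */
static void demo_set_ctime(struct demo_inode *inode, int64_t sec, uint32_t nsec)
{
	assert(nsec < NSEC_PER_SEC);
	inode->ctime_sec = sec;
	inode->ctime_nsec = nsec;
}
```

The caller of demo_set_ctime() would check demo_needs_fine() first to
decide which clock to read.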

The original merge of multigrain timestamps for v6.6 had to be reverted,
as a file with a coarse-grained timestamp could incorrectly appear to be
modified before a file with a fine-grained timestamp, when that wasn't
the case.

This revision solves that problem by making it so that when a
fine-grained timespec64 is handed out, that value becomes the floor
for further coarse-grained timespec64 fetches. This requires new
timekeeper interfaces with a potential downside: when a file is
stamped with a fine-grained timestamp, it has to (briefly) take the
global timekeeper spinlock.

Because of that, this set takes greater pains to avoid issuing new
fine-grained timestamps when possible. A fine-grained timestamp is now
only required if the current mtime or ctime has been fetched for a
getattr, and the next coarse-grained tick has not happened yet. For any
other case, a coarse-grained timestamp is fine, and that is done using
the seqcount.
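
The decision described above might be sketched as follows. The function
name and its parameters are hypothetical, and both clock readings are
passed in as plain nanosecond values so that the logic can be read in
isolation. In the series itself the fine-grained clock would only be read
on the branch that needs it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the timestamp for an m/ctime update (illustrative only).
 *
 * inode_ns:      timestamp currently stored in the inode
 * queried:       whether that timestamp was fetched via getattr
 * coarse_now_ns: current coarse-grained clock value
 * fine_now_ns:   current fine-grained clock value
 * used_fine:     set to true if the fine-grained value was chosen
 */
static uint64_t demo_pick_timestamp(uint64_t inode_ns, bool queried,
				    uint64_t coarse_now_ns,
				    uint64_t fine_now_ns, bool *used_fine)
{
	assert(fine_now_ns >= coarse_now_ns);

	/*
	 * Nobody has seen the old value, or a tick has passed since it was
	 * stored: the coarse value is good enough.
	 */
	if (!queried || coarse_now_ns > inode_ns) {
		*used_fine = false;
		return coarse_now_ns;
	}

	/* Queried, and still within the same tick: need a fine timestamp. */
	*used_fine = true;
	return fine_now_ns;
}
```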

In order to get some hard numbers about how often the lock would be
taken, I've added a couple of percpu counters and a debugfs file for
tracking both types of multigrain timekeeper fetches.

With this, I did a kdevops fstests run on xfs (CRC mode). I ran "make
fstests-baseline" and then immediately grabbed the counter values, and
calculated the percentage:

$ time make fstests-baseline
real    324m17.337s
user    27m23.213s
sys     2m40.313s

fine            3059498
coarse          383848171
pct fine        .79075661

Next I did a kdevops fstests run with NFS. One server serving 3 clients
(v4.2, v4.0 and v3). Again, timed "make fstests-baseline" and then
grabbed the multigrain counters from the NFS server:

$ time make fstests-baseline
real    181m57.585s
user    16m8.266s
sys     1m45.864s

fine            8137657
coarse          44726007
pct fine        15.393668
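
For reference, the "pct fine" figures in both runs are consistent with
fine / (fine + coarse) * 100. A small helper showing that arithmetic (the
function name is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of all multigrain fetches that were fine-grained. */
static double demo_pct_fine(uint64_t fine, uint64_t coarse)
{
	assert(fine <= UINT64_MAX - coarse);	/* guard against overflow */

	uint64_t total = fine + coarse;

	return total ? 100.0 * (double)fine / (double)total : 0.0;
}
```

Feeding it the xfs counters (3059498, 383848171) gives roughly 0.79, and
the NFS counters (8137657, 44726007) give roughly 15.39.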

We can't run as many tests on nfs as xfs, so the run is shorter. nfsd is
a very getattr-heavy workload, and the clients aggressively coalesce
writes, so this is probably something of a pessimal case for number of
fine-grained timestamps over time.

At this point I'm mainly wondering whether (briefly) taking the
timekeeper spinlock in this codepath is unreasonable. It does very
little work under it, so I'm hoping the impact would be unmeasurable for
most workloads.

Side Q: what's the best tool for measuring spinlock contention? It'd be
interesting to see how often (and how long) we end up spinning on this
lock under different workloads.

Note that some of the patches in the series are virtually identical to
the ones before. I stripped the prior Reviewed-by/Acked-by tags though
since the underlying infrastructure has changed a bit.

Comments and suggestions welcome.

Signed-off-by: Jeff Layton <jlayton@...nel.org>
---
Jeff Layton (9):
      fs: switch timespec64 fields in inode to discrete integers
      timekeeping: new interfaces for multigrain timestamp handing
      timekeeping: add new debugfs file to count multigrain timestamps
      fs: add infrastructure for multigrain timestamps
      fs: have setattr_copy handle multigrain timestamps appropriately
      xfs: switch to multigrain timestamps
      ext4: switch to multigrain timestamps
      btrfs: convert to multigrain timestamps
      tmpfs: add support for multigrain timestamps

 fs/attr.c                           |  52 ++++++++++++++--
 fs/btrfs/file.c                     |  25 ++------
 fs/btrfs/super.c                    |   5 +-
 fs/ext4/super.c                     |   2 +-
 fs/inode.c                          |  70 ++++++++++++++++++++-
 fs/stat.c                           |  41 ++++++++++++-
 fs/xfs/libxfs/xfs_trans_inode.c     |   6 +-
 fs/xfs/xfs_iops.c                   |  10 +--
 fs/xfs/xfs_super.c                  |   2 +-
 include/linux/fs.h                  |  85 ++++++++++++++++++--------
 include/linux/timekeeper_internal.h |   2 +
 include/linux/timekeeping.h         |   4 ++
 kernel/time/timekeeping.c           | 117 ++++++++++++++++++++++++++++++++++++
 mm/shmem.c                          |   2 +-
 14 files changed, 352 insertions(+), 71 deletions(-)
---
base-commit: 12cd44023651666bd44baa36a5c999698890debb
change-id: 20231016-mgtime-fe3ea75c6f59

Best regards,
-- 
Jeff Layton <jlayton@...nel.org>