linux-kernel - Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wj6wy6tNUQm6EtgxfE_J229y1DthpCguqQfTej71yiJXw@mail.gmail.com>
Date:   Wed, 1 Nov 2023 10:10:03 -1000
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Jan Kara <jack@...e.cz>
Cc:     Dave Chinner <david@...morbit.com>,
        Jeff Layton <jlayton@...nel.org>,
        Amir Goldstein <amir73il@...il.com>,
        Kent Overstreet <kent.overstreet@...ux.dev>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephen Boyd <sboyd@...nel.org>,
        Chandan Babu R <chandan.babu@...cle.com>,
        "Darrick J. Wong" <djwong@...nel.org>,
        "Theodore Ts'o" <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>,
        Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jan Kara <jack@...e.de>, David Howells <dhowells@...hat.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-btrfs@...r.kernel.org, linux-mm@...ck.org,
        linux-nfs@...r.kernel.org
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain
 timestamp handing

On Wed, 1 Nov 2023 at 00:16, Jan Kara <jack@...e.cz> wrote:
>
> OK, but is this compatible with the current XFS behavior? AFAICS currently
> XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will
> have some mostly random garbage in low bits of the ctime.

I really *really* don't think we can use ctime as a "i_version"
replacement. The whole fine-granularity patches were well-intentioned,
but I do think they were broken.

Note that we can't use ctime as a "i_version" replacement for other
reasons too - you have filesystems like FAT - which people do want to
export - that have a single-second (or is it 2s?) granularity in
reality, even though they report a 1ns value in s_time_gran.

But here's a suggestion that people may hate, but that might just work
in practice:

 - get rid of i_version entirely

 - use the "known good" part of ctime as the upper bits of the change
counter (and by "known good" I mean tv_sec - or possibly even "tv_sec
/ 2" if that dim FAT memory of mine is right)

 - make the rule be that ctime is *never* updated for atime updates
(maybe that's already true, I didn't check - maybe it needs a new
mount flag for nfsd)

 - have a per-inode in-memory and vfs-internal (entirely invisible to
filesystems) "ctime modification counter" that is *NOT* a timestamp,
and is *NOT* i_version

 - make the rule be that the "ctime modification counter" is always
zero, *EXCEPT* if
    (a) I_VERSION_QUERIED is set
   AND
    (b) the ctime modification doesn't modify the "known good" part of ctime

so how the "statx change cookie" ends up being "high bits tv_sec of
ctime, low bits ctime modification cookie", and the end result of that
is:

 - if all the reads happen after the last write (common case), then
the low bits will be zero, because I_VERSION_QUERIED wasn't set when
ctime was modified

 - if you do a write *after* a modification, the ctime cookie is
guaranteed to change, because either the known good (sec/2sec) part of
ctime is new, *or* the counter gets updated

 - if the nfs server reboots, the in-memory counter will be cleared
again, and so the change cookie will cause client cache invalidations,
but *only* for those "ctime changed in the same second _after_
somebody did a read".

 - any long-time caches of files that don't get modified are all fine,
because they will have those low bits zero and depend on just the
stable part of ctime that works across filesystems. So there should be
no nasty thundering herd issues on long-lived caches on lots of
clients if the server reboots, or atime updates every 24 hours or
anything like that.

and note that *NONE* of this requires any filesystem involvement
(except for the rule of "no atime changes ever impact ctime", which
may or may not already be true).

The filesystem does *not* know about that modification counter,
there's no new on-disk stable information.

It's entirely possible that I'm missing something obvious, but the
above sounds to me like the only time you'd have stale invalidations
is really the (unusual) case of having writes after cached reads, and
then a reboot.

We'd get rid of "inode_maybe_inc_iversion()" entirely, and instead
replace it with logic in inode_set_ctime_current() that basically does

 - if the stable part of ctime changes, clear the new 32-bit counter

 - if I_VERSION_QUERIED isn't set, clear the new 32-bit counter

 - otherwise, increment the new 32-bit counter

and then the STATX_CHANGE_COOKIE code basically just returns

   (stable part of ctime << 32) + new 32-bit counter

(and again, the "stable part of ctime" is either just tv_sec, or it's
"tv_sec >> 1" or whatever).

The above does not expose *any* changes to timestamps to users, and
should work across a wide variety of filesystems, without requiring
any special code from the filesystem itself.

And now please all jump on me and say "No, Linus, that won't work, because XYZ".

Because it is *entirely* possible that I missed something truly
fundamental, and the above is completely broken for some obvious
reason that I just didn't think of.

             Linus