Message-ID: <9ee3b65480b227102c04272d2219f366c65a14f3.camel@kernel.org>
Date: Mon, 25 Sep 2023 07:22:56 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Amir Goldstein <amir73il@...il.com>
Cc: Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Jan Kara <jack@...e.cz>, "Darrick J. Wong" <djwong@...nel.org>
Subject: Re: [GIT PULL v2] timestamp fixes
On Sat, 2023-09-23 at 10:48 -0700, Linus Torvalds wrote:
> On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@...il.com> wrote:
> >
> > Apparently, they are willing to handle the "year 2486" issue ;)
>
> Well, we could certainly do the same at the VFS layer.
>
> But I suspect 10ns resolution is entirely overkill, since on a lot of
> platforms you don't even have timers with that resolution.
>
> I feel like 100ns is a much more reasonable resolution, and is quite
> close to a single system call (think "one thousand cycles at 10GHz").
>
> > But the resolution change is counter to the purpose of multigrain
> > timestamps - if two syscalls updated the same or two different inodes
> > within a 100ns tick, apparently, there are some workloads that
> > care to know about it and fs needs to store this information persistently.
>
> Those workloads are broken garbage, and we should *not* use that kind
> of sh*t to decide on VFS internals.
>
> Honestly, if the main reason for the multigrain resolution is
> something like that, I think we should forget about MG *entirely*.
> Somebody needs to be told to get their act together.
>
As I noted in the other thread, the primary reason for this was to fix
XFS's change cookie without having to rev the on-disk format. If we
could also present fine-grained timestamps to userland and nfsd, then
that would also fix a lot of cache-coherency problems with NFSv3, and
may also help some workloads which depend on comparing timestamps
between files. That'd be a wonderful bonus, but I'm not going to lose
too much sleep if we can't make that work.
> We have *never* guaranteed nanosecond resolution on timestamps, and I
> think we should put our foot down and say that we never will.
>
> Partly because we have platforms where that kind of timer resolution
> just does not exist.
>
> Partly because it's stupid to expect that kind of resolution anyway.
>
> And partly because any load that assumes that kind of resolution is
> already broken.
>
> End result: we should ABSOLUTELY NOT have as a target to support some
> insane resolution.
>
> 100ns resolution for file access times is - and I'll happily go down
> in history for saying this - enough for anybody.
>
> If you need finer resolution than that, you'd better do it yourself in
> user space.
>
> And no, this is not a "but some day we'll have terahertz CPU's and
> 100ns is an eternity". Moore's law is dead, we're not going to see
> terahertz CPUs, and people who say "but quantum" have bought into a
> technological fairytale.
>
> 100ns is plenty, and has the advantage of having a very safe range.
>
The catch here is that we have at least some testcases that set
specific values in the mtime and atime, and then check that those same
values are retrievable.
Are we OK with breaking those? If we can always say that the stored
resolution is X, and that even explicitly-set values get truncated to
it, then the v8 set I sent on Friday may be OK.
Of course, that set truncates the values at jiffies granularity (~4ms on
my box). That's well above 100ns, so it's possible that's too coarse for
us to handwave this problem away.
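As a sketch of the "always truncate, even explicitly-set values" approach (the helper name and the 100ns constant here are illustrative, not from any posted patch):

```c
#include <stdint.h>

#define FS_TS_GRAN_NSEC 100     /* illustrative floor granularity */

struct ts64 {
        int64_t tv_sec;
        long tv_nsec;
};

/*
 * Round tv_nsec down to a multiple of the granularity, so that a value
 * set explicitly (e.g. via utimensat()) and the value later read back
 * by stat() agree.
 */
static struct ts64 fs_ts_truncate(struct ts64 t)
{
        t.tv_nsec -= t.tv_nsec % FS_TS_GRAN_NSEC;
        return t;
}
```

With a jiffies granularity the same helper would just use a ~4ms constant, which is exactly what makes the explicit-set testcases notice.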
> That said, we don't have to do powers-of-ten. In fact, in many ways,
> it would probably be a good idea to think of the fractional seconds in
> powers of two. That tends to make it cheaper to do conversions,
> without having to do a full 64-bit divide (a constant divide turns
> into a fancy multiply, but it's still painful on 32-bit
> architectures).
>
> So, for example, we could easily make the format be a fixed-point
> format with "sign bit, 38 bit seconds, 25 bit fractional seconds",
> which gives us about 30ns resolution, and a range of almost 9000
> years. Which is nice, in how it covers all of written history and all
> four-digit years (we'd keep the 1970 base).
>
> And 30ns resolution really *is* pretty much the limit of a single
> system call. I could *wish* we had system calls that fast, or CPU's
> that fast. Not the case right now, and sadly doesn't seem to be the
> case in the forseeable future - if ever - either. It would be a really
> good problem to have.
>
> And the nice thing about that would be that conversion to timespec64
> would be fairly straightforward:
>
> struct timespec64 to_timespec(fstime_t fstime)
> {
>         struct timespec64 res;
>         unsigned int frac;
>
>         frac = fstime & 0x1ffffffu;
>         res.tv_sec = fstime >> 25;
>         res.tv_nsec = frac * 1000000000ull >> 25;
>         return res;
> }
>
> fstime_t to_fstime(struct timespec64 a)
> {
>         fstime_t sec = (fstime_t) a.tv_sec << 25;
>         unsigned int frac;
>
>         frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
>         return sec | frac;
> }
>
> and both of those generate good code (that large divide by a constant
> in to_fstime() is not great, but the compiler can turn it into a
> multiply).
>
> The above could be improved upon (nicer rounding and overflow
> handling, and a few modifications to generate even nicer code), but
> it's not horrendous as-is. On x86-64, to_timespec becomes a very
> reasonable
>
>         movq    %rdi, %rax
>         andl    $33554431, %edi
>         imulq   $1000000000, %rdi, %rdx
>         sarq    $25, %rax
>         shrq    $25, %rdx
>
> and to some degree that's the critical function (that code would show
> up in 'stat()').
>
> Of course, I might have screwed up the above conversion functions,
> they are untested garbage, but they look close enough to being in the
> right ballpark.
>
> Anyway, we really need to push back at any crazies who say "I want
> nanosecond resolution, because I'm special and my mother said so".
>
Yeah, if we're going to establish a floor granularity for timestamps
above 1ns, then making it a power-of-two factor would probably be a
good thing. These calculations are done a _lot_, so we really do want
them to be efficient.
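To illustrate why the power-of-two choice pays off: with 25 fractional bits as in your example, coarsening a timestamp to any power-of-two granularity is a single mask, where a decimal granularity would need a 64-bit divide. A sketch (fstime_t layout assumed from the code above):

```c
#include <stdint.h>

typedef int64_t fstime_t;   /* sign bit, 38-bit seconds, 25-bit fraction */

/*
 * Zero the low 'drop' fractional bits, truncating to a granularity of
 * 2^(drop - 25) seconds -- one AND instruction, no divide.
 */
static fstime_t fstime_coarsen(fstime_t t, unsigned int drop)
{
        return t & ~(((fstime_t)1 << drop) - 1);
}
```

The to_fstime()/to_timespec() conversions still need the divide-by-a-constant, but any internal granularity handling stays a mask-and-compare.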
--
Jeff Layton <jlayton@...nel.org>