Date:   Thu, 25 May 2017 22:03:22 +1000
From:   Paul Mackerras <paulus@...abs.org>
To:     John Stultz <john.stultz@...aro.org>
Cc:     Michael Ellerman <mpe@...erman.id.au>,
        lkml <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Miroslav Lichvar <mlichvar@...hat.com>,
        Richard Cochran <richardcochran@...il.com>,
        Prarit Bhargava <prarit@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Anton Blanchard <anton@...ba.org>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Tony Luck <tony.luck@...el.com>,
        Fenghua Yu <fenghua.yu@...el.com>
Subject: Re: [RFC][PATCH] time: Add warning about imminent deprecation of
 CONFIG_GENERIC_TIME_VSYSCALL_OLD

On Mon, May 22, 2017 at 12:06:04PM -0700, John Stultz wrote:
> 
> Basically long ago, timekeeping was handled (roughly) like:
> 
> clock_gettime():
>     now = tk->clock->read()
>     offset_ns = ((now - tk->cycle_last) * tk->clock->mult) >> tk->clock->shift;
>     return timespec_add_ns(tk->xtime, offset_ns);
> 
> But since the error handling uses sub-ns precision, and since for
> update performance we accumulate in fixed intervals, there are
> situations where, during the update, we could accumulate half of a
> nanosecond into the base tk->xtime value while leaving half of a
> nanosecond in the offset.  The math above then truncated out that
> split nanosecond, causing 1ns discontinuities.
> 
> So to address this, we came up with sort of a hack which, when we
> accumulate, rounds up that partial nanosecond and adds the amount we
> rounded up to the error (which will cause the freq correction code to
> slow the clock down slightly). This is the code that is now done in
> the old_vsyscall_fixup() logic.
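A hedged sketch of that round-up fixup (field names are simplified stand-ins, not the kernel's exact old_vsyscall_fixup()): any partial nanosecond left in the shifted accumulator is rounded up into the base, and the amount added is charged to the NTP error so the frequency correction steers the clock back down:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the timekeeper state. */
struct tk_sketch {
    uint64_t xtime_nsec;   /* accumulated nanoseconds << shift */
    int64_t  ntp_error;    /* error consumed by freq correction */
    uint32_t shift;
};

/* Round any sub-ns remainder up to a whole nanosecond, and record
 * how far the clock was pushed ahead so it can be steered back. */
static void round_up_partial_ns(struct tk_sketch *tk)
{
    uint64_t rem = tk->xtime_nsec & ((1ULL << tk->shift) - 1);

    if (rem) {
        uint64_t added = (1ULL << tk->shift) - rem;

        tk->xtime_nsec += added;     /* base now holds whole ns */
        tk->ntp_error  -= added;     /* clock ran fast by 'added' */
    }
}
```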
> 
> Unfortunately this fix (which generates up to a nanosecond of error
> per tick) then made the freq correction code do more work and made it
> more difficult to have a stable clock.
> 
> So we went for a more proper fix, which was to properly handle the
> sub-nanosecond portion of the timekeeping throughout the logic, doing
> the truncation last.
> 
> clock_gettime():
>     now = tk->clock->read()
>     ret.tv_sec = tk->xtime_sec;
>     offset_sns = (now - tk->cycle_last) * tk->clock->mult;
>     ret.tv_nsec = (offset_sns + tk->tkr_mono.xtime_nsec) >> tk->clock->shift;
>     return ret;
> 
> So in the above, we now use the tk->tkr_mono.xtime_nsec (which despite
> its unfortunate name, stores the accumulated shifted nanoseconds), and
> add it to the (current_cycle_delta * clock->mult), then we do the
> shift last to preserve as much precision as we can.
> 
> Unfortunately we need all the reader code to do the same, which wasn't
> easy to transition in some cases. So we provided the
> CONFIG_GENERIC_TIME_VSYSCALL_OLD option to preserve the old round-up
> behavior while arch maintainers could make the transition.

The VDSO code on PPC computes the offset in units of 2^-32 seconds,
not nanoseconds, because that makes it easy to handle the split of the
offset into whole seconds and fractional seconds (which is handled in
the generic code by the slightly icky __iter_div_u64_rem function),
and also means that we can use PPC's instruction that computes
(a * b) >> 32 to convert the fractional part to either nanoseconds or
microseconds without doing a division.

I could pretty easily change the computations done at update_vsyscall
time to convert the tk->tkr_mono.xtime_nsec value to units of 2^-32
seconds for use by the VDSO.  That would mean we would no longer need
CONFIG_GENERIC_TIME_VSYSCALL_OLD, and would give us values returned by
the VDSO gettimeofday() and clock_gettime() that should be within
about 1/4 ns of what the generic code in the kernel would give (on
average, I mean, given that the results have at best nanosecond
resolution).  Since that corresponds to about 1 CPU clock cycle, it
seems like it should be good enough.
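A sketch of that update_vsyscall-time conversion (function name hypothetical): since tk->tkr_mono.xtime_nsec holds nanoseconds << shift, one division per timekeeping update yields the 2^-32 s fraction, and truncating to that unit costs at most 2^-32 s, about 0.233 ns, consistent with the "about 1/4 ns" figure:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert a shifted-nanosecond value (ns << shift) to units of
 * 2^-32 seconds: value / 2^shift / 1e9 * 2^32, done as a single
 * divide. Assumes shift <= 32 and less than one second of
 * accumulated nanoseconds, so the intermediate fits in 64 bits. */
static uint64_t xtime_nsec_to_frac32(uint64_t xtime_nsec, uint32_t shift)
{
    return (xtime_nsec << (32 - shift)) / NSEC_PER_SEC;
}
```

For example, half a second of accumulated time (500000000 << shift) maps to exactly 0x80000000, i.e. 0.5 in 2^-32 s units.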

Alternatively I could make the VDSO computations use a smaller unit
(maybe 2^-36 or 2^-40 seconds), or else rewrite them to use exactly
the same algorithm as the generic code - which would be a bigger
change, and would mean having to do an iterative division.

So, do you think the 1/4 ns resolution is good enough for the VDSO
computations?

Paul.
