Message-ID: <CALCETrVDsjkQkdmMtbGz7fP053VZo6O1UWtxZpg8wwa7jWVU7w@mail.gmail.com>
Date: Mon, 17 Sep 2012 17:43:26 -0700
From: Andy Lutomirski <luto@...capital.net>
To: John Stultz <john.stultz@...aro.org>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
Tony Luck <tony.luck@...el.com>,
Paul Mackerras <paulus@...ba.org>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Paul Turner <pjt@...gle.com>,
Steven Rostedt <rostedt@...dmis.org>,
Richard Cochran <richardcochran@...il.com>,
Prarit Bhargava <prarit@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding
issue in timekeeping core
On Mon, Sep 17, 2012 at 5:20 PM, John Stultz <john.stultz@...aro.org> wrote:
> On 09/17/2012 04:49 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 17, 2012 at 3:04 PM, John Stultz <john.stultz@...aro.org>
>> wrote:
>>>
>>> Anyway, I'd greatly appreciate any thoughts or feedback on this
>>> approach.
>>
>> I haven't looked in any great detail, but the approach looks sensible
>> and should slow down the vsyscall code.
>
>
> Did you mean "shouldn't"? Or are you concerned about something
> specifically?
>
> I know you've done quite a bit of tuning on the x86 vdso side, and I don't
> want to wreck that, so I'd appreciate any specific thoughts you have if you
> get a chance to look at it.
I meant "shouldn't". The thing I use for testing is:
https://gitorious.org/linux-test-utils/linux-clock-tests
>
>
>> That being said, as long as you're playing with this, here are a
>> couple thoughts:
>>
>> 1. The TSC-reading code does this:
>>
>> ret = (cycle_t)vget_cycles();
>>
>> last = VVAR(vsyscall_gtod_data).clock.cycle_last;
>>
>> if (likely(ret >= last))
>>         return ret;
>>
>> I haven't specifically benchmarked the cost of that branch, but I
>> suspect it's a fairly large fraction of the total cost of
>> vclock_gettime. IIUC, the point is that there might be a few cycles'
>> worth of clock skew even on systems with otherwise usable TSCs, and we
>> don't want a different CPU to return complete garbage if the cycle
>> count is just below cycle_last.
>>
>> A different formulation would avoid the problem: set cycle_last to,
>> say, 100ms *before* the time of the last update_vsyscall, and adjust
>> the wall_time, etc variables accordingly. That way a few cycles (or
>> anything up to 100ms) of skew won't cause an overflow. Then you could
>> kill that branch.
>
> Interesting. So I want to keep the scope of this patch set in check, so I'd
> probably hold off on something like this till later, but this might not be
> too complicated to do in the update_wall_time() function, basically delaying
> the accumulation by some amount of time. Although this would have odd
> effects on things like filesystem timestamps, which provide "the time at the
> last tick", which would then be "the time at the last tick + Xms". So it
> probably needs more careful consideration.
Fair enough. One way to do this cleanly might be to have helpers that
effectively read CLOCK_REALTIME_COARSE and use those instead of xtime.
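To make the point-1 suggestion a bit more concrete, here's a rough
userspace sketch of the arithmetic (the struct and field names are made
up for illustration, not the actual vsyscall_gtod_data layout): if
cycle_last were stored backdated by some margin, with the stored wall
time adjusted to match, the ret >= last guard could go away, because a
few cycles of inter-CPU skew can no longer make the delta go negative.

#include <stdint.h>

/* Illustrative only -- not the real vsyscall_gtod_data layout. */
struct gtod_sketch {
        uint64_t cycle_last;      /* TSC value, backdated by ~100ms of cycles */
        uint64_t wall_time_snsec; /* shifted nanoseconds at cycle_last        */
        uint32_t mult;            /* cycles -> shifted ns                     */
        uint32_t shift;
};

static uint64_t sketch_vgetns(const struct gtod_sketch *g, uint64_t tsc_now)
{
        /*
         * No "tsc_now >= cycle_last" branch: because cycle_last is
         * backdated, tsc_now - cycle_last stays positive even if this
         * CPU's TSC lags the updater's by a few cycles.  (Whole seconds
         * would be handled separately, as today.)
         */
        return (g->wall_time_snsec +
                (tsc_now - g->cycle_last) * g->mult) >> g->shift;
}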
>
>
>
>
>> 2. There's nothing vsyscall-specific about the code in
>> vclock_gettime.c. In fact, the VVAR macro should work just fine in
>> kernel code. If you moved all this code into a header, then in-kernel
>> uses could use it, and maybe even other arches could use it. Last
>> time I checked, it seemed like vclock_gettime was considerably faster
>> than whatever the in-kernel equivalent did.
>
> I like the idea of unifying the implementations, but I'd want to know more
> about why vclock_gettime was faster than the in-kernel getnstimeofday(),
> since it might be due to the more limited locking (we only update vsyscall
> data under the vsyscall lock, whereas the timekeeper lock is held for the
> entire execution of update_wall_time()), or because some of the
> optimizations in the vsyscall code are focused on providing timespecs to
> userland, whereas in-kernel we also have to provide ktime_t values.
>
> If you have any details here, I'd love to hear more. There is some work I
> have planned to address some of these differences, but I'm not sure when
> I'll eventually get to it.
>
I suspect it's because everyone who's benchmarked and tuned the "what
time is it?" code has focused on the vdso version, so the in-kernel
version hasn't gotten much love. The vclock_gettime code has
basically no abstractions and uses only direct (i.e. explicitly
devirtualized) calls. ktime_get, OTOH, uses clock->read, which, in
the best case, is read_tsc. read_tsc (in arch/x86/kernel/tsc.c) is
missing the don't-use-cmov optimization.
The vclock_gettime code also produces a struct timespec without
div/mod, which is probably impossible for any code that goes through
ktime.
In the other direction, AFAICS the in-kernel code is completely
missing the rdtsc_barrier call, which is empirically rather necessary.
It's pretty easy to detect clock skew without it.
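For reference, a rough userspace approximation of those two pieces (the
ordered TSC read and a div-free timespec split); the asm and helper
names here are illustrative, not lifted from the tree, and the lfence
just stands in for whatever rdtsc_barrier() resolves to on a given CPU:

#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* x86-64 only; a stand-in for rdtsc_barrier() + vget_cycles(). */
static inline uint64_t read_tsc_ordered(void)
{
        uint64_t lo, hi;

        asm volatile("lfence" ::: "memory");    /* keep the TSC read ordered */
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return (hi << 32) | lo;
}

/*
 * Split accumulated nanoseconds into a timespec without a div/mod
 * instruction; the excess over one second is small, so the loop runs
 * at most a few iterations.
 */
static void ns_to_timespec_nodiv(uint64_t sec, uint64_t ns,
                                 struct timespec *ts)
{
        while (ns >= NSEC_PER_SEC) {
                ns -= NSEC_PER_SEC;
                sec++;
        }
        ts->tv_sec = sec;
        ts->tv_nsec = ns;
}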
>
>
>> 3. The mul_u64_u32_shr function [1] might show up soon, and it would
>> potentially allow much longer intervals between timekeeping updates.
>> I'm not sure whether the sub-ns formulation would still work, though (I
>> suspect it would with some care).
>>
> Yea, we already have a number of calculations to try to maximize the
> interval while still allowing for fine-tuned adjustments. Even so, we are
> somewhat bound by the need to provide tick-granular timestamps for
> filesystem code, etc. So it may be limited to extending idle times out
> until folks are ok with either going with more expensive clock-granular
> timestamps for filesystem code, or loosening the granularity needs somewhat.
Maybe this could depend on which clocksource is in use. On recent
Intel boxen, the entire vclock_gettime call takes under 20 ns. When
called from the kernel, it could be even faster if static keys (or
whatever they're called these days) were used.
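Roughly what I have in mind for the in-kernel case (sketch only; the two
helpers are hypothetical names, not existing kernel functions):

#include <linux/jump_label.h>
#include <linux/types.h>

/* Hypothetical helpers for illustration: */
extern u64 tsc_fast_ns(void);   /* devirtualized, vdso-style TSC path */
extern u64 slow_path_ns(void);  /* today's clock->read() based path   */

/* Flipped on when the TSC is selected as the clocksource. */
static struct static_key clocksource_is_tsc = STATIC_KEY_INIT_FALSE;

u64 fast_monotonic_ns(void)
{
        /* Default-off branch; patched to a direct jump once enabled. */
        if (static_key_false(&clocksource_is_tsc))
                return tsc_fast_ns();
        return slow_path_ns();
}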
FWIW, I don't understand the NTP code at all. I'm pretty sure I know
what a PLL is, and I don't understand how that has anything to do with
NTP (although it makes sense in a PPS context). I've always wondered
why the kernel's idea of time was any more complicated than "here's
the clock frequency; in addition, slew the time by X ns (or sub-ns)
over the next Y ns" -- some driver or even userspace code could
program that and the kernel could wake up when it needed to for the
update. (To make this seamless, vclock_gettime would either need a
branch or the kernel would need to wake up a little early and update
the vsyscall data.)
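The amount of data such a model would need to export is tiny --
something like this (all field names hypothetical, just to show the
shape of it):

#include <linux/types.h>

/* Sketch of a "frequency plus bounded slew" time export. */
struct simple_time_data {
        u64 base_cycles;        /* clocksource value at the last update */
        u64 base_ns;            /* wall-clock ns corresponding to it    */
        u32 mult;               /* cycles -> ns, fixed point            */
        u32 shift;
        s64 slew_ns;            /* X: total adjustment still to apply   */
        u64 slew_period_ns;     /* Y: window over which to spread it    */
};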
--Andy