Message-ID: <alpine.DEB.2.11.1506082030090.4133@nanos>
Date:	Mon, 8 Jun 2015 21:05:49 +0200 (CEST)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	John Stultz <john.stultz@...aro.org>
cc:	Peter Zijlstra <peterz@...radead.org>,
	Richard Cochran <richardcochran@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	lkml <linux-kernel@...r.kernel.org>,
	Prarit Bhargava <prarit@...hat.com>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Jan Kara <jack@...e.cz>, Jiri Bohac <jbohac@...e.cz>,
	Ingo Molnar <mingo@...hat.com>,
	Shuah Khan <shuahkh@....samsung.com>
Subject: Re: [RFC][PATCH 4/4] time: Do leapsecond adjustment in gettime
 fastpaths

On Mon, 8 Jun 2015, John Stultz wrote:
> On Sat, Jun 6, 2015 at 2:44 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
> > On Fri, 5 Jun 2015, Peter Zijlstra wrote:
> >
> >> On Fri, 2015-06-05 at 11:04 +0200, Richard Cochran wrote:
> >> > On Fri, Jun 05, 2015 at 09:29:13AM +0200, Peter Zijlstra wrote:
> >> > > That leaves the question; for who is this exact second edge important?
> >> >
> >> > Distributed applications using the UTC time scale.
> >> >
> >> > Many control applications are done with a 1 millisecond period.
> >> > Having the time wrong by a second for 10 or 100 loops is bad news.
> >>
> >> Firstly I would strongly suggest such applications not use UTC because
> >> of this, I think TAI was invented for just this reason.
> >>
> >> Secondly, how would John's patches help with this? Userspace loops
> >> reading time would be using the VDSO and would still not get the right
> >> time, and timers would be subject to the same IRQ latency that a hrtimer
> >> based leap second insert would, and would still very much not be in-sync
> >> across the cluster.
> >
> > So the only things which are fixed are in-kernel users and therefore
> > hrtimers.
> 
> Well, for vdso systems, hrtimers and adjtimex (which is the only
> interface that provides enough information to understand where in a
> leapsecond you actually are).
> 
> And again, vdsos are fixable, but I hesitated due to my concerns about
> the extra performance overhead and the smaller benefit it provides
> relative to not having timers expire early.

Right, and I'm concerned about the extra overhead of your patch. Just
look at the cache layout.

Current:

struct timekeeper {
       struct tk_read_base        tkr_mono;             /*     0    56 */
       struct tk_read_base        tkr_raw;              /*    56    56 */
       /* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
       u64                        xtime_sec;            /*   112     8 */
       long unsigned int          ktime_sec;            /*   120     8 */
       /* --- cacheline 2 boundary (128 bytes) --- */
       struct timespec            wall_to_monotonic;    /*   128    16 */
       ktime_t                    offs_real;            /*   144     8 */
       ktime_t                    offs_boot;            /*   152     8 */
       ktime_t                    offs_tai;             /*   160     8 */
       s32                        tai_offset;           /*   168     4 */
       unsigned int               clock_was_set_seq;    /*   172     4 */
       struct timespec            raw_time;             /*   176    16 */
       /* --- cacheline 3 boundary (192 bytes) --- */
       cycle_t                    cycle_interval;       /*   192     8 */
       u64                        xtime_interval;       /*   200     8 */
       s64                        xtime_remainder;      /*   208     8 */
       u32                        raw_interval;         /*   216     4 */

       /* XXX 4 bytes hole, try to pack */

       u64                        ntp_tick;             /*   224     8 */
       s64                        ntp_error;            /*   232     8 */
       u32                        ntp_error_shift;      /*   240     4 */
       u32                        ntp_err_mult;         /*   244     4 */
};

Four cachelines, where everything considered hotpath is in the first
two cache lines.

With your change this becomes:

struct timekeeper {
       struct tk_read_base        tkr_mono;             /*     0    56 */
       struct tk_read_base        tkr_raw;              /*    56    56 */
       /* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
       u64                        xtime_sec;            /*   112     8 */
       long unsigned int          ktime_sec;            /*   120     8 */
       /* --- cacheline 2 boundary (128 bytes) --- */
       struct timespec            wall_to_monotonic;    /*   128    16 */
       ktime_t                    offs_real;            /*   144     8 */
       ktime_t                    offs_boot;            /*   152     8 */
       ktime_t                    offs_tai;             /*   160     8 */
       s32                        tai_offset;           /*   168     4 */
       unsigned int               clock_was_set_seq;    /*   172     4 */
       struct timespec            raw_time;             /*   176    16 */
       /* --- cacheline 3 boundary (192 bytes) --- */
       cycle_t                    cycle_interval;       /*   192     8 */
       u64                        xtime_interval;       /*   200     8 */
       s64                        xtime_remainder;      /*   208     8 */
       u32                        raw_interval;         /*   216     4 */

       /* XXX 4 bytes hole, try to pack */

       u64                        ntp_tick;             /*   224     8 */
       s64                        ntp_error;            /*   232     8 */
       u32                        ntp_error_shift;      /*   240     4 */
       u32                        ntp_err_mult;         /*   244     4 */
       time64_t                   next_leap_sec;        /*   248     8 */
       /* --- cacheline 4 boundary (256 bytes) --- */
       ktime_t                    next_leap_ktime;      /*   256     8 */
       int                        leap_direction;       /*   264     4 */
};
  
That's 5 cache lines, but that's not the worst thing.

For every readout of CLOCK_REALTIME, which affects in-kernel callers
AND all systems which lack a VDSO, we have to load TWO more cache
lines.

We could mitigate that partially by moving the leap second stuff into
the 3rd cacheline, but you also add a conditional into every readout.
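For illustration, the kind of branch in question looks roughly like this
(a hypothetical simplification, not the actual patch; the field and
function names here are illustrative only):

```c
/* Hypothetical sketch of the extra conditional a leap-aware
 * CLOCK_REALTIME readout would carry: if we are past the cached
 * leap edge but the timekeeper update has not run yet, back out
 * the not-yet-committed leap second. NOT the real kernel code. */
#include <stdint.h>

typedef int64_t ktime_t;

struct tk_sketch {
	ktime_t offs_real;       /* mono -> realtime offset, ns       */
	ktime_t next_leap_ktime; /* realtime of the next leap edge    */
	int     leap_direction;  /* +1 insert, -1 delete, 0 none      */
};

static ktime_t read_realtime(const struct tk_sketch *tk, ktime_t mono_now)
{
	ktime_t now = mono_now + tk->offs_real;

	/* The branch added to every readout: it touches
	 * next_leap_ktime and leap_direction, i.e. the fields that
	 * landed in the extra cache lines of the layout above. */
	if (tk->leap_direction && now >= tk->next_leap_ktime)
		now -= (ktime_t)tk->leap_direction * 1000000000LL;

	return now;
}
```

Even when no leap second is pending, the readout must load and test
those fields, which is the constant per-call cost being objected to.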

Did you try to measure the impact of your changes with sys_gettime()
and VDSO disabled?

> > That means the whole leap mess added into the gettime fast path is
> > just constant overhead for that corner case.
> >
> > We can be smarter than that and just block hrtimer delivery for clock
> > realtime timers across the leap edge. There should be ways to do that
> > non-intrusively if we think hard enough about it.
> 
> This approach seems like it would add more work to the timer-add
> function (to check leapstate and adjust the expiration), which might
> be a heavier use case (we adjust for each timer) than the similar
> logic done in the update_base_offsets_now() at hrtimer_interrupt time
> (adjust for each hrtimer_interrupt).

No, certainly not in the timer add function. How should that work at
all? If we arm the timer to expire 24 hours from now, how should we
know about the leap state at expiry time?

> Now, it could be possible to do a lighter weight version of my patch,
> which just does the adjustment only for the hrtimer_interrupt code
> (leaving the rest of the read paths alone).

Yes, that should work. As long as I can keep the cached values in the
hrtimer cpu bases and the whole thing keeps the clock_was_set_seq
logic intact.

If we do not do the conditional version, then on every hrtimer
interrupt we write THREE cachelines for nothing.

And if we cannot cache the offsets, then we end up calling into the
timekeeping code for every timer which is not CLOCK_MONOTONIC based
and retrieve the offset. That hurts especially on 32bit machines,
because we need to protect the readout with the timekeeper sequence
counter.
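The reason 32-bit machines hurt is the retry loop that sequence-counter
protection requires; a minimal single-writer sketch (memory barriers
omitted for brevity, names illustrative, not the kernel's seqlock
implementation):

```c
/* Minimal sketch of a seqcount-protected readout: a 32-bit CPU
 * cannot load a 64-bit offset atomically, so the reader retries
 * whenever the writer's sequence count was odd (write in progress)
 * or changed during the read. Single writer assumed; the real
 * kernel code additionally uses memory barriers. */
#include <stdint.h>

struct tk_seq {
	unsigned seq;      /* even: stable, odd: write in progress     */
	uint32_t offs_hi;  /* 64-bit offset split into two words, as a */
	uint32_t offs_lo;  /* 32-bit CPU would store it                */
};

static void write_offset(struct tk_seq *tk, uint64_t offs)
{
	tk->seq++;                       /* now odd: readers retry    */
	tk->offs_hi = (uint32_t)(offs >> 32);
	tk->offs_lo = (uint32_t)offs;
	tk->seq++;                       /* even again: value stable  */
}

static uint64_t read_offset(const struct tk_seq *tk)
{
	unsigned start;
	uint64_t offs;

	do {
		start = tk->seq;         /* snapshot sequence count   */
		offs  = ((uint64_t)tk->offs_hi << 32) | tk->offs_lo;
	} while ((start & 1) || start != tk->seq); /* torn? retry     */

	return offs;
}
```

Every non-cached offset retrieval pays for this loop, which is why
keeping the cached offsets in the hrtimer cpu bases matters.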

> If that is something you'd prefer, I'll have to think a bit to
> ensure the internal inconsistency isn't otherwise problematic.
>
> Though I suspect fixing adjtimex is worth it as well, since it's really
> the only interface that can provide a sane view of the leapsecond, and
> isn't normally considered a hot path.

I'm not worried about adjtimex at all. You can do whatever you want
there, but please leave the fast paths alone.

Thanks,

	tglx

