linux-kernel - Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANDhNCoueki=keYNcNr4eXqgLFPh3VupDJC0hFqxm4FNKfGzYg@mail.gmail.com>
Date: Fri, 14 Mar 2025 15:50:53 -0700
From: John Stultz <jstultz@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Lei Chen <lei.chen@...rtx.com>, Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE

On Fri, Mar 14, 2025 at 12:41 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Mon, Mar 10 2025 at 11:00, Lei Chen wrote:
> > To fix this issue, the patch accumulates offset into tk, and export
> > N(P2) to real tk and vdso.
> >
> > tk.tkr_mono := N(P2) = N(P1) + offset * M1
> >
> > Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> > N(P3) := N(P2) + clock_delta * M2
>
> Your analysis is mostly correct, but it is only correct versus the
> coarse timekeeper.
>
> Moving everything forward to P2 breaks the periodic accumulation and
> therefore NTP. Monitoring NTP/chrony immediately shows that they are
> behaving differently and do not really converge.
>
> The real problem is that the introduction of the coarse time accessors
> completely missed the fact, that xtime_nsec is adjusted by multiplier
> changes. This went unnoticed for about 15 years :)
>
> So the obvious cure is to leave the accumulation logic alone and to
> provide a seperate coarse_nsec instance, which compensates for the
> offset.
>
> The offset contains the reminder of the periodic accumulation from the
> point where timekeeping_advance() read the clocksource at T1.
>
> At the point of readout T1 nanoseconds have been:
>
>      T1nsec[OLD] = xtime_sec[OLD] * NSEC_PER_SEC +
>                    (xtime_nsec[OLD] + offset * mult[OLD]) >> shift;
>
> After the accumulation and eventual multiplier update that becomes:
>
>      T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
>                    (xtime_nsec[NEW] + offset * mult[NEW]) >> shift;
>
> If the unmodified accumulation math is correct then:
>
>      T1nsec[OLD] == T1nsec[NEW]
>
> The patch below implements exactly that and survives lightweight testing
> where neither in kernel nor in userspace these time jumps are
> observable anymore.
>
> It needs quite some eyeballs, testing and validation.
>

Hey Thomas,
 Nice work chasing this down. The approach looks ok, but I'm wary of
adding more state to manage.

Since this issue apparently only happens via do_adjtimex() calling
timekeeping_advance(TK_ADV_FREQ) resulting in an mult adjustment being
made without any accumulation before it, the approach I'm testing is:
in that case, to do ~timekeeping_forward_now() to accumulate the small
offset before doing the mult adjustment. I've got to move a little bit
of code around to do this cleanly, but I'll send a patch out here
shortly.

thanks
-john