linux-kernel - Unreliable 11-minute RTC sync


Message-ID: <20191127112011.GT2634@localhost>
Date:   Wed, 27 Nov 2019 12:20:11 +0100
From:   Miroslav Lichvar <mlichvar@...hat.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     John Stultz <john.stultz@...aro.org>,
        Prarit Bhargava <prarit@...hat.com>,
        linux-kernel@...r.kernel.org
Subject: Unreliable 11-minute RTC sync

When the system clock is synchronized (i.e. the STA_UNSYNC flag is
cleared by NTP/PTP), the kernel is expected to copy the system time to
the RTC every 11 minutes.

There are reports that it doesn't work. I checked some of my machines
and indeed some have their RTC off by more than a second. IIRC this
worked better a few years ago.

In order for the RTC to be set precisely the update needs to happen at
some fraction of the second (e.g. 0.5s on x86_64). Originally, the RTC
was set only if the update ran within one jiffy of its scheduled time.
Later this requirement was relaxed to 5 jiffies. It seems with current
kernels that rarely happens. The update seems to be consistently late
by tens of milliseconds, sometimes by hundreds of milliseconds. This
repeats every second until an update is on time with some luck.
Apparently, this may take days or longer.
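
To illustrate the timing check described above, here is a minimal Python
sketch. It is not the kernel's code. It assumes HZ=250 (a 4 ms jiffy), a
target of 0.5 s into the second, and a symmetric tolerance window; the
function name and its parameters are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def rtc_update_on_time(now_ns, target_ns=NSEC_PER_SEC // 2, hz=250,
                       tolerance_jiffies=5):
    """Return True if now_ns is close enough to the target fraction of
    the second for an RTC write to be accepted.

    Assumptions (not taken from the kernel source): hz=250 and a window
    that is symmetric around the target.
    """
    tick_ns = NSEC_PER_SEC // hz
    frac_ns = now_ns % NSEC_PER_SEC
    half = NSEC_PER_SEC // 2
    # Signed distance from the target, wrapped into [-0.5 s, +0.5 s).
    delta_ns = (frac_ns - target_ns + half) % NSEC_PER_SEC - half
    return abs(delta_ns) <= tolerance_jiffies * tick_ns


# 10 ms late is inside the 20 ms window, 30 ms late is not.
on_time = rtc_update_on_time(10 * NSEC_PER_SEC + 510_000_000)
too_late = rtc_update_on_time(10 * NSEC_PER_SEC + 530_000_000)
```

With these assumed values the window is 5 * 4 ms = 20 ms, so an update
that is consistently tens of milliseconds late is rejected on every
attempt.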

I'm not sure if workqueues changed how they behave, or they now have
more work to do, preventing the RTC update from running on time. I tried
switching to the non-power-efficient wq and also the high priority wq.
The former worked best in my tests, taking about 5 attempts on average
to make an update. I suspect that may be specific to this machine and
workload.

I'm not sure what would be the best fix.

Some ideas:
- relax the requirements on accuracy even more (e.g. 0.1 second)
- limit the number of retries (e.g. to 5) and force the update on the
  last one, no matter how inaccurate it is
- measure the scheduling delay and try to compensate for it
- randomize the requested delay
- switch to a timer
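
Two of the ideas above (capping the retries and compensating for the
scheduling delay) could look roughly like the following Python sketch.
It is only a model of the policy, not kernel code: the function names,
the 20 ms tolerance, the cap of 5 and the use of a simple mean are all
assumptions made for illustration.

```python
def compensated_delay_ms(nominal_ms, recent_lateness_ms):
    """Shorten the requested delay by the mean lateness seen recently.

    Hypothetical helper: uses an integer mean and never returns a
    negative delay.
    """
    if not recent_lateness_ms:
        return nominal_ms
    mean_late = sum(recent_lateness_ms) // len(recent_lateness_ms)
    return max(0, nominal_ms - mean_late)


def sync_with_retry_cap(lateness_ms, tolerance_ms=20, max_retries=5):
    """Walk through the observed lateness of successive attempts.

    Returns (attempt, error_ms, forced), or None if the observations
    run out before an update is made. The update is accepted when the
    lateness is within tolerance, and forced on the last allowed
    attempt no matter how inaccurate it is.
    """
    for attempt, late in enumerate(lateness_ms[:max_retries], start=1):
        if abs(late) <= tolerance_ms:
            return attempt, late, False
        if attempt == max_retries:
            return attempt, late, True
    return None


# Every attempt is late, so the fifth one is forced with a 45 ms error.
forced = sync_with_retry_cap([40, 35, 120, 60, 45, 3])
# Requested delay shrinks by the mean lateness of 50 ms.
next_delay = compensated_delay_ms(1000, [40, 50, 60])
```

The forced update trades accuracy for a bounded wait: in this model the
RTC is off by at most the lateness of the last attempt, instead of
waiting days for an on-time one.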

Suggestions?

-- 
Miroslav Lichvar