linux-kernel - Re: [PATCH v3 00/18] vDSO: Introduce generic data storage


Message-ID: <8e9fb0c37ae4a3f60b09b8da5d39dbf909ec038e.camel@infradead.org>
Date: Fri, 14 Feb 2025 12:04:38 +0000
From: David Woodhouse <dwmw2@...radead.org>
To: Thomas Gleixner <tglx@...utronix.de>, Thomas
 Weißschuh <thomas.weissschuh@...utronix.de>, "James E.J.
 Bottomley" <James.Bottomley@...senPartnership.com>, Helge Deller
 <deller@....de>, Andy Lutomirski <luto@...nel.org>, Vincenzo Frascino
 <vincenzo.frascino@....com>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
 Frederic Weisbecker <frederic@...nel.org>,  Andrew Morton
 <akpm@...ux-foundation.org>, Catalin Marinas <catalin.marinas@....com>,
 Will Deacon <will@...nel.org>, Theodore Ts'o <tytso@....edu>, "Jason A.
 Donenfeld" <Jason@...c4.com>, Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>,
 Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>,
 Russell King <linux@...linux.org.uk>, Heiko Carstens <hca@...ux.ibm.com>,
 Vasily Gorbik <gor@...ux.ibm.com>, Alexander Gordeev
 <agordeev@...ux.ibm.com>, Christian Borntraeger
 <borntraeger@...ux.ibm.com>, Sven Schnelle <svens@...ux.ibm.com>, Thomas
 Bogendoerfer <tsbogend@...ha.franken.de>, Michael Ellerman
 <mpe@...erman.id.au>, Nicholas Piggin <npiggin@...il.com>, Christophe Leroy
 <christophe.leroy@...roup.eu>, Naveen N Rao <naveen@...nel.org>, Madhavan
 Srinivasan <maddy@...ux.ibm.com>, Ingo Molnar <mingo@...hat.com>, Borislav
 Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,  Arnd Bergmann
 <arnd@...db.de>, Guo Ren <guoren@...nel.org>
Cc: linux-parisc@...r.kernel.org, linux-kernel@...r.kernel.org, 
 linux-arm-kernel@...ts.infradead.org, linux-riscv@...ts.infradead.org, 
 loongarch@...ts.linux.dev, linux-s390@...r.kernel.org, 
 linux-mips@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org, 
 linux-arch@...r.kernel.org, Nam Cao <namcao@...utronix.de>, 
 linux-csky@...r.kernel.org, "Ridoux, Julien" <ridouxj@...zon.com>, "Luu,
 Ryan" <rluu@...zon.com>, kvm <kvm@...r.kernel.org>
Subject: Re: [PATCH v3 00/18] vDSO: Introduce generic data storage

On Fri, 2025-02-14 at 12:34 +0100, Thomas Gleixner wrote:
> >  2. In kernel, asking KVM to populate the vmclock structure much like
> >     it does other pvclocks shared with the guest. KVM/x86 already uses
> >     pvclock_gtod_register_notifier() to hook changes; should we expand
> >     on that? The problem with that notifier is that it seems to be
> >     called far more frequently than I'd expect.
> 
> It's called once per tick to expose the continuous updates to the
> conversion factors and related internal data.

My recollection (a vague one) is that it's called, and reports
"changes", even when there *are* no changes to underlying conversion
factors. Something along the lines of "N ticks at 333 counts per tick,
then one tick at 334 counts per tick to catch up" because it can't
express the division factor completely without that discontinuity?
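
To make the "333, 333, 334" pattern concrete, here is a minimal sketch of how an integer-only per-tick count produces a periodic apparent rate change even though the underlying rate is constant. It is illustrative only: the function names and the 1000-counts-per-3-ticks figure are assumptions for the example, not the kernel's timekeeping code.

```python
def spread_counts(total_counts: int, ticks: int) -> list[int]:
    """Split total_counts across ticks using only integer per-tick values.

    The fractional part is carried in an accumulator; whenever it reaches a
    whole count, that tick receives one extra count to catch up.
    """
    base, remainder = divmod(total_counts, ticks)
    acc = 0
    per_tick = []
    for _ in range(ticks):
        acc += remainder
        if acc >= ticks:
            acc -= ticks
            per_tick.append(base + 1)  # catch-up tick
        else:
            per_tick.append(base)
    return per_tick


def count_rate_changes(per_tick: list[int]) -> int:
    """Count how often the per-tick value differs from the previous tick."""
    return sum(1 for a, b in zip(per_tick, per_tick[1:]) if a != b)


# Hypothetical numbers: 1000 counts must be spread over 3 ticks.
pattern = spread_counts(1000, 3)
print(pattern, sum(pattern))
```

A consumer that treats every difference between consecutive ticks as a "change" would see updates on most ticks here, although the long-term rate never moved.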

The actual 'error' caused by the apparent fluctuation in rate is
probably entirely negligible, but I am slightly concerned about the
steal time, if the hypervisor then spends stolen CPU time relaying all
those "changes" to the guest, and then the guest has to spend time
feeding the "changes" into its own timekeeping.

I'd like to strive for a mode where we only adjust what we tell guests,
when adjtimex actually changes the real timing factors.
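
A rough sketch of that mode, assuming the "real timing factors" can be captured in a small comparable record: remember what was last published and drop any tick whose parameters are identical. `ClockParams`, `ChangeFilter` and the `mult`/`shift` fields are hypothetical names for illustration; this is not the pvclock_gtod notifier interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ClockParams:
    """Assumed stand-in for the conversion factors a guest would consume."""
    mult: int
    shift: int


class ChangeFilter:
    """Forward parameters to the guest only when they actually differ."""

    def __init__(self, publish: Callable[[ClockParams], None]) -> None:
        self._publish = publish
        self._last: Optional[ClockParams] = None

    def on_tick(self, params: ClockParams) -> bool:
        """Return True if an update was published, False if suppressed."""
        if params == self._last:
            return False
        self._last = params
        self._publish(params)
        return True


published: list[ClockParams] = []
flt = ChangeFilter(published.append)
for _ in range(1000):                 # a thousand ticks, same factors
    flt.on_tick(ClockParams(mult=5592405, shift=24))
flt.on_tick(ClockParams(mult=5592406, shift=24))  # e.g. after an adjtimex
print(len(published))                 # two updates reach the guest
```

The filter only helps if the per-tick catch-up noise is kept out of the compared fields, which is the open question above.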

In fact if we have a userspace tool like chrony feeding adjtimex based
on external NTP/PPS/whatever, that tool could probably calibrate a
stable host TSC directly against the external real time. And in that
mode maybe we don't even need to feed the guest from the kernel's
CLOCK_REALTIME; that would be just another conversion step to introduce
noise.

We might end up with the direct setup for dedicated hosting
environments, but I do also want to support the general-purpose QEMU-
based setup where we expose the host's CLOCK_REALTIME as efficiently as
possible.

How about this: A KVM feature to provide/populate the VMCLOCK, since
only KVM knows the precise TSC scaling (and can immediately flip the
VMCLOCK to report invalid state if the TSC becomes unreliable).

It can *either* be fed the precise TSC/realtime relationship from
userspace (maybe in a vmclock structure that *userspace* populates, so
all the kernel has to do is scale/offset to account for the guest TSC
being different from the host TSC).
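
As a sketch of the "scale/offset" step, assuming the guest TSC is derived from the host TSC by a fixed-point multiply followed by an offset: the kernel would take the reference point that userspace supplied in host TSC terms and re-express it in guest TSC terms. The 48 fractional bits, the function names and the tuple layout are all assumptions for illustration, not a proposed ABI.

```python
FRAC_BITS = 48  # assumed number of fractional bits in the scaling ratio


def make_ratio(guest_khz: int, host_khz: int, frac_bits: int = FRAC_BITS) -> int:
    """Fixed-point ratio guest_khz / host_khz (truncated)."""
    return (guest_khz << frac_bits) // host_khz


def scale_tsc(host_tsc: int, ratio: int, offset: int,
              frac_bits: int = FRAC_BITS) -> int:
    """Convert a host TSC reading into the value the guest would observe."""
    return ((host_tsc * ratio) >> frac_bits) + offset


def translate_reference(host_tsc_ref: int, realtime_ns: int,
                        ratio: int, offset: int) -> tuple[int, int]:
    """Re-express a (host TSC, realtime) reference point for the guest.

    The wall-clock value is unchanged; only the counter value it is paired
    with moves into the guest's TSC domain.
    """
    return scale_tsc(host_tsc_ref, ratio, offset), realtime_ns


# Hypothetical guest running at half the host TSC frequency.
ratio = make_ratio(guest_khz=1_500_000, host_khz=3_000_000)
print(translate_reference(1000, 1_739_534_678_000_000_000, ratio, offset=-10))
```

The period (time per counter tick) would need the inverse scaling applied as well; that is omitted here to keep the sketch short.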

Or it can be in 'automatic' mode, where it derives the relationship
from the host's timekeeping. At the moment that would produce "too
many" updates for my liking, but we can worry about that later if
necessary.
