Message-ID: <0a6b88c0edd85a2ae0886e5454afea09cfcd3a24.camel@infradead.org>
Date: Fri, 07 Feb 2025 10:15:49 +0000
From: David Woodhouse <dwmw2@...radead.org>
To: Thomas Weißschuh <thomas.weissschuh@...utronix.de>
Cc: "James E.J. Bottomley" <James.Bottomley@...senpartnership.com>, Helge
 Deller <deller@....de>, Andy Lutomirski <luto@...nel.org>, Thomas Gleixner
 <tglx@...utronix.de>,  Vincenzo Frascino <vincenzo.frascino@....com>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>, Frederic Weisbecker
 <frederic@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, Catalin
 Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>, Theodore
 Ts'o <tytso@....edu>,  "Jason A. Donenfeld" <Jason@...c4.com>, Paul
 Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>,
 Albert Ou <aou@...s.berkeley.edu>, Huacai Chen <chenhuacai@...nel.org>,
 WANG Xuerui <kernel@...0n.name>, Russell King <linux@...linux.org.uk>,
 Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
 Alexander Gordeev <agordeev@...ux.ibm.com>, Christian Borntraeger
 <borntraeger@...ux.ibm.com>, Sven Schnelle <svens@...ux.ibm.com>, Thomas
 Bogendoerfer <tsbogend@...ha.franken.de>, Michael Ellerman
 <mpe@...erman.id.au>, Nicholas Piggin <npiggin@...il.com>, Christophe Leroy
 <christophe.leroy@...roup.eu>, Naveen N Rao <naveen@...nel.org>, Madhavan
 Srinivasan <maddy@...ux.ibm.com>, Ingo Molnar <mingo@...hat.com>, Borislav
 Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,  Arnd Bergmann
 <arnd@...db.de>, Guo Ren <guoren@...nel.org>, linux-parisc@...r.kernel.org,
  linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org, 
 linux-riscv@...ts.infradead.org, loongarch@...ts.linux.dev, 
 linux-s390@...r.kernel.org, linux-mips@...r.kernel.org, 
 linuxppc-dev@...ts.ozlabs.org, linux-arch@...r.kernel.org, Nam Cao
 <namcao@...utronix.de>, linux-csky@...r.kernel.org, "Ridoux, Julien"
 <ridouxj@...zon.com>, "Luu, Ryan" <rluu@...zon.com>, kvm
 <kvm@...r.kernel.org>
Subject: Re: [PATCH v3 00/18] vDSO: Introduce generic data storage

On Thu, 2025-02-06 at 11:59 +0100, Thomas Weißschuh wrote:
> On Thu, Feb 06, 2025 at 09:31:42AM +0000, David Woodhouse wrote:
> > On Tue, 2025-02-04 at 13:05 +0100, Thomas Weißschuh wrote:
> > > Currently each architecture defines the setup of the vDSO data page on
> > > its own, mostly through copy-and-paste from some other architecture.
> > > Extend the existing generic vDSO implementation to also provide generic
> > > data storage.
> > > This removes duplicated code and paves the way for further changes to
> > > the generic vDSO implementation without having to go through a lot of
> > > per-architecture changes.
> > > 
> > > Based on v6.14-rc1 and intended to be merged through the tip tree.
> 
> Note: The real answer will need to come from the timekeeping
> maintainers, my personal two cents below.
> 
> > Thanks for working on this. Is there a plan to expose the time data
> > directly to userspace in a form which is usable *other* than by
> > function calls which get the value of the clock at a given moment?
> 
> There are no current plans that I am aware of.
> 
> > For populating the vmclock device¹ we need to know the actual
> > relationship between the hardware counter (TSC, arch timer, etc.) and
> > real time in order to propagate that to the guest.
> > 
> > I see two options for doing this:
> > 
> >  1. Via userspace, exposing the vdso time data (and a notification when
> >     it changes?) and letting the userspace VMM populate the vmclock.
> >     This is complex for x86 because of TSC scaling; in fact userspace
> >     doesn't currently know the precise scaling from host to guest TSC
> >     so we'd have to be able to extract that from KVM.
> 
> Exposing the raw vdso time data is problematic as it precludes any
> evolution of its data structures, like the one we are currently doing.
> 
> An additional, trimmed down and stable data structure could be used.
> But I don't think it makes sense. The vDSO is all about a stable
> high-level function interface on top of an unstable data interface.
> However the vmclock needs the low-level data to populate its own
> data structure, so wrapping raw data access in function calls is
> unnecessary.
> If no functions are involved then the vDSO is not needed. The data can
> be maintained separately in any other place in the kernel and accessed
> or mapped by userspace from there.
> Also the vDSO does not have an active notification mechanism. This would
> probably be implemented through a file descriptor, but then the data
> can also be mapped through exactly that fd.
> 
> >  2. In kernel, asking KVM to populate the vmclock structure much like
> >     it does other pvclocks shared with the guest. KVM/x86 already uses
> >     pvclock_gtod_register_notifier() to hook changes; should we expand
> >     on that? The problem with that notifier is that it seems to be
> >     called far more frequently than I'd expect.
> 
> This sounds better, especially as any custom ABI from the host kernel to
> the VMM would look a lot like the vmclock structure anyways.
> 
> Timekeeper updates are indeed very frequent, but what are the concrete
> issues? That frequency is fine for regular vDSO data page updates,
> updating the vmclock data page should be very similar.
> The timekeeper core can pass context to the notifier callbacks, maybe
> this can be used to skip some expensive steps where possible.

In the context of a hypervisor with lots of guests running, that's a
lot of pointless steal time. But it isn't just that; I seem to recall
that the result was also *inaccurate*.

I need to go back and reproduce the testing, but I think the notifier
was constantly adjusting the apparent rate even with no changed inputs
from NTP. Where the number of clock counts per jiffy wasn't an integer,
the reported value would keep changing, for example reporting 333333
counts per jiffy most of the time, and occasionally 333334 counts for a
single jiffy before flipping back again. Or something like that.
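
To illustrate the kind of effect I mean, here is a minimal sketch. It is
not the kernel's timekeeping code; the function name, the counter rate
and the tick rate are all made up for the example. It only shows that
when the exact counts-per-tick value is not an integer, reporting whole
counts per tick with the remainder carried forward makes the reported
value flip between two adjacent integers even though the underlying
rate never changes.

```python
from fractions import Fraction
import math


def counts_per_tick(rate_hz, ticks_per_sec, n_ticks):
    """Return the whole counter increments reported for each tick.

    Hypothetical model: the exact increment per tick is
    rate_hz / ticks_per_sec. Only whole counts are reported, and the
    fractional remainder is carried into the next tick.
    """
    per_tick = Fraction(rate_hz) / ticks_per_sec
    carry = Fraction(0)
    reported = []
    for _ in range(n_ticks):
        carry += per_tick
        whole = math.floor(carry)  # whole counts visible this tick
        reported.append(whole)
        carry -= whole             # keep only the fractional remainder
    return reported


if __name__ == "__main__":
    # Assumed numbers: a 333333010 Hz counter and 1000 ticks per second,
    # i.e. exactly 333333.01 counts per tick.
    values = counts_per_tick(333_333_010, 1000, 100)
    print(sorted(set(values)), values.count(333334))
```

With those assumed numbers, 99 out of every 100 ticks report 333333
counts and one reports 333334, so a consumer that treats each report as
the current rate sees it change with no change in the inputs.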
