lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID:
 <SN6PR02MB4157A85EC0B1B2D45CB611FAD4F02@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Sat, 8 Feb 2025 18:03:35 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Sean Christopherson <seanjc@...gle.com>, Thomas Gleixner
	<tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov
	<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "x86@...nel.org"
	<x86@...nel.org>, "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Juergen Gross <jgross@...e.com>, "K. Y. Srinivasan" <kys@...rosoft.com>,
	Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, Dexuan
 Cui <decui@...rosoft.com>, Ajay Kaher <ajay.kaher@...adcom.com>, Jan Kiszka
	<jan.kiszka@...mens.com>, Paolo Bonzini <pbonzini@...hat.com>, Andy
 Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>,
	"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "xen-devel@...ts.xenproject.org"
	<xen-devel@...ts.xenproject.org>, Nikunj A Dadhania <nikunj@....com>, Tom
 Lendacky <thomas.lendacky@....com>
Subject: RE: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's
 constant and non-stop

From: Sean Christopherson <seanjc@...gle.com> Sent: Friday, February 7, 2025 9:23 AM
> 
> Dropping a few people/lists whose emails are bouncing.
> 
> On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> >  #ifdef CONFIG_X86_LOCAL_APIC
> >  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> >  #endif
> > +	/*
> > +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> > +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> > +	 * on these hooks to re-enable kvmclock after suspend+resume.
> 
> This is wrong, wallclock is a different MSR entirely.
> 
> > +	 */
> >  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> >  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> 
> And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
> tsc_{save,restore}_sched_clock_state() results in time going haywire.
> 
> Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> under kvmclock leads to the same problem.
> 
> The whole PV sched_clock scheme is a disaster.
> 
> Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> sched_clock.  And the code is all kinds of funky, because it tries to keep the
> x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> means that Hyper-V overides the sched_clock save/restore hooks even when
> PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.

Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
it's not built if targeting some other architecture. Or do you see something else
that is x86-specific?

Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
PARAVIRT is forced to "n". So I don't think there's a current problem with the
sched_clock save/restore hooks. But I would be good with some restructuring
so that setting the sched clock save/restore hooks is more closely tied to the
sched clock choice, as long as the architecture independence of hyperv_timer.c
is preserved. And maybe there's a better way to handle hv_setup_sched_clock()
that is less messy with the #ifdef's. I'll think about that too.

Michael

> 
> VMware appears to be buggy and doesn't do have offset adjustments, and also lets
> the TSC callbacks run.
> 
> I can't tell if Xen is broken, or if it's the sanest of the bunch.  Xen does
> save/restore things a la kvmclock, but only in the Xen PV suspend path.  So if
> the "normal" suspend/hibernate paths are unreachable, Xen is sane.  If not, Xen
> is quite broken.
> 
> To make matters worse, kvmclock is a mess and has existing bugs.  The BSP's clock
> is disabled during syscore_suspend() (via kvm_suspend()), but only re-enabled in
> the sched_clock callback.  So if suspend is aborted due to a late wakeup, the BSP
> will run without its clock enabled, which "works" only because KVM-the-hypervisor
> is kind enough to not clobber the shared memory when the clock is disabled.  But
> over time, I would expect time on the BSP to drift from APs.
> 
> And then there's this crud:
> 
>   #ifdef CONFIG_X86_LOCAL_APIC
> 	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
>   #endif
> 
> which (a) should be guarded by CONFIG_SMP, not X86_LOCAL_APIC, and (b) is only
> actually needed when kvmclock is sched_clock, because timekeeping doesn't actually
> need to start that early.  But of course kvmclock craptastic handling of suspend
> and resume makes untangling that more difficult than it needs to be.
> 
> The icing on the cake is that after cleaning up all the hacks, and having
> kvmclock hook clocksource.suspend/resume like it should, suspend/resume under
> kvmclock corrupts wall clock time because timekeeping_resume() reads the persistent
> clock before resuming clocksource clocks, and the stupid kvmclock wall clock subtly
> consumes the clocksource/system clock.  *sigh*
> 
> I have yet more patches to clean all of this up.  The series is rather unwieldly,
> as it's now sitting at 38 patches (ugh), but I don't see a way to chunk it up in
> a meaningful way, because everything is so intertwined.  :-/


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ