lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z6onnUthSBUVAklf@google.com>
Date: Mon, 10 Feb 2025 08:21:49 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "x86@...nel.org" <x86@...nel.org>, 
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>, Juergen Gross <jgross@...e.com>, 
	"K. Y. Srinivasan" <kys@...rosoft.com>, Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, 
	Dexuan Cui <decui@...rosoft.com>, Ajay Kaher <ajay.kaher@...adcom.com>, 
	Jan Kiszka <jan.kiszka@...mens.com>, Paolo Bonzini <pbonzini@...hat.com>, 
	Andy Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, 
	"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>, 
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>, Nikunj A Dadhania <nikunj@....com>, 
	Tom Lendacky <thomas.lendacky@....com>
Subject: Re: [PATCH 16/16] x86/kvmclock: Use TSC for sched_clock if it's
 constant and non-stop

On Sat, Feb 08, 2025, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@...gle.com> Sent: Friday, February 7, 2025 9:23 AM
> > 
> > Dropping a few people/lists whose emails are bouncing.
> > 
> > On Fri, Jan 31, 2025, Sean Christopherson wrote:
> > > @@ -369,6 +369,11 @@ void __init kvmclock_init(void)
> > >  #ifdef CONFIG_X86_LOCAL_APIC
> > >  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
> > >  #endif
> > > +	/*
> > > +	 * Save/restore "sched" clock state even if kvmclock isn't being used
> > > +	 * for sched_clock, as kvmclock is still used for wallclock and relies
> > > +	 * on these hooks to re-enable kvmclock after suspend+resume.
> > 
> > This is wrong, wallclock is a different MSR entirely.
> > 
> > > +	 */
> > >  	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
> > >  	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> > 
> > And usurping sched_clock save/restore is *really* wrong if kvmclock isn't being
> > used as sched_clock, because when TSC is reset on suspend/hiberation, not doing
> > tsc_{save,restore}_sched_clock_state() results in time going haywire.
> > 
> > Subtly, that issue goes all the way back to patch "x86/paravirt: Don't use a PV
> > sched_clock in CoCo guests with trusted TSC" because pulling the rug out from
> > under kvmclock leads to the same problem.
> > 
> > The whole PV sched_clock scheme is a disaster.
> > 
> > Hyper-V overrides the save/restore callbacks, but _also_ runs the old TSC callbacks,
> > because Hyper-V doesn't ensure that it's actually using the Hyper-V clock for
> > sched_clock.  And the code is all kinds of funky, because it tries to keep the
> > x86 code isolated from the generic HV clock code, but (a) there's already x86 PV
> > specific code in drivers/clocksource/hyperv_timer.c, and (b) splitting the code
> > means that Hyper-V overides the sched_clock save/restore hooks even when
> > PARAVIRT=n, i.e. when HV clock can't possibly be used as sched_clock.
> 
> Regarding (a), the one occurrence of x86 PV-specific code hyperv_timer.c is
> the call to paravirt_set_sched_clock(), and it's under an #ifdef sequence so that
> it's not built if targeting some other architecture. Or do you see something else
> that is x86-specific?
> 
> Regarding (b), in drivers/hv/Kconfig, CONFIG_HYPERV always selects PARAVIRT.
> So the #else clause (where PARAVIRT=n) in that #ifdef sequence could arguably
> have a BUILD_BUG() added. If I recall correctly, other Hyper-V stuff breaks if
> PARAVIRT is forced to "n". So I don't think there's a current problem with the
> sched_clock save/restore hooks. i

Oh, there are no build issues, and all of the x86 bits are nicely cordoned off.
My complaint is essentially that they're _too_ isolated; putting the sched_clock
save/restore setup in arch/x86/kernel/cpu/mshyperv.c is well-intentioned, but IMO
it does more harm than good because the split makes it difficult to connect the
dots to hv_setup_sched_clock() in drivers/clocksource/hyperv_timer.c.

> But I would be good with some restructuring so that setting the sched clock
> save/restore hooks is more closely tied to the sched clock choice,

Yeah, this is the intent of my ranting.  After the dust settles, the code can
look like this.

---
#ifdef CONFIG_GENERIC_SCHED_CLOCK
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/*
	 * We're on an architecture with generic sched clock (not x86/x64).
	 * The Hyper-V sched clock read function returns nanoseconds, not
	 * the normal 100ns units of the Hyper-V synthetic clock.
	 */
	sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
}
#elif defined CONFIG_PARAVIRT
static u64 hv_ref_counter_at_suspend;
/*
 * Hyper-V clock counter resets during hibernation. Save and restore clock
 * offset during suspend/resume, while also considering the time passed
 * before suspend. This is to make sure that sched_clock using hv tsc page
 * based clocksource, proceeds from where it left off during suspend and
 * it shows correct time for the timestamps of kernel messages after resume.
 */
static void hv_save_sched_clock_state(void)
{
	hv_ref_counter_at_suspend = hv_read_reference_counter();
}

static void hv_restore_sched_clock_state(void)
{
	/*
	 * Adjust the offsets used by hv tsc clocksource to
	 * account for the time spent before hibernation.
	 * adjusted value = reference counter (time) at suspend
	 *                - reference counter (time) now.
	 */
	hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
}

static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
	/* We're on x86/x64 *and* using PV ops */
	paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
				 hv_restore_sched_clock_state);
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
#endif /* CONFIG_GENERIC_SCHED_CLOCK */
---

> as long as the architecture independence of hyperv_timer.c is preserved.

LOL, ah yes, the architecture independence of MSRs and TSC :-D

Teasing aside, the code is firmly x86-only at the moment.  It's selectable only
by x86:

  config HYPERV_TIMER
	def_bool HYPERV && X86
 
and since at least commit e39acc37db34 ("clocksource: hyper-v: Provide noinstr
sched_clock()") there are references to symbols/functions that are provided only
by x86.

I assume arm64 support is a WIP, but keeping the upstream code arch independent
isn't very realistic if the code can't be at least compile-tested.  To help
drive-by contributors like myself, maybe select HYPER_TIMER on arm64 for
COMPILE_TEST=y builds?

  config HYPERV_TIMER
	def_bool HYPERV && (X86 || (COMPILE_TEST && ARM64))

I have no plans to touch code outside of CONFIG_PARAVIRT, i.e. outside of code
that is explicitly x86-only, but something along those lines would help people
like me understand the goal/intent, and in theory would also help y'all maintain
the code by detecting breakage.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ