lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z4hDK27OV7wK572A@google.com>
Date: Wed, 15 Jan 2025 15:22:19 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: linux-kernel@...r.kernel.org
Cc: linux-tip-commits@...r.kernel.org, 
	"Peter Zijlstra (Intel)" <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>, x86@...nel.org, 
	kvm@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [tip: sched/core] sched/clock/x86: Mark sched_clock() noinstr

+KVM and Paolo

On Tue, Jan 31, 2023, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     8739c6811572b087decd561f96382087402cc343
> Gitweb:        https://git.kernel.org/tip/8739c6811572b087decd561f96382087402cc343
> Author:        Peter Zijlstra <peterz@...radead.org>
> AuthorDate:    Thu, 26 Jan 2023 16:08:36 +01:00
> Committer:     Ingo Molnar <mingo@...nel.org>
> CommitterDate: Tue, 31 Jan 2023 15:01:47 +01:00
> 
> sched/clock/x86: Mark sched_clock() noinstr
> 
> In order to use sched_clock() from noinstr code, mark it and all it's
> implenentations noinstr.
> 
> The whole pvclock thing (used by KVM/Xen) is a bit of a pain,
> since it calls out to watchdogs, create a
> pvclock_clocksource_read_nowd() variant doesn't do that and can be
> noinstr.

...

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 16333ba..0f35d44 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -71,12 +71,12 @@ static int kvm_set_wallclock(const struct timespec64 *now)
>  	return -ENODEV;
>  }
>  
> -static u64 kvm_clock_read(void)
> +static noinstr u64 kvm_clock_read(void)
>  {
>  	u64 ret;
>  
>  	preempt_disable_notrace();
> -	ret = pvclock_clocksource_read(this_cpu_pvti());
> +	ret = pvclock_clocksource_read_nowd(this_cpu_pvti());
>  	preempt_enable_notrace();
>  	return ret;
>  }
> @@ -86,7 +86,7 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)
>  	return kvm_clock_read();
>  }
>  
> -static u64 kvm_sched_clock_read(void)
> +static noinstr u64 kvm_sched_clock_read(void)
>  {
>  	return kvm_clock_read() - kvm_sched_clock_offset;
>  }
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 5a2a517..56acf53 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -64,7 +64,8 @@ u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
>  	return flags & valid_flags;
>  }
>  
> -u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> +static __always_inline
> +u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
>  {
>  	unsigned version;
>  	u64 ret;
> @@ -77,7 +78,7 @@ u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
>  		flags = src->flags;
>  	} while (pvclock_read_retry(src, version));
>  
> -	if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> +	if (dowd && unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {

This effectively broke sched_clock handling of PVCLOCK_GUEST_STOPPED, which was
added by commit d63285e94af3 ("pvclock: detect watchdog reset at pvclock read").

By skipping the watchdog goo in the kvm_clock_read() path, the watchdogs will only
be petted when kvmclock's pvclock_read_wallclock() is invoked, which AFAICT is
limited to guest boot and suspend+resume (in the guest).  I.e. PVCLOCK_GUEST_STOPPED
is ignored until the *guest* suspends, which completely defeats the purpose of
petting the watchdogs when reading the clock.

I'm guessing no one has noticed/complained because watchdog_timer_fn,
wq_watchdog_timer_fn(), and check_cpu_stall() all manually handling a STOPPED
vCPU by invoking kvm_check_and_clear_guest_paused().  At a glance, only the
hung task detector isn't in on the game, but its doggie still gets petted so
long as one of the other watchdogs runs first.

One option would be to sprinkle noinstr everywhere, because despite
pvclock_touch_watchdogs() calling a lot of functions, all of those functions
are little more than one-liners, i.e. can be noinstr without significant downsides.

Alternatively, we could explicitly revert commit d63285e94af3, and then bribe
someone into finding a better/common place to handle PVCLOCK_GUEST_STOPPED.
I am leaning heavily toward this alternative, because the way things are headed
with kvmclock is that the kernel will stop using it for sched_clock[*], at which
point VMs that run on modern setups won't get the sched_clock behavior anyways.

[*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ