Message-ID: <20210521180127.GD4441@paulmck-ThinkPad-P17-Gen-1>
Date: Fri, 21 May 2021 11:01:27 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Sergey Senozhatsky <senozhatsky@...omium.org>
Cc: Josh Triplett <josh@...htriplett.org>,
Steven Rostedt <rostedt@...dmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>,
Joel Fernandes <joel@...lfernandes.org>, rcu@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHv2 1/2] rcu/tree: handle VM stoppage in stall detection

On Sat, May 22, 2021 at 12:56:23AM +0900, Sergey Senozhatsky wrote:
> Soft watchdog timer function checks if a virtual machine
> was suspended and hence what looks like a lockup in fact
> is a false positive.
>
> This is what kvm_check_and_clear_guest_paused() does: it
> tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
> and if it's set then we need to touch all watchdogs and bail
> out.
>
> Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
> check works fine.
>
> There is, however, one more watchdog that runs from IRQ, so
> watchdog timer fn races with it, and that watchdog is not aware
> of PVCLOCK_GUEST_STOPPED - RCU stall detector.
>
>   apic_timer_interrupt()
>    smp_apic_timer_interrupt()
>     hrtimer_interrupt()
>      __hrtimer_run_queues()
>       tick_sched_timer()
>        tick_sched_handle()
>         update_process_times()
>          rcu_sched_clock_irq()
>
> This triggers RCU stalls on our devices during VM resume.
>
> If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
> before watchdog_timer_fn()->kvm_check_and_clear_guest_paused()
> then there is nothing on this VCPU that touches watchdogs and
> RCU reads stale gp stall timestamp and new jiffies value, which
> makes it think that RCU has stalled.
>
> Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
> don't report RCU stalls when we resume the VM.
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@...omium.org>

I have queued both for testing and further review, thank you!

							Thanx, Paul

> ---
>
> v2: fixed powerpc build breakage
>
> kernel/rcu/tree_stall.h | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index d574e3bbd929..bc689911a81d 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -7,6 +7,8 @@
>   * Author: Paul E. McKenney <paulmck@...ux.ibm.com>
>   */
>
> +#include <linux/kvm_para.h>
> +
> //////////////////////////////////////////////////////////////////////////////
> //
> // Controlling CPU stall warnings, including delay calculation.
> @@ -698,6 +700,14 @@ static void check_cpu_stall(struct rcu_data *rdp)
>  	    (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
>  	    cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>  
> +		/*
> +		 * If a virtual machine is stopped by the host it can look to
> +		 * the watchdog like an RCU stall. Check to see if the host
> +		 * stopped the vm.
> +		 */
> +		if (kvm_check_and_clear_guest_paused())
> +			return;
> +
>  		/* We haven't checked in, so go dump stack. */
>  		print_cpu_stall(gps);
>  		if (READ_ONCE(rcu_cpu_stall_ftrace_dump))
> @@ -707,6 +717,14 @@ static void check_cpu_stall(struct rcu_data *rdp)
>  		   ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
>  		   cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>  
> +		/*
> +		 * If a virtual machine is stopped by the host it can look to
> +		 * the watchdog like an RCU stall. Check to see if the host
> +		 * stopped the vm.
> +		 */
> +		if (kvm_check_and_clear_guest_paused())
> +			return;
> +
>  		/* They had a few time units to dump stack, so complain. */
>  		print_other_cpu_stall(gs2, gps);
>  		if (READ_ONCE(rcu_cpu_stall_ftrace_dump))
> --
> 2.31.1.818.g46aad6cb9e-goog
>
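
For readers following the thread, the race described in the commit message (a stale stall deadline compared against a jiffies value that jumped while the VM was stopped) can be sketched in user space roughly as follows. This is only an illustration, not kernel code: every sketch_* name, the struct layout and the vm_aware switch are made up for the example, the guest_stopped field merely stands in for PVCLOCK_GUEST_STOPPED, and the deadline update is a simplification of the cmpxchg() on rcu_state.jiffies_stall.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "a >= b" for jiffies-style counters (same idea as ULONG_CMP_GE). */
#define SKETCH_ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))

/* Hypothetical, simplified model of the state the stall check looks at. */
struct sketch_state {
	unsigned long jiffies;		/* current time, jumps across a VM pause */
	unsigned long jiffies_stall;	/* deadline after which a stall is reported */
	bool guest_stopped;		/* stands in for PVCLOCK_GUEST_STOPPED */
};

/* Stand-in for kvm_check_and_clear_guest_paused(): test the flag, then clear it. */
static bool sketch_check_and_clear_guest_paused(struct sketch_state *s)
{
	bool was_stopped = s->guest_stopped;

	s->guest_stopped = false;
	return was_stopped;
}

/*
 * Returns true when a stall would be reported.  With vm_aware == false this
 * models the behaviour before the patch, with vm_aware == true the behaviour
 * after it.
 */
static bool sketch_check_cpu_stall(struct sketch_state *s,
				   unsigned long timeout, bool vm_aware)
{
	/* Deadline not reached yet: nothing to report. */
	if (!SKETCH_ULONG_CMP_GE(s->jiffies, s->jiffies_stall))
		return false;

	/* Push the deadline out, a simplification of the cmpxchg() step. */
	s->jiffies_stall = s->jiffies + timeout;

	/* The added check: a stopped VM only looks like a stall. */
	if (vm_aware && sketch_check_and_clear_guest_paused(s))
		return false;

	return true;
}
```

In this model, advancing jiffies far past jiffies_stall while guest_stopped is set produces a report when vm_aware is false and none when it is true. The deadline is pushed out in both cases, so the next tick does not immediately report either.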