linux-kernel - Re: [PATCH] rcu/tree: consider time a VM was suspended

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YKMbQQ0qBAixXC5p@google.com>
Date:   Tue, 18 May 2021 10:41:21 +0900
From:   Sergey Senozhatsky <senozhatsky@...omium.org>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Sergey Senozhatsky <senozhatsky@...omium.org>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Suleiman Souhlal <suleiman@...gle.com>, rcu@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] rcu/tree: consider time a VM was suspended

On (21/05/17 09:23), Paul E. McKenney wrote:
> On Sun, May 16, 2021 at 07:27:16PM +0900, Sergey Senozhatsky wrote:
> > Soft watchdog timer function checks if a virtual machine
> > was suspended and hence what looks like a lockup in fact
> > is a false positive.
> > 
> > This is what kvm_check_and_clear_guest_paused() does: it
> > tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
> > and if it's set then we need to touch all watchdogs and bail
> > out.
> > 
> > Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
> > check works fine.
> > 
> > There is, however, one more watchdog that runs from IRQ, so
> > watchdog timer fn races with it, and that watchdog is not aware
> > of PVCLOCK_GUEST_STOPPED - RCU stall detector.
> > 
> > apic_timer_interrupt()
> >  smp_apic_timer_interrupt()
> >   hrtimer_interrupt()
> >    __hrtimer_run_queues()
> >     tick_sched_timer()
> >      tick_sched_handle()
> >       update_process_times()
> >        rcu_sched_clock_irq()
> > 
> > This triggers RCU stalls on our devices during VM resume.
> > 
> > If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
> > before watchdog_timer_fn()->kvm_check_and_clear_guest_paused()
> > then there is nothing on this VCPU that touches watchdogs and
> > RCU reads stale gp stall timestamp and new jiffies value, which
> > makes it think that RCU has stalled.
> > 
> > Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
> > don't report RCU stalls when we resume the VM.
> 
> Good point!
> 
> But if I understand your patch correctly, if the virtual machine is
> stopped at any point during a grace period, even if only for a short time,
> stall warnings are suppressed for that grace period forever, courtesy of
> the call to rcu_cpu_stall_reset().  So, if something really is stalling,
> and the virtual machine just happens to stop for a few milliseconds, the
> stall warning is completely suppressed.  Which would make it difficult
> to debug the underlying stall condition.
> 
> Is it possible to provide RCU with information on the duration of the
> virtual-machine stoppage so that RCU could adjust the timing of the
> stall warning?  Maybe by having something like rcu_cpu_stall_reset()
> that takes the duration of the stoppage in jiffies?

Good questions!

And I think I've some bad news and some good news.

As far as I can tell, none of the PVCLOCK_GUEST_STOPPED handlers take
the stoppage duration into consideration. For instance, as soon as
watchdog timer IRQ detects a potential softlockup it checks
PVCLOCK_GUEST_STOPPED and touches all watchdogs, including RCU:

watchdog_timer_fn()
 kvm_check_and_clear_guest_paused()
  pvclock_touch_watchdogs()
   rcu_cpu_stall_reset()                 // + the remaining watchdogs

But things get more complex.

pvclock_clocksource_read() also checks PVCLOCK_GUEST_STOPPED and calls
pvclock_touch_watchdogs(). And this path is executed rather often.

For instance,

apic_timer_interrupt()
 smp_apic_timer_interrupt()
  hrtimer_interrupt()
   __hrtimer_run_queues()
    hrtimer_wakeup()
     try_to_wake_up()
      update_rq_clock()
       sched_clock_cpu()
        sched_clock()
	 kvm_sched_clock_read()
	  kvm_clock_read()
	   pvclock_clocksource_read()
	    pvclock_touch_watchdogs()
	     rcu_cpu_stall_reset()       // + the remaining watchdogs

Or

do_IRQ
 irq_exit
  sched_clock_cpu
   sched_clock
    kvm_sched_clock_read
     kvm_clock_read
      pvclock_clocksource_read
       pvclock_touch_watchdogs
        rcu_cpu_stall_reset()            // + the remaining watchdogs

And so on...

You may wonder what are the good news then.

Well. I'd say that my patch (is not beautiful but it) does not add
a lot of additional or new damage. And it still fixes the valid race
condition, as far as I'm concerned.

I think we need to rework how pvclock_touch_watchdogs() does things
internally, basically what you suggested, and this can be a separate
effort.