linux-kernel - Re: Timer refuses to expire

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20171208003151.GA17264@linux.vnet.ibm.com>
Date:   Thu, 7 Dec 2017 16:31:51 -0800
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Boqun Feng <boqun.feng@...il.com>
Cc:     anna-maria@...utronix.de, tglx@...utronix.de,
        linux-kernel@...r.kernel.org
Subject: Re: Timer refuses to expire

On Thu, Dec 07, 2017 at 01:45:14PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 07, 2017 at 06:56:17AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 07, 2017 at 03:03:50PM +0800, Boqun Feng wrote:
> 
> [ . . . ]
> 
> > > > What I did instead was to dump out the state of the task that
> > > > __cpuhp_kick_ap() waits on, please see the patch at the very end of this
> > > > email.  This triggered as shown below, and you guessed it, that task is
> > > > waiting on a grace period.  Which I am guessing won't happen until the
> > > > outgoing CPU reaches CPUHP_TIMERS_DEAD state and calls timers_dead_cpu().
> > > > Which will prevent RCU's grace-period kthread from ever awakening, which
> > > > will prevent the task that __cpuhp_kick_ap() waits on from ever awakening,
> > > > which will prevent the outgoing CPU from reaching CPUHP_TIMERS_DEAD state.
> > > > 
> > > > Deadlock.
> > > 
> > > There is one thing I'm confused here. Sure, this is a deadlock, but the
> > > timer should still work in such a deadlock, right? I mean, the timer of
> > > schedule_timeout() should be able to wake up rcu_gp_kthread() even in
> > > this case? And yes, the gp kthread will continue to wait due to the
> > > deadlock, but the deadlock can not explain the "Waylayed timer", right?
> > 
> > My belief is that the timer cannot fire because it is still on the
> > offlined CPU, and that CPU has not yet reached timers_dead_cpu().
> > But I might be missing something subtle in either the timers code or the
> > CPU-hotplug code, so please do check my reasoning here.  (I am relying on
> > the "timer->flags: 0x40000007" and the "cpuhp/7" below, which I believe
> > means that the timer is on CPU 7 and that it is CPU 7 that is in the
> > process of going offline.)
> > 
> > The "Waylayed timer" happens because the RCU CPU stall warning code
> > wakes up the grace-period kthread.  This is driven out of the
> > scheduling-clock tick, so is unaffected by timers, though it does
> > rely on the jiffies counter continuing to be incremented.
> > 
> > So what am I missing here?
> 
> Well, last night's runs had situations where the ->flags CPU didn't
> match the CPU going offline, so I am clearly missing something or another.
> 
> One thing I might have been missing was the CPU-online processing.
> What happens if a CPU goes offline, comes back online, but before ->clk
> gets adjusted there is a schedule_timeout()?  Now, schedule_timeout()
> does compute the absolute ->expires time using jiffies, so the wakeup time
> should not be too far off of the desired time.  Except that the timers
> now have something like 8% slop, and that slop will be calculated on the
> difference between the desired expiration time and the (way outdated)
> ->clk value.  So the 8% might be a rather large number.  For example,
> if the CPU was offline for 12 minutes (unlikely but entirely possible
> with rcutorture testing's random onlining and offlining), the slop on
> a 3-millisecond timer might be a full minute.
> 
> To my timer-naive eyes, it looks like a simple fix is to set
> old_base->must_forward_clk to true in timers_dead_cpu() for each
> timer_base, as shown below.  The other possibility that I considered
> was to instead set ->is_idle, but that looked like an engraved
> invitation to send IPIs to offline CPUs.
> 
> I am giving it a spin.  I still believe that the offline deadlock
> scenario can happen, but one thing at a time...
> 
> Thoughts?

And it is hard to tell whether or not this is helping.  Not too
surprising, given that most of the splats seem to be the deadlock
case instead.

							Thanx, Paul