Message-Id: <20171208012610.GB7829@linux.vnet.ibm.com>
Date:   Thu, 7 Dec 2017 17:26:10 -0800
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Boqun Feng <boqun.feng@...il.com>
Cc:     anna-maria@...utronix.de, tglx@...utronix.de,
        linux-kernel@...r.kernel.org
Subject: Re: Timer refuses to expire

On Fri, Dec 08, 2017 at 09:06:40AM +0800, Boqun Feng wrote:
> On Thu, Dec 07, 2017 at 06:56:17AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 07, 2017 at 03:03:50PM +0800, Boqun Feng wrote:
> > > Hi Paul,
> > > 
> > > On Wed, Dec 06, 2017 at 02:04:21PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Dec 05, 2017 at 03:37:44PM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Dec 04, 2017 at 09:42:08AM -0800, Paul E. McKenney wrote:
> > > > > > On Fri, Dec 01, 2017 at 10:25:29AM -0800, Paul E. McKenney wrote:
> > > > > > > Hello, Anna-Maria,
> > > > > > > 
> > > > > > > It turned out that one set of problems was due to NO_HZ_FULL issues,
> > > > > > > which are addressed with a pair of patches.  However, there is still one
> > > > > > > case of a timer refusing to expire in rcutorture's TREE01 test scenario.
> > > > > > > The problem takes from two to four hours to trigger, and the most recent one
> > > > > > > gives the following diagnostic (see patch at the very end of this email):
> > > > > > > 
> > > > > > > [13166.127458] schedule_timeout: Waylayed timer base->clk: 0x100c40004 jiffies: 0x100c4524e base->next_expiry: 0x100c40004 timer->flags: 0x1000003 timer->expires 0x100c40003 idx: 4 idx_now: ea base->pending_map 000000000000001000000000001080000000000000040000000002000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000
> > > > > > > 
> > > > > > > The message appears any time a timer for less than five jiffies takes
> > > > > > > more than eight seconds to expire.  In all cases, this expiry did not
> > > > > > > happen naturally, but rather via an unsolicited wakeup from the RCU CPU
> > > > > > > stall code.  If I am interpreting this message correctly, base->clk
> > > > > > > has advanced past this timer, and the timer is correctly flagged in
> > > > > > > base->pending_map.  This seems like part of the problem because the
> > > > > > > timer's timeout was only three jiffies.  However, base->clk has not
> > > > > > > advanced for more than 20 seconds, which seems like another problem.
> > > > > > > 
> > > > > > > What additional diagnostics would be helpful?  I can capture data
> > > > > > > at the beginning of the schedule_timeout in the timer_list structure,
> > > > > > > and of course can print more information at the time of the wakeup.
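> > > > > > > 
> > > > > > > For concreteness, the check being described is roughly the following
> > > > > > > sketch (not the actual rcutorture diagnostic patch, and the helper
> > > > > > > name is just for illustration): complain whenever a sub-five-jiffy
> > > > > > > timeout takes more than eight seconds to come back.
> > > > > > > 
> > > > > > > 	static signed long schedule_timeout_checked(signed long timeout)
> > > > > > > 	{
> > > > > > > 		unsigned long start = jiffies;
> > > > > > > 		signed long ret;
> > > > > > > 
> > > > > > > 		set_current_state(TASK_UNINTERRUPTIBLE);
> > > > > > > 		ret = schedule_timeout(timeout);
> > > > > > > 		if (timeout < 5 && time_after(jiffies, start + 8 * HZ))
> > > > > > > 			pr_info("short %ld-jiffy timeout took %lu jiffies\n",
> > > > > > > 				timeout, jiffies - start);
> > > > > > > 		return ret;
> > > > > > > 	}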
> > > > > > 
> > > > > > And on the off-chance that the following messages from this weekend's
> > > > > > runs are at all helpful.  One interesting point is that starting at
> > > > > > time 4731.872311, there are repeated enqueues to CPU 5's timer wheel,
> > > > > > but CPU 5's ->clk does not advance.  Things seem to advance at
> > > > > > time 4891.721190.
> > > > > 
> > > > > Another layer on the onion...  For at least some of the failures, there
> > > > > is a stalled CPU-hotplug operation.  This of course can mean that the
> > > > > timers are stuck on the partially offlined CPU.  So I dumped the stack
> > > > > of the task that is taking the CPU offline.  Please see below for console
> > > > > output and patch.
> > > > > 
> > > > > I am considering doing an unsolicited wakeup of the task doing the
> > > > > hotplug operation, but I am not convinced that the entirety of the CPU
> > > > > hotplug code is willing to put up with that sort of thing.
> > > > 
> > > > What I did instead was to dump out the state of the task that
> > > > __cpuhp_kick_ap() waits on, please see the patch at the very end of this
> > > > email.  This triggered as shown below, and you guessed it, that task is
> > > > waiting on a grace period.  Which I am guessing won't happen until the
> > > > outgoing CPU reaches CPUHP_TIMERS_DEAD state and calls timers_dead_cpu().
> > > > Which will prevent RCU's grace-period kthread from ever awakening, which
> > > > will prevent the task that __cpuhp_kick_ap() waits on from ever awakening,
> > > > which will prevent the outgoing CPU from reaching CPUHP_TIMERS_DEAD state.
> > > > 
> > > > Deadlock.
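> > > > 
> > > > Laid end to end, the cycle I believe we are looking at is roughly:
> > > > 
> > > > 	_cpu_down()
> > > > 	  -> __cpuhp_kick_ap() waits for the cpuhp/N thread
> > > > 	    -> cpuhp/N, in sched_cpu_deactivate(), waits for an RCU
> > > > 	       grace period
> > > > 	      -> the grace period needs the grace-period kthread's
> > > > 	         schedule_timeout() timer to fire
> > > > 	        -> that timer lives on the outgoing CPU's wheel, which is
> > > > 	           only migrated at CPUHP_TIMERS_DEAD by timers_dead_cpu()
> > > > 	          -> which cannot happen until __cpuhp_kick_ap() returns.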
> > > 
> > > There is one thing I'm confused about here. Sure, this is a deadlock, but
> > > the timer should still work in such a deadlock, right? I mean, the timer
> > > of schedule_timeout() should be able to wake up rcu_gp_kthread() even in
> > > this case? And yes, the gp kthread will continue to wait due to the
> > > deadlock, but the deadlock cannot explain the "Waylayed timer", right?
> > 
> > My belief is that the timer cannot fire because it is still on the
> > offlined CPU, and that CPU has not yet reached timers_dead_cpu().
> > But I might be missing something subtle in either the timers code or the
> > CPU-hotplug code, so please do check my reasoning here.  (I am relying on
> > the "timer->flags: 0x40000007" and the "cpuhp/7" below, which I believe
> > means that the timer is on CPU 7 and that it is CPU 7 that is in the
> > process of going offline.)
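> > 
> > As a cross-check, the flags can be decoded like this, assuming the
> > usual <linux/timer.h> layout in this era of kernels (sketch only; the
> > low TIMER_CPUMASK bits hold the CPU, the upper TIMER_ARRAYMASK bits
> > hold the wheel index):
> > 
> > 	static unsigned int timer_flags_cpu(u32 flags)
> > 	{
> > 		return flags & TIMER_CPUMASK;	/* 0x40000007 -> CPU 7 */
> > 	}
> > 
> > 	static unsigned int timer_flags_idx(u32 flags)
> > 	{
> > 		/* 0x40000007 -> 0x100, matching "idx: 100" in the output below */
> > 		return (flags & TIMER_ARRAYMASK) >> TIMER_ARRAYSHIFT;
> > 	}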
> 
> So if CPU 7 is going offline and has not yet reached timers_dead_cpu(),
> the timer on CPU 7 cannot fire? I'm not sure this is true: because
> timers_dead_cpu() has not yet been reached, hrtimers and softirqs on
> CPU 7 should still be working, so the timer base should still be working
> well on CPU 7, no? I didn't find a place where we "disable" the timer
> base on the hotplug path. But maybe I'm missing something...

Something clearly isn't waking up the grace-period kthread, whose timer is
on CPU 7's timer wheel.  But your point that interrupts and softirqs
still seem like they should be working at that time is a good one.

> > The "Waylayed timer" happens because the RCU CPU stall warning code
> > wakes up the grace-period kthread.  This is driven out of the
> > scheduling-clock tick, so is unaffected by timers, though it does
> > rely on the jiffies counter continuing to be incremented.
> 
> I mean, if the timer base were working properly, there should be no
> "Waylayed timer": the timer of schedule_timeout() would wake up the gp
> kthread instead of the RCU CPU stall code.

I will agree that something is broken somewhere.  ;-)

							Thanx, Paul

> Regards,
> Boqun
> 
> > So what am I missing here?
> > 
> > 							Thanx, Paul
> > 
> > > Regards,
> > > Boqun
> > > 
> > > > Does that make sense, or am I missing a trick here?
> > > > 
> > > > I tried invoking timers_dead_cpu() from sched_cpu_deactivate(), but
> > > > that triggered the BUG_ON(cpu_online(cpu)).  I removed this BUG_ON(),
> > > > and appear to have deadlocked on the timer ->lock.
> > > > 
> > > > Any thoughts on what else to try?
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > [ 2939.381210] schedule_timeout: Waylayed timer base->clk: 0x1002709c3 jiffies: 0x100284614 base->next_expiry: 0x1002709c3 timer->flags: 0x40000007 timer->expires 0x10027f3a2 idx: 100 idx_now: 105 base->pending_map 000000000000000800000000000000000002000000000000020080400002000868000000000000010000000001000000000000000000000000000000000000000000000000000000
> > > > [ 2939.381219] Torture onoff task state:
> > > > [ 2939.381221] torture_onoff   D14584   842      2 0x80000000
> > > > [ 2939.381239] Call Trace:
> > > > [ 2939.381245]  ? __schedule+0x33c/0x6f0
> > > > [ 2939.381248]  ? preempt_count_add+0x51/0x90
> > > > [ 2939.381250]  schedule+0x37/0x90
> > > > [ 2939.381270]  schedule_timeout+0x20d/0x4c0
> > > > [ 2939.381276]  wait_for_completion+0x95/0x100
> > > > [ 2939.381282]  ? wake_up_q+0x60/0x60
> > > > [ 2939.381299]  __cpuhp_kick_ap+0x49/0x60
> > > > [ 2939.381313]  cpuhp_kick_ap+0x29/0x70
> > > > [ 2939.381316]  cpuhp_kick_ap_work+0x34/0x130
> > > > [ 2939.381322]  _cpu_down+0x17e/0x1f0
> > > > [ 2939.381326]  do_cpu_down+0x30/0x50
> > > > [ 2939.381354]  torture_offline+0x76/0x140
> > > > [ 2939.381368]  torture_onoff+0xef/0x1c0
> > > > [ 2939.381371]  kthread+0xf0/0x130
> > > > [ 2939.381374]  ? torture_kthread_stopping+0x70/0x70
> > > > [ 2939.381376]  ? kthread_destroy_worker+0x40/0x40
> > > > [ 2939.381378]  ret_from_fork+0x1f/0x30
> > > > [ 2939.381381] __cpuhp_kick_ap task state:
> > > > [ 2939.381383] cpuhp/7         D14504    46      2 0x80000000
> > > > [ 2939.381399] Call Trace:
> > > > [ 2939.381403]  ? __schedule+0x33c/0x6f0
> > > > [ 2939.381406]  schedule+0x37/0x90
> > > > [ 2939.381408]  schedule_timeout+0x20d/0x4c0
> > > > [ 2939.381414]  ? enqueue_task_fair+0x556/0x12b0
> > > > [ 2939.381417]  wait_for_completion+0x95/0x100
> > > > [ 2939.381419]  ? wake_up_q+0x60/0x60
> > > > [ 2939.381431]  __wait_rcu_gp+0x10d/0x140
> > > > [ 2939.381441]  ? sched_cpu_activate+0xc0/0xc0
> > > > [ 2939.381443]  sched_cpu_deactivate+0x36/0xa0
> > > > [ 2939.381453]  ? kfree_call_rcu+0x20/0x20
> > > > [ 2939.381454]  ? __call_rcu+0x530/0x530
> > > > [ 2939.381457]  ? __rcu_read_unlock+0x50/0x50
> > > > [ 2939.381459]  ? __rcu_read_unlock+0x50/0x50
> > > > [ 2939.381462]  cpuhp_invoke_callback+0xa8/0x610
> > > > [ 2939.381466]  cpuhp_thread_fun+0xc5/0x150
> > > > [ 2939.381469]  smpboot_thread_fn+0x15c/0x220
> > > > [ 2939.381476]  kthread+0xf0/0x130
> > > > [ 2939.381478]  ? sort_range+0x20/0x20
> > > > [ 2939.381480]  ? kthread_destroy_worker+0x40/0x40
> > > > [ 2939.381482]  ret_from_fork+0x1f/0x30
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > commit aa2b802cb432a725589884d2ba47833de21e0871
> > > > Author: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> > > > Date:   Wed Dec 6 09:46:30 2017 -0800
> > > > 
> > > >     EXP: timer: Dump __cpuhp_kick_ap task state
> > > >     
> > > >     Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> > > > 
> > > > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > > > index 04892a82f6ac..dc9098946b3a 100644
> > > > --- a/kernel/cpu.c
> > > > +++ b/kernel/cpu.c
> > > > @@ -381,6 +381,8 @@ cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state)
> > > >  	st->bringup = !st->bringup;
> > > >  }
> > > >  
> > > > +void schedule_timeout_set_task3dump(struct task_struct *t);
> > > > +
> > > >  /* Regular hotplug invocation of the AP hotplug thread */
> > > >  static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st)
> > > >  {
> > > > @@ -395,7 +397,9 @@ static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st)
> > > >  	smp_mb();
> > > >  	st->should_run = true;
> > > >  	wake_up_process(st->thread);
> > > > +	schedule_timeout_set_task3dump(st->thread);
> > > >  	wait_for_ap_thread(st, st->bringup);
> > > > +	schedule_timeout_set_task3dump(NULL);
> > > >  }
> > > >  
> > > >  static int cpuhp_kick_ap(struct cpuhp_cpu_state *st, enum cpuhp_state target)
> > > > diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> > > > index 0a67df4cdeb6..532b43d05219 100644
> > > > --- a/kernel/time/timer.c
> > > > +++ b/kernel/time/timer.c
> > > > @@ -1725,6 +1725,7 @@ static void process_timeout(struct timer_list *t)
> > > >  }
> > > >  
> > > >  static struct task_struct *schedule_timeout_task2dump;
> > > > +static struct task_struct *schedule_timeout_task3dump;
> > > >  static DEFINE_MUTEX(schedule_timeout_task2dump_mutex);
> > > >  void schedule_timeout_set_task2dump(struct task_struct *t)
> > > >  {
> > > > @@ -1732,7 +1733,12 @@ void schedule_timeout_set_task2dump(struct task_struct *t)
> > > >  	WRITE_ONCE(schedule_timeout_task2dump, t);
> > > >  	mutex_unlock(&schedule_timeout_task2dump_mutex);
> > > >  }
> > > > -EXPORT_SYMBOL_GPL(schedule_timeout_set_task2dump);
> > > > +void schedule_timeout_set_task3dump(struct task_struct *t)
> > > > +{
> > > > +	mutex_lock(&schedule_timeout_task2dump_mutex);
> > > > +	WRITE_ONCE(schedule_timeout_task3dump, t);
> > > > +	mutex_unlock(&schedule_timeout_task2dump_mutex);
> > > > +}
> > > >  
> > > >  /**
> > > >   * schedule_timeout - sleep until timeout
> > > > @@ -1824,6 +1830,10 @@ signed long __sched schedule_timeout(signed long timeout)
> > > >  				pr_info("Torture onoff task state:\n");
> > > >  				sched_show_task(schedule_timeout_task2dump);
> > > >  			}
> > > > +			if (schedule_timeout_task3dump) {
> > > > +				pr_info("__cpuhp_kick_ap task state:\n");
> > > > +				sched_show_task(schedule_timeout_task3dump);
> > > > +			}
> > > >  			mutex_unlock(&schedule_timeout_task2dump_mutex);
> > > >  		}
> > > >  	}
> > > > 
> > 
> > 

