linux-kernel - Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug

Message-Id: <20180126220917.GI3741@linux.vnet.ibm.com>
Date:   Fri, 26 Jan 2018 14:09:17 -0800
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Sebastian Sewior <bigeasy@...utronix.de>,
        Anna-Maria Gleixner <anna-maria@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug

On Fri, Jan 26, 2018 at 02:54:32PM +0100, Thomas Gleixner wrote:
> The hrtimer interrupt code contains a hang detection and mitigation
> mechanism, which prevents a long-delayed hrtimer interrupt from causing
> continuous retriggering of interrupts that would keep the system from
> making progress. If a hang is detected, the timer hardware is programmed
> with a certain delay into the future and a flag is set in the hrtimer cpu
> base which prevents newly enqueued timers from reprogramming the timer
> hardware prior to the chosen delay. The subsequent hrtimer interrupt after
> the delay clears the flag and resumes normal operation.
> 
> If such a hang happens in the last hrtimer interrupt before a CPU is
> unplugged then the hang_detected flag is set and stays that way when the
> CPU is plugged in again. At that point the timer hardware is not armed and
> it cannot be armed because the hang_detected flag is still active, so
> nothing clears that flag. As a consequence the CPU does not receive hrtimer
> interrupts and no timers expire on that CPU which results in RCU stalls and
> other malfunctions.
> 
> Clear the flag along with some other less critical members of the hrtimer
> cpu base to ensure starting from a clean state when a CPU is plugged in.
> 
> Thanks to Paul, Sebastian and Anna-Maria for their help in getting down to
> the root cause of that hard-to-reproduce heisenbug. Once understood it's
> trivial and certainly justifies a brown paper bag.

Thank you very much, and I do know that feeling!  After reading the
commit log, I feel significantly less incompetent for having failed to
find this one.  ;-)  But it did pass rcutorture testing for a great many
years, didn't it?  :-/

I have started an eight-hour seven-way test on the dreaded rcutorture
TREE01 scenario.  In the meantime, off to the train!

							Thanx, Paul

> Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
> Reported-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Cc: stable@...r.kernel.org
> ---
>  kernel/time/hrtimer.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -655,7 +655,9 @@ static void hrtimer_reprogram(struct hrt
>  static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
>  {
>  	base->expires_next = KTIME_MAX;
> +	base->hang_detected = 0;
>  	base->hres_active = 0;
> +	base->next_timer = NULL;
>  }
> 
>  /*
> @@ -1589,6 +1591,7 @@ int hrtimers_prepare_cpu(unsigned int cp
>  		timerqueue_init_head(&cpu_base->clock_base[i].active);
>  	}
> 
> +	cpu_base->active_bases = 0;
>  	cpu_base->cpu = cpu;
>  	hrtimer_init_hres(cpu_base);
>  	return 0;
>
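
For readers who want to see the failure mode in isolation, here is a
minimal userspace sketch of the stale-flag problem described in the commit
log. It is not kernel code: struct toy_cpu_base and all toy_* functions are
hypothetical stand-ins, and the only behavior borrowed from the patch is
that reprogramming is refused while hang_detected is set and that the fix
clears the flag when the CPU is prepared for plugging in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_KTIME_MAX INT64_MAX

/* Toy stand-in for the hrtimer cpu base; NOT the kernel structure. */
struct toy_cpu_base {
	int64_t expires_next;	/* next programmed expiry */
	bool hang_detected;	/* set by the hang mitigation */
	bool hres_active;
	bool hw_armed;		/* simulated state of the timer hardware */
};

/* Newly enqueued timers may not touch the hardware while a hang is flagged. */
static bool toy_reprogram(struct toy_cpu_base *base, int64_t expires)
{
	assert(base != NULL);
	if (base->hang_detected)
		return false;
	base->expires_next = expires;
	base->hw_armed = true;
	return true;
}

/* Unplug: the hardware is disarmed, the software state is left untouched. */
static void toy_cpu_offline(struct toy_cpu_base *base)
{
	base->hw_armed = false;
}

/* Plug in: with_fix selects whether hang_detected is reset as in the patch. */
static void toy_prepare_cpu(struct toy_cpu_base *base, bool with_fix)
{
	base->expires_next = TOY_KTIME_MAX;
	if (with_fix)
		base->hang_detected = false;
	base->hres_active = false;
}

/* Returns true if a timer can be armed after hang, unplug and replug. */
static bool toy_replug_can_arm(bool with_fix)
{
	struct toy_cpu_base base = { .expires_next = TOY_KTIME_MAX };

	base.hang_detected = true;	/* hang in the last interrupt before unplug */
	toy_cpu_offline(&base);
	toy_prepare_cpu(&base, with_fix);
	return toy_reprogram(&base, 1000);
}
```

Under these assumptions, toy_replug_can_arm(false) models the pre-patch
behavior, where the replugged CPU can never arm its timer, and
toy_replug_can_arm(true) models the behavior with the flag reset.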