linux-kernel - Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using secondary cpus

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAMbhsRQiK=rChiNdwM5M6t=1v_iKwYLhUn=HLW5kSkBzQdso8g@mail.gmail.com>
Date:	Mon, 14 Jan 2013 17:53:40 -0800
From:	Colin Cross <ccross@...roid.com>
To:	Frederic Weisbecker <fweisbec@...il.com>
Cc:	lkml <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Don Zickus <dzickus@...hat.com>,
	Ingo Molnar <mingo@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	liu chuansheng <chuansheng.liu@...el.com>,
	"linux-arm-kernel@...ts.infradead.org" 
	<linux-arm-kernel@...ts.infradead.org>,
	Russell King - ARM Linux <linux@....linux.org.uk>,
	Tony Lindgren <tony@...mide.com>
Subject: Re: [PATCH v2] hardlockup: detect hard lockups without NMIs using
 secondary cpus

On Mon, Jan 14, 2013 at 4:25 PM, Frederic Weisbecker <fweisbec@...il.com> wrote:
> 2013/1/15 Colin Cross <ccross@...roid.com>:
>> On Mon, Jan 14, 2013 at 4:13 PM, Frederic Weisbecker <fweisbec@...il.com> wrote:
>>> I believe this is pretty much what the RCU stall detector does
>>> already: checks for other CPUs being responsive. The only difference
>>> is on how it checks that. For RCU it's about checking for CPUs
>>> reporting quiescent states when requested to do so. In your case it's
>>> about ensuring the hrtimer interrupt is well handled.
>>>
>>> One thing you can do is to enqueue an RCU callback (cal_rcu()) every
>>> minute so you can force other CPUs to report quiescent states
>>> periodically and thus check for lockups.
>>
>> That's a good point, I'll take a look at using that.  A minute is too
>> long, some SoCs have maximum HW watchdog periods of under 30 seconds,
>> but a call_rcu every 10-20 seconds might be sufficient.
>
> Sure. And you can tune CONFIG_RCU_CPU_STALL_TIMEOUT accordingly.

After considering this, I think the hrtimer watchdog is more useful.
RCU stalls are not usually panic events, and I wouldn't want to add a
panic on every RCU stall.  The lack of stack traces on the affected
cpu makes a panic important.  I'm planning to add an ARM DBGPCSR panic
handler, which will be able to dump the PC of a stuck cpu even if it
is not responding to interrupts.  kexec or kgdb on panic might also
allow some inspection of the stack on stuck cpu.

Failing to process interrupts is a much more serious event than an RCU
stall, and being able to detect them separately may be very valuable
for debugging.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/