linux-kernel - Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection


Message-ID: <CA+ekxPUKnwNgDLdNhqitukKMgAipwKycPXpmuErUHMk5X-XgBQ@mail.gmail.com>
Date:	Tue, 2 Feb 2016 21:39:18 -0700
From:	Jeffrey Merkey <jeffmerkey@...il.com>
To:	Don Zickus <dzickus@...hat.com>
Cc:	linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
	atomlin@...hat.com, cmetcalf@...hip.com, fweisbec@...il.com,
	hidehiro.kawai.ez@...achi.com, mhocko@...e.cz, tj@...nel.org,
	uobergfe@...hat.com
Subject: Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection

On 2/2/16, Jeffrey Merkey <jeffmerkey@...il.com> wrote:
>> Because when you catch a bug in the hard lockup detector the system
>> just sits there hard hung and you are not able to get into a debugger
>> console since the system has crashed and the watchdog code has already
>> killed off the other processors and locked up all the NMI interrupt
>> handlers, thereby preventing any debugger at all from functioning
>> other than a hardware ICE, so it's a hell of a lot easier just to
>> trigger a break when you detect the first instance of a hard lockup
>> before the system is completely hosed.
>>
>
> So this is why Ingo and tglx's suggestion doesn't work.  Unless you
> can set a breakpoint in the detector code, once the lockup occurs
> about 50% of the time (when the IF flag is not set and interrupts are
> disabled), you can't get into a debugger because the system is hosed.
>
> The way the current hard lockup detector works is a lot like the death
> star self-destruct system for linux -- it detects one, tries to IPI
> the other processors to dump their stacks, then somewhere down in the
> OS all of it locks up -- once in a while I can get it to panic.  A
> great bug to test your detector with is the one in timekeeper.c tglx
> and I worked on.  Good luck getting into any debugger when it fires
> off.  I like the fact this code does not call panic and is somewhat
> dynamic allowing recovery of the system, but it takes a healthy system
> with a single bug, burns it to the ground, locks up all the
> processors, and prevents the debugger from being entered unless a
> breakpoint has been set.
>
> Perhaps this helps you understand.
>
> Jeff
>

And we could just call notify_die here instead and pass a faux
debugger exception.  That actually is clean and would work too.  Any
handlers out there will behave as though it's an int3 instruction.
Hmmm.  That's an easy patch and I could test it quickly.
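
To make the idea concrete, here is a rough userspace model of it, not the
patch itself.  Every model_* and MODEL_* name below is made up for
illustration; in the kernel the call would be notify_die() with DIE_INT3
and the pt_regs from the NMI, which this sketch leaves out.  It only shows
the control flow: the detector raises a fake breakpoint event, and whichever
registered handler claims it stops the chain, just as it would for a real
int3.

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace model only: these names mimic, but are not, the kernel API. */
#define MODEL_NOTIFY_DONE 0   /* nobody claimed the event */
#define MODEL_NOTIFY_STOP 1   /* a handler claimed it, stop walking the chain */
#define MODEL_DIE_INT3    3   /* hypothetical stand-in for the DIE_INT3 event */
#define MODEL_MAX_HANDLERS 4

struct model_die_args {
    const char *str;  /* human-readable reason */
    int trapnr;       /* trap vector reported to the handler */
    int signr;        /* signal number reported to the handler */
};

typedef int (*model_die_handler)(int event, struct model_die_args *args);

static model_die_handler model_chain[MODEL_MAX_HANDLERS];
static size_t model_chain_len;
static int model_debugger_entries;  /* counts entries into the fake debugger */

/* Append a handler to the chain; returns -1 when the chain is full. */
static int model_register_die_notifier(model_die_handler h)
{
    if (model_chain_len == MODEL_MAX_HANDLERS)
        return -1;
    model_chain[model_chain_len++] = h;
    return 0;
}

/* Walk the chain in registration order until a handler returns STOP. */
static int model_notify_die(int event, const char *str, int trapnr, int signr)
{
    struct model_die_args args = { str, trapnr, signr };

    for (size_t i = 0; i < model_chain_len; i++) {
        if (model_chain[i](event, &args) == MODEL_NOTIFY_STOP)
            return MODEL_NOTIFY_STOP;
    }
    return MODEL_NOTIFY_DONE;
}

/* A debugger-like handler: it treats the faux event exactly like an int3. */
static int model_debugger_handler(int event, struct model_die_args *args)
{
    if (event != MODEL_DIE_INT3)
        return MODEL_NOTIFY_DONE;
    model_debugger_entries++;
    printf("debugger entered: %s (trap %d)\n", args->str, args->trapnr);
    return MODEL_NOTIFY_STOP;
}

/*
 * What the detector would do on the first hard lockup: raise the faux
 * breakpoint before anything else gets torn down.  Vector 3 is the x86
 * breakpoint trap and 5 is SIGTRAP on Linux.
 */
static int model_hard_lockup_detected(void)
{
    return model_notify_die(MODEL_DIE_INT3, "hard lockup", 3, 5);
}
```

With no handler registered, model_hard_lockup_detected() returns
MODEL_NOTIFY_DONE and the detector would carry on as it does today.  Once
model_debugger_handler is registered it returns MODEL_NOTIFY_STOP, which is
the case where a debugger gets control before the system is hosed.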

Jeff