Message-ID: <CAO6TR8VB0P3AMspftW1g6xixAKpEevoF4bXTbMjimaGL5HFhVA@mail.gmail.com>
Date: Mon, 14 Dec 2015 20:18:56 -0700
From: Jeff Merkey <linux.mdb@...il.com>
To: Don Zickus <dzickus@...hat.com>
Cc: LKML <linux-kernel@...r.kernel.org>, akpm@...ux-foundation.org,
uobergfe@...hat.com, atomlin@...hat.com, cmetcalf@...hip.com,
fweisbec@...il.com
Subject: Re: [PATCH 1/1] Fix HARD Lockup Firing off while in debugger
On 12/14/15, Jeff Merkey <linux.mdb@...il.com> wrote:
> On 12/14/15, Don Zickus <dzickus@...hat.com> wrote:
>> On Sat, Dec 12, 2015 at 02:08:13PM -0700, Jeff Merkey wrote:
>>> The current touch_nmi_watchdog() function in kernel/watchdog.c does
>>> not catch all cases when a processor is spinning in the NMI
>>> handler inside either KGDB, KDB, or MDB. The hrtimer_interrupts_saved
>>> count can still end up matching the previous value in some cases,
>>> resulting in the hard lockup detector tagging processors inside a
>>
>> Hi Jeff,
>>
>> I am confused here. touch_nmi_watchdog() was supposed to block the
>> check for hrtimer_interrupts from happening. So if the check is still
>> being executed _after_ you executed touch_nmi_watchdog(), it would
>> imply that about 10 seconds or so elapsed between the touch command
>> and the hrtimer check.
>>
>> So I am not sure how the below patch would fix this, other than by adding
>> another 10 second delay (for a total of 20 seconds) to your timeout?
>>
>>
>>> debugger and executing a panic. The patch below corrects this
>>> problem. I did not add this to the touch_nmi_watchdog() function
>>> directly because of possible effects on timing.
>>>
>>> I have tested this patch, and it stops errant hard lockup events from
>>> firing while processors are spinning inside a kernel debugger.
>>
>> The kernel doesn't normally take patches like this without a corresponding
>> user, which I didn't see attached in this patch or a patch series.
>>
>> Cheers,
>> Don
>>
>
> I'll resend the patch series properly formatted and clean. There is
> a hole in there somewhere that causes this bug. You can reproduce it
> by downloading the mdb debugger, patching linux, building it, then
> removing the call to this function while spinning in the debugger with
> a breakpoint on schedule() set from the debugger console. Without the
> function I have suggested, the hard lockup detector fires in about 20
> seconds.
>
> You can download the debugger here.
>
> https://github.com/jeffmerkey/linux-stable/compare/v4.3.2...jeffmerkey:mdb-v4.3.2.diff
>
> To reproduce it easily, apply this patch to kernel v4.3.2 and, before
> you build, remove the call to touch_hardlockup_watchdog() in
> mdb_watchdogs() in arch/x86/kernel/debug/mdb/mdb-main.c.
>
> I'll format another patch, a clean one this time. I apologize.
>
> Jeff
>
Oh, and don't forget to type "g" for go after setting the schedule()
breakpoint. This will reload all the processors and cause them to
break into the debugger and be held by the debugger at the int1 exception.
This is when touch_nmi_watchdog() breaks.
You also need to do this on an SMP system, preferably one with 4 or
more processors. It's an SMP bug.
Jeff
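
For anyone following the thread, here is a rough userspace model of the
check being discussed. It is only a sketch of my reading of the v4.3
logic in kernel/watchdog.c, not the kernel code itself: the struct and
the model_* function names are made up for illustration, and per-CPU
storage, the perf NMI plumbing, and the real timing are all left out.
The field names mirror the counters mentioned above.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; the kernel keeps these as per-CPU variables. */
struct cpu_watchdog {
	unsigned long hrtimer_interrupts;       /* bumped on every hrtimer tick */
	unsigned long hrtimer_interrupts_saved; /* value seen by the last NMI check */
	bool nmi_touch;                         /* set by a touch, cleared by the next check */
};

/* Models the hrtimer firing; this cannot happen while the CPU has interrupts off. */
static void model_hrtimer_tick(struct cpu_watchdog *w)
{
	w->hrtimer_interrupts++;
}

/* Models what touch_nmi_watchdog() does for one CPU: ask to skip the next check. */
static void model_touch(struct cpu_watchdog *w)
{
	w->nmi_touch = true;
}

/*
 * Models one periodic NMI check.
 * Returns true when a hard lockup would be reported.
 */
static bool model_nmi_check(struct cpu_watchdog *w)
{
	if (w->nmi_touch) {
		/* A pending touch masks exactly one check. */
		w->nmi_touch = false;
		return false;
	}

	/* No hrtimer tick since the previous check: report a lockup. */
	if (w->hrtimer_interrupts_saved == w->hrtimer_interrupts)
		return true;

	w->hrtimer_interrupts_saved = w->hrtimer_interrupts;
	return false;
}
```

In this model a single touch masks only the next check. A processor that
stays parked with no hrtimer ticks and is not touched again is reported
on the check after that, which is consistent with Don's point that a
touch buys roughly one more check period rather than an indefinite hold.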