Message-ID: <20160203201453.GV26637@redhat.com>
Date: Wed, 3 Feb 2016 15:14:53 -0500
From: Don Zickus <dzickus@...hat.com>
To: Jeffrey Merkey <jeffmerkey@...il.com>
Cc: linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
atomlin@...hat.com, cmetcalf@...hip.com, fweisbec@...il.com,
hidehiro.kawai.ez@...achi.com, mhocko@...e.cz, tj@...nel.org,
uobergfe@...hat.com
Subject: Re: [PATCH v5 3/3] Add BUG_XX() debugging hard/soft lockup detection

On Wed, Feb 03, 2016 at 10:23:42AM -0700, Jeffrey Merkey wrote:
> > Hmm, I am confused here. So you are saying because we are in the nmi
> > handler you can not break into the system? The nmi handler prints some
> > stuff to the screen, pokes the other cpus to print stuff to the screen and
> > then returns to a normal operation. Unless you are saying the act of
> > sending NMI IPIs never completes (because a cpu is blocking IPI interrupts),
> > so the cpu hangs in nmi context and the debugger never has a chance to
> > 'break' in and see what is going on?
> >
> > Cheers,
> > Don
> >
>
> Yes. The nmi handlers never complete for the bug I worked on with
> tglx, probably because an nmi handler is calling timekeeper.c
> somewhere. Some of these lockup bugs may be calling code from the nmi
> handlers that cause the lockup condition in the first place in some
> cases, so it will never reach a call to panic. Looking over this code
> it's damn hard to find a good way to do this that works across all the
> arches without adding another macro to bug.h (BREAK_ON maybe), so I
> just used one that's already there. I'll go back and rethink this
> some more. It could just be as simple as calling panic from the first
> detection -- that works.

So, if you disable 'sysctl_hardlockup_all_cpu_backtrace' and enable
'hardlockup_panic', you should be able to achieve what you want, no?
But you mentioned you wanted to recover? Hence avoiding the panic?
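
For reference, that combination could be written as a sysctl fragment along
these lines. This is only a sketch: it assumes both knobs are exposed under
kernel.* on the running kernel, and the file name is hypothetical.

```
# /etc/sysctl.d/90-hardlockup.conf (hypothetical file name)
# Assumes the running kernel exposes both knobs under kernel.*.

# Skip the NMI IPIs that ask the other cpus to dump backtraces
# when a hard lockup is detected.
kernel.hardlockup_all_cpu_backtrace = 0

# Panic as soon as a hard lockup is detected.
kernel.hardlockup_panic = 1
```

The same values can be set at runtime with 'sysctl -w' instead of a file.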

Cheers,
Don