linux-kernel - Re: [PATCH] softirq softlockup debugging

Message-ID: <87prq79kty.fsf@skyscraper.fehenstaub.lan>
Date:	Tue, 24 Jun 2008 12:41:13 +0200
From:	Johannes Weiner <hannes@...urebad.de>
To:	Vegard Nossum <vegard.nossum@...il.com>
Cc:	a.p.zijlstra@...llo.nl, arjan@...ux.intel.com,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] softirq softlockup debugging

Hi Vegard,

Vegard Nossum <vegard.nossum@...il.com> writes:

> Hi,
>
> I'm debugging a problem with a softirq that gets stuck for a long time,
> so I wrote this patch to help find out what's going wrong.
>
> I actually think it can be useful in general as well, see for example
> http://www.kerneloops.org/search.php?search=__do_softirq&btnG=Function+Search
>
> ..and these cases are virtually impossible to debug since we don't know
> anything about *what* it was that got stuck. (The NMI watchdog could
> help, though.)
>
> The patch is #ifdef-ugly, I know... Suggestions are welcome.
>
>
> Vegard
>
>
> From: Vegard Nossum <vegard.nossum@...il.com>
> Date: Sun, 22 Jun 2008 14:12:31 +0200
> Subject: [PATCH] softirq softlockup debugging
>
> From the Kconfig: If a softlockup happens in a softirq, the softlockup
> stack trace is utterly unhelpful; it will only show the stack up to
> __do_softirq(), since this is where interrupts are reenabled.

After more staring at the code in question, I think that the approach is
not correct (or I didn't understand it, which is not unlikely).

I tracked down the address from the kerneloops.org traces
(__do_softirq+0x6d) in a kernel image built with a Fedora config, and it
is the local_irq_enable() right after the restart: label in
__do_softirq().

So if the softirq handler had disabled interrupts, the softlockup would
still have been detected within the handler (when it reenables irqs and
the timer irq runs), and the handler's stack frame should be in the trace.

do_softirq()
  local_irq_save()			1)
  local_softirq_pending()
  __do_softirq()
   restart:				2)
    local_irq_enable()			3)
    run a handler
    local_irq_disable()			4)
    jnz restart

So the lockup must be caused somewhere
  between 1) and 3)
or
  between 4) and 3) [when we jump back]
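
To make the two windows concrete, here is a rough user-space model of the
control flow outlined above. It is only a sketch, not kernel code: irqs_on,
note(), count_events() and the model_* functions are names invented for
illustration, the handler is reduced to a single logged step, and the
restore at the end of model_do_softirq() stands in for the
local_irq_restore() that pairs with local_irq_save().

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only. All names below are hypothetical. */
#define MAX_EVENTS 32

struct event {
	const char *what;	/* which step of the outline we are at */
	bool irqs_on;		/* modelled interrupt state at that step */
};

static bool irqs_on = true;	/* stands in for the CPU interrupt flag */
static struct event events[MAX_EVENTS];
static int nlog;

/* Record the current step together with the modelled irq state. */
static void note(const char *what)
{
	if (nlog < MAX_EVENTS) {
		events[nlog].what = what;
		events[nlog].irqs_on = irqs_on;
		nlog++;
	}
}

/* Count logged steps that ran with irqs on (or off). */
static int count_events(bool on)
{
	int i, n = 0;

	for (i = 0; i < nlog; i++)
		if (events[i].irqs_on == on)
			n++;
	return n;
}

/* Models the loop in __do_softirq(); 'rounds' = passes with work pending. */
static void model_inner_do_softirq(int rounds)
{
restart:				/* 2) */
	assert(!irqs_on);		/* we always arrive here with irqs off */
	note("restart");
	irqs_on = true;			/* 3) local_irq_enable() */
	note("handler");		/* handlers run with irqs on */
	irqs_on = false;		/* 4) local_irq_disable() */
	note("recheck pending");
	if (--rounds > 0)
		goto restart;		/* the "jnz restart" in the outline */
}

/* Models do_softirq(): irqs are off from 1) until 3). */
static void model_do_softirq(int rounds)
{
	bool saved = irqs_on;

	irqs_on = false;		/* 1) local_irq_save() */
	note("local_softirq_pending");
	if (rounds > 0)
		model_inner_do_softirq(rounds);
	irqs_on = saved;		/* models local_irq_restore() */
}
```

Calling model_do_softirq(2) and walking events[] shows that only the
"handler" steps are logged with irqs on. Every other step falls into one of
the two windows listed above, which is where a timer interrupt cannot fire
until 3) is reached.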

These functions are in the path and possible candidates for causing it:

- local_softirq_pending()
- account_system_vtime()
- __local_bh_disable()
- trace_softirq_enter()
- smp_processor_id()
- set_softirq_pending()

What do you think?  You said you had already used your patch to debug
lockups in softirq handlers, so I am confused about why the handler's
stack frame was no longer present.

	Hannes