linux-kernel - Re: [PATCH] softirq softlockup debugging

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <19f34abd0806240445i6282c49cia22a700140dd27eb@mail.gmail.com>
Date:	Tue, 24 Jun 2008 13:45:27 +0200
From:	"Vegard Nossum" <vegard.nossum@...il.com>
To:	"Johannes Weiner" <hannes@...urebad.de>
Cc:	a.p.zijlstra@...llo.nl, arjan@...ux.intel.com,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] softirq softlockup debugging

On 6/24/08, Johannes Weiner <hannes@...urebad.de> wrote:
> After more staring at the code in question, I think that the approach is
> not correct (or I didn't understand it, which is not unlikely).
>
> I hunted down the address of the traces from kerneloops.org
> (__do_softirq+0x6d) on a kernel image with a fedora config and it's at
> the local_irq_enable() right after the restart:label in __do_softirq().
>

Are you quite sure? I didn't use the fedora image, but I compiled with
the fedora config and got a function size for __do_softirq that was
exactly equal to some of the reports. And there, the EIP pointed to
the second "pop" instruction after calling... oh, true.
local_irq_enable(). And those pops come from calling into paravirt
ops.

Right, I fully agree.

> So if the softirq handler had disabled interrupts, the softlockup would
> have been detected still within the handler (when it reenables irqs and
> the timer irq runs) and the stackframe should be there.
>
> do_softirq()
>  local_irq_save()                      1)
>  local_softirq_pending()
>  __do_softirq()
>   restart:                             2)
>    local_irq_enable()                  3)
>    run a handler
>    local_irq_disable()                 4)
>    jnz restart
>
> So the lockup must be caused somewhere
>  between 1) and 3)
> or
>  between 4) and 3) [when we jump back]
>
> These functions are in the path and possible candidates for causing it:
>
> - local_softirq_pending()
> - account_system_vtime()
> - __local_bh_disable()
> - trace_softirq_enter()
> - smp_processor_id()
> - set_softirq_pending()
>
> What do you think?  You said you actually used your patch already for
> debugging lockups in softirq handlers, so it confuses me why the
> stackframe of the handler was no longer present.

What didn't make sense to me when first looking at the oops reports
was that there was nowhere inside __do_softirq() that could actually
lock up. As far as I can see, there is no infinite (or very
long-running) loop in there, so I concluded (probably wrongly) that it
must be the handler; it was the only logical explanation I could find.
I believe that the functions you listed all run in constant time, but
maybe we should check it.

It also seems that the timer interrupt (where softlockup detection
kicks in) has a small delay after irqs are re-enabled before it
actually interrupts the CPU. Can you post a disassembly of
__do_softirq() where you pinpoint the exact instruction that was
interrupted, e.g. in this case __do_softirq+0x6d? Again, for me, this
was a pop %ebx or so, which doesn't change interruptibility, so maybe
I simply had the wrong disassembly.

(Actually, I didn't use the patch for debugging yet; I ran with the
patch applied, but I was unable to reproduce the lockup. But KOSAKI
Motohiro found some theoretical ABBA deadlock in cpusets vs. cpu
hotplug that could be the reason for it.)

Thanks for the analysis; I am very curious to see which code that
locks up and how I failed to notice where it was.

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/