Message-ID: <46420F17.2050106@goop.org>
Date:	Wed, 09 May 2007 11:12:39 -0700
From:	Jeremy Fitzhardinge <jeremy@...p.org>
To:	Valdis.Kletnieks@...edu
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org
Subject: Re: 2.6.21-mm2 - 100% CPU on ksoftirqd/1

Valdis.Kletnieks@...edu wrote:
> It comes up with a screaming ksoftirqd - usually /1 but one boot had /0.
> Just sitting there, 100% CPU according to 'top'.  Tried 'echo t > /proc/sysrq-trigger' to get
> a trace, but it was always running on the other CPU - even after I reniced
> it down to 19 and launched 2 'for(;;)' C programs to suck the cycles.  It would
> be failing to get any CPU - until I did the 'echo t' and then it would be
> "running" again.  Anybody got any good debugging ideas here?
>   

Huh.  I've just been trying to find a problem with events/1 spinning
after boot in current -git.  I was worried I'd introduced it, but it
looks like there might be a wider problem.

I tracked it down to kernel/workqueue.c:run_workqueue(), where the while
(!list_empty(&cwq->worklist)) loop spins forever.  It turns out the
list_empty() test keeps returning false: cwq->worklist.next points at a
self-pointing list node (an entry whose next and prev point back at the
node itself rather than into the list), so the loop can never empty the
list.
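To make that concrete, here's a stand-alone user-space sketch (hand-rolled
list helpers, not the actual kernel source) of the state I'm describing:
the head's next points at a node whose next/prev point back at the node
itself, so list_empty() on the head never becomes true:

/* sketch only: simplified list_head and list_empty() */
#include <stdio.h>
#include <stdbool.h>

struct list_head { struct list_head *next, *prev; };

static bool list_empty(const struct list_head *head)
{
    return head->next == head;
}

int main(void)
{
    struct list_head worklist, node;

    /* the entry is self-pointing, as if it had just been (re)initialised */
    node.next = node.prev = &node;

    /* but the workqueue head still points at it */
    worklist.next = worklist.prev = &node;

    /* run_workqueue()'s exit condition never becomes true */
    printf("list_empty(&worklist) = %d\n", list_empty(&worklist));
    return 0;
}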

I haven't found how it was getting into that state, but the use of
list_del_init(cwq->worklist.next) looks suspect to me.
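For reference, list_del_init() splices the entry out of the list and then
re-initialises the entry to point at itself.  If the entry sitting at
worklist.next is already self-pointing, that splice only rewrites the
entry's own pointers and worklist.next never changes, which would explain
the spin.  A minimal sketch with simplified stand-ins for the helpers
(assuming the usual include/linux/list.h semantics, not the file itself):

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

/* same pointer updates as __list_del() followed by INIT_LIST_HEAD() */
static void list_del_init(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    INIT_LIST_HEAD(entry);
}

int main(void)
{
    struct list_head worklist, node;

    node.next = node.prev = &node;          /* the bad, self-pointing entry */
    worklist.next = worklist.prev = &node;  /* head still points at it */

    for (int i = 0; i < 3; i++) {
        list_del_init(worklist.next);
        /* worklist.next is still &node on every iteration */
        printf("pass %d: worklist.next == &node? %d\n",
               i, worklist.next == &node);
    }
    return 0;
}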

Curiously, it seems to escape this loop after about 4-5 minutes and is
well-behaved from then on.  And it doesn't always happen.

And, of course, there have been no changes to kernel/workqueue.c since
February, so whatever changed was somewhere else, and I guess it could
have affected kevent in the same way.

It always seems to happen on the last CPU; when I run the kernel as a Xen
guest with 4 vcpus, it happens on events/3.

    J