Message-ID: <20100708205113.GA15117@anarazel.de>
Date:	Thu, 8 Jul 2010 22:51:13 +0200
From:	Andres Freund <andres@...razel.de>
To:	linux-kernel@...r.kernel.org
Subject:  INFO: rcu_sched_state detected stall on CPU 

Hi all,

I recently got a dual-socket E5520 system (only one CPU installed right
now; the problems were the same with both, though) on which I regularly
get errors like

[  288.281073] INFO: rcu_sched_state detected stall on CPU 1 (t=5890 jiffies)
[  288.281086] INFO: rcu_sched_state detected stall on CPU 5 (t=5890 jiffies)
[  288.281087] sending NMI to all CPUs:
[  288.281096] sending NMI to all CPUs:

After deactivating all power-saving mechanisms it seems to have gotten
a bit more stable, but it still crashes pretty reliably under
I/O load. Graphics-intensive work also seems to be able to trigger it
reliably. The crashes also occurred with the cheap on-board Intel
graphics card.

Without the RCU debugging that produces the messages above, I pretty
regularly get hangs or missed inputs, at times ending fatally
(no SysRq, no keyboard reaction).

Normally I would try to do a bisect, but in this case I am in the
unfortunate situation that with earlier kernels I get problems with
other hardware (particularly the SAS controller, which currently holds
the only disks). So I have no known-good version to start from.
Perhaps you have an idea?

dmesg of different, likely related crashes, lspci -v and my latest
.config are attached.

As I am not sure what kernel code is actually causing the problem -
the backtraces looked innocent enough at a short, clueless glance - I
don't know whom to explicitly CC.

Some small additional data points: using latencytop I see latencies in
the range of seconds for various things (creating md request, creating
block layer request, radeon_fence_wait).
The problems seem to have become more frequent after I enabled lockdep
and RCU debugging - possibly this simply makes the race more likely?

Thanks,

Andres

View attachment "dmesg" of type "text/plain" (152520 bytes)

View attachment "dmesg" of type "text/plain" (147173 bytes)

View attachment "lspci" of type "text/plain" (23778 bytes)

View attachment ".config" of type "text/plain" (72498 bytes)
