linux-kernel - [BUG] Soft-lockup during cpu-hotplug in VFS callpaths

Message-ID: <4E550057.9070609@linux.vnet.ibm.com>
Date:	Wed, 24 Aug 2011 19:14:55 +0530
From:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
To:	linux-kernel@...r.kernel.org
CC:	linux-fsdevel@...r.kernel.org, linux-pm@...ts.linux-foundation.org
Subject: [BUG] Soft-lockup during cpu-hotplug in VFS callpaths

Hi,

While running stressful cpu hotplug tests with a kernel compilation running
in the background, soft lockups are detected on multiple CPUs. Sometimes
this also leads to hard lockups and a kernel panic. All the soft lockups
seem to occur in vfsmount_lock_local_lock_cpu() or in other VFS callpaths.

[37108.410813] BUG: soft lockup - CPU#5 stuck for 22s! [cc1:29669]
<snip>
[37108.694781] Call Trace:
[37108.697306]  [<ffffffff81199e70>] ? vfsmount_lock_local_lock_cpu+0x70/0x70
[37108.704258]  [<ffffffff81187cb5>] path_init+0x315/0x400
[37108.709558]  [<ffffffff8127c398>] ? __raw_spin_lock_init+0x38/0x70
[37108.715812]  [<ffffffff8118961c>] path_openat+0x8c/0x3f0
[37108.721203]  [<ffffffff81012129>] ? sched_clock+0x9/0x10
[37108.726597]  [<ffffffff8109416d>] ? sched_clock_cpu+0xcd/0x110
[37108.732508]  [<ffffffff810a178d>] ? trace_hardirqs_off+0xd/0x10
[37108.738498]  [<ffffffff8109421f>] ? local_clock+0x6f/0x80
[37108.743970]  [<ffffffff81189a99>] do_filp_open+0x49/0xa0
[37108.749362]  [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
[37108.754665]  [<ffffffff8152584b>] ? _raw_spin_unlock+0x2b/0x40
[37108.760575]  [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
[37108.765875]  [<ffffffff81179607>] do_sys_open+0x107/0x1e0
[37108.771352]  [<ffffffff810d610f>] ? audit_syscall_entry+0x1bf/0x1f0
[37108.777695]  [<ffffffff81179720>] sys_open+0x20/0x30
[37108.782741]  [<ffffffff8152e202>] system_call_fastpath+0x16/0x1b

Kernel version: 3.0.1, 3.0.3
Hardware: Dual socket quad-core hyper-threaded Intel x86 machine
Scenario:
(a) Stressful cpu hotplug tests + kernel compilation

(b) IRQ balancing was disabled and all IRQs were routed to CPU 0
    (except those that could not be re-routed).

(c) Lockdep was enabled during kernel configuration.

Steps (b) and (c) were done to dig deeper into the issue. However, the same
issue was also observed with step (a) alone.
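For anyone trying to build a comparable setup, here is a minimal Python sketch of steps (a) and (b). It assumes the standard sysfs CPU hotplug files (/sys/devices/system/cpu/cpuN/online), the hex CPU masks in /proc/irq/N/smp_affinity, root privileges, and that the irqbalance daemon has already been stopped. The report does not say which tool was actually used, so the helper names (hotplug_stress, pin_irqs_to_cpu0, etc.) and the random offline/online pattern are hypothetical. The kernel build load is assumed to run separately.

```python
import os
import random
import time

# Assumed default locations; both are parameters so the logic can be
# exercised against a scratch directory instead of the live system.
SYSFS_CPU_ROOT = "/sys/devices/system/cpu"
PROC_IRQ_ROOT = "/proc/irq"


def online_path(cpu, root=SYSFS_CPU_ROOT):
    """Return the sysfs file that controls whether `cpu` is online."""
    return os.path.join(root, f"cpu{cpu}", "online")


def hotpluggable_cpus(root=SYSFS_CPU_ROOT):
    """List CPUs that expose an 'online' file (CPU 0 often does not)."""
    cpus = []
    for name in os.listdir(root):
        if name.startswith("cpu") and name[3:].isdigit():
            if os.path.exists(online_path(int(name[3:]), root)):
                cpus.append(int(name[3:]))
    return sorted(cpus)


def set_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write '1' (online) or '0' (offline) to the CPU's online file."""
    with open(online_path(cpu, root), "w") as f:
        f.write("1" if online else "0")


def hotplug_stress(iterations, root=SYSFS_CPU_ROOT, delay=0.0, rng=None):
    """Offline and re-online a randomly chosen CPU `iterations` times.

    Returns the list of (cpu, online) writes that were performed."""
    rng = rng or random.Random()
    cpus = hotpluggable_cpus(root)
    log = []
    if not cpus:
        return log
    for _ in range(iterations):
        cpu = rng.choice(cpus)
        for state in (False, True):
            set_online(cpu, state, root)
            log.append((cpu, state))
            time.sleep(delay)
    return log


def pin_irqs_to_cpu0(root=PROC_IRQ_ROOT):
    """Route every IRQ that accepts it to CPU 0; return the IRQs pinned."""
    pinned = []
    for name in os.listdir(root):
        path = os.path.join(root, name, "smp_affinity")
        if not name.isdigit() or not os.path.exists(path):
            continue
        try:
            with open(path, "w") as f:
                f.write("1")  # hex CPU mask: only bit 0 (CPU 0) set
            pinned.append(int(name))
        except OSError:
            pass  # some IRQs cannot be re-routed; leave them alone
    return sorted(pinned)
```

On a test machine one might call pin_irqs_to_cpu0() once and then hotplug_stress(10000, delay=0.1) while the build runs. The iteration count and delay are illustrative values, not the ones used in the report.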

There definitely seems to be a race condition here: the issue is hit only
some time after the tests start, and the time it takes to hit it increases
as more debug print statements are added. In some cases (especially when
there were many debug print statements), the stress on the machine had to
be increased to hit the issue within a measurable time. In my tests, at
most about 2 to 2.5 hours was sufficient to hit this bug.
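The report does not pin down the race. As a purely hypothetical illustration of how cpu hotplug can interact badly with per-CPU locking, here is a toy model. The vfsmount_lock_local_lock_cpu frame in the trace suggests a per-CPU lock, but the model itself is an assumption. It supposes a writer that takes the lock of every CPU online at lock time, and later releases the lock of every CPU online at unlock time. PerCpuLock, writer_lock_all and writer_unlock_all are invented names, not kernel APIs.

```python
class PerCpuLock:
    """Toy model: one boolean per CPU standing in for a per-CPU spinlock."""

    def __init__(self, ncpus):
        self.held = [False] * ncpus

    def lock_cpu(self, cpu):
        # A real spinlock would spin here; the model just reports failure.
        if self.held[cpu]:
            return False
        self.held[cpu] = True
        return True

    def unlock_cpu(self, cpu):
        self.held[cpu] = False


def writer_lock_all(lock, online):
    # Writer takes the lock of every CPU that is online *right now*.
    for cpu in sorted(online):
        lock.lock_cpu(cpu)


def writer_unlock_all(lock, online):
    # ...and releases the lock of every CPU that is online *right now*,
    # which need not be the same set if a CPU was hot-unplugged meanwhile.
    for cpu in sorted(online):
        lock.unlock_cpu(cpu)


def demo():
    lock = PerCpuLock(4)
    online = {0, 1, 2, 3}
    writer_lock_all(lock, online)
    online.discard(3)              # CPU 3 goes offline while the writer holds the lock
    writer_unlock_all(lock, online)
    online.add(3)                  # CPU 3 comes back online
    # A reader on CPU 3 finds its lock still held; a real spinlock would spin forever.
    return lock.lock_cpu(3)
```

In this model demo() returns False: CPU 3's lock is leaked, and any later reader on that CPU would spin indefinitely, which would look like a soft lockup. The opposite ordering (a CPU coming online between the two calls) would let a reader on that CPU proceed while the writer believes it has exclusive access. Whether either scenario actually occurs in the 3.0 VFS code is not established by this report.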

Please find the console log attached to this mail.

Any ideas on how to go about fixing this bug?

-- 
Regards,
Srivatsa S. Bhat  <srivatsa.bhat@...ux.vnet.ibm.com>
Linux Technology Center,
IBM India Systems and Technology Lab

Attachment: "soft-lockup_log.txt" (text/plain, 58725 bytes)