linux-kernel - 5.14-rc1: BUG: workqueue lockup

Message-ID: <CAJCQCtSZCCPLZpVX_bQ4Nnqhw1d2sgx_MeK9ZkW8iHqnZTdc5Q@mail.gmail.com>
Date:   Tue, 13 Jul 2021 11:04:41 -0600
From:   Chris Murphy <lists@...orremedies.com>
To:     linux-kernel <linux-kernel@...r.kernel.org>
Subject: 5.14-rc1: BUG: workqueue lockup

Hi,

[    0.000000] kernel: Linux version 5.14.0-0.rc1.16.fc35.x86_64+debug
(mockbuild@...rnel01.iad2.fedoraproject.org) (gcc (GCC) 11.1.1
20210623 (Red Hat 11.1.1-6), GNU ld version 2.36.1-15.fc35) #1 SMP Mon
Jul 12 14:29:14 UTC 2021

Lenovo Thinkpad X1

Sequence of events: boot seemed normal (I never went looking for, or
noticed, the early splats and lockdep warnings related to bluetooth).
I worked for a couple of hours, put the laptop in s2idle, resumed
work, and then ran 'stress-ng -c8' in Terminal. The whole system
immediately became unresponsive. Not even the mouse pointer would
move, and it wasn't possible to ssh into the laptop. During the 10
minutes that followed, some desktop UI did change, so the graphical
environment was still working, but it lagged substantially behind the
inputs.

I've been seeing workqueue lockups in Fedora openQA testing with VMs
that do not have bluetooth. Therefore I think the usb and bluetooth
related splat and lockdep warning early on have nothing to do with the
later workqueue lockup.

dmesg (2 week expiration)
https://pastebin.com/zgkLiSkp

This excerpt is from the full log, just as a marker for when stress-ng
was started:
[ 6448.192901] stress-ng[6238]: invoked with 'stress-n' by user 1000

These are the first kernel messages to appear following loss of
control (responsiveness):

[ 6485.133492] kernel: perf: interrupt took too long (2540 > 2500),
lowering kernel.perf_event_max_sample_rate to 78000
[ 6503.012190] kernel: BUG: workqueue lockup - pool cpus=0 node=0
flags=0x0 nice=0 stuck for 54s!
[ 6503.012206] kernel: BUG: workqueue lockup - pool cpus=1 node=0
flags=0x0 nice=0 stuck for 53s!
[ 6503.012213] kernel: BUG: workqueue lockup - pool cpus=2 node=0
flags=0x0 nice=0 stuck for 36s!
[ 6503.012219] kernel: BUG: workqueue lockup - pool cpus=3 node=0
flags=0x0 nice=0 stuck for 53s!
[ 6503.012226] kernel: BUG: workqueue lockup - pool cpus=5 node=0
flags=0x0 nice=0 stuck for 38s!

So roughly 55 seconds passed between the stress-ng start marker
(6448.19) and the first BUG reported by the kernel (6503.01), but loss
of control happened from the time stress-ng was run.
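
For anyone who wants to check the timing against the full dmesg, here
is a minimal Python sketch that does the arithmetic on the lines quoted
above. It assumes the wrapped log lines have been rejoined, and it
assumes "stuck for Ns" can be read as "this pool made no progress for
about N whole seconds", so subtracting N from the report timestamp only
gives a rough estimate of when each pool stalled. The names LOG, TS,
LOCKUP and parse are made up for this example.

```python
import re

# Lines copied from the excerpt above, with mail-wrapped lines rejoined.
LOG = """\
[ 6448.192901] stress-ng[6238]: invoked with 'stress-n' by user 1000
[ 6503.012190] kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 54s!
[ 6503.012206] kernel: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012213] kernel: BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 36s!
[ 6503.012219] kernel: BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012226] kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 38s!
"""

# Leading "[ seconds.micros]" timestamp.
TS = re.compile(r"^\[\s*(\d+\.\d+)\]")
# Pool CPU list and the reported stall duration in seconds.
LOCKUP = re.compile(r"workqueue lockup - pool cpus=(\S+) .*stuck for (\d+)s!")


def parse(log):
    """Return (stress-ng start time, [(report time, cpus, stuck seconds)])."""
    start = None
    lockups = []
    for line in log.splitlines():
        m = TS.match(line)
        if not m:
            continue
        t = float(m.group(1))
        if start is None and "stress-ng[" in line:
            start = t
        hit = LOCKUP.search(line)
        if hit:
            lockups.append((t, hit.group(1), int(hit.group(2))))
    return start, lockups


start, lockups = parse(LOG)

# Gap between the stress-ng marker and the first lockup report.
elapsed = lockups[0][0] - start
print(f"stress-ng start to first BUG: {elapsed:.1f}s")

# Rough estimate of when each pool stalled, relative to the stress-ng start.
onsets = [t - stuck - start for t, _, stuck in lockups]
for (t, cpus, stuck), onset in zip(lockups, onsets):
    print(f"pool cpus={cpus}: stalled about {onset:.1f}s after stress-ng start")
```

On these six lines the gap comes out at about 54.8 seconds, and the
earliest estimated stall is under a second after the stress-ng marker,
which matches losing control right when stress-ng was started.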

Kernel config:
https://pastebin.com/QzvEy1sQ

-- 
Chris Murphy