Message-ID: <CAJCQCtSZCCPLZpVX_bQ4Nnqhw1d2sgx_MeK9ZkW8iHqnZTdc5Q@mail.gmail.com>
Date:   Tue, 13 Jul 2021 11:04:41 -0600
From:   Chris Murphy <lists@...orremedies.com>
To:     linux-kernel <linux-kernel@...r.kernel.org>
Subject: 5.14-rc1: BUG: workqueue lockup

Hi,

[    0.000000] kernel: Linux version 5.14.0-0.rc1.16.fc35.x86_64+debug
(mockbuild@...rnel01.iad2.fedoraproject.org) (gcc (GCC) 11.1.1
20210623 (Red Hat 11.1.1-6), GNU ld version 2.36.1-15.fc35) #1 SMP Mon
Jul 12 14:29:14 UTC 2021

Lenovo Thinkpad X1

Sequence of events: boot seemed normal (at the time I didn't look for
or notice the early splats and lockdep warnings related to bluetooth).
I worked for a couple of hours, put the laptop in s2idle, resumed work,
and then ran 'stress-ng -c8' in Terminal. Immediately the whole system
became unresponsive: not even the mouse pointer would move, and it
wasn't possible to ssh into the laptop. During the 10 minutes, some
desktop UI did change, so the graphical environment was still working
but substantially delayed relative to the inputs.

I've been seeing workqueue lockups in Fedora openQA testing with VMs
that do not have bluetooth. Therefore I think the usb- and
bluetooth-related splats and lockdep warnings early on have nothing to
do with the later workqueue lockup.

dmesg (2 week expiration)
https://pastebin.com/zgkLiSkp

This excerpt is from the full log, just as a marker for when stress-ng
was started:
[ 6448.192901] stress-ng[6238]: invoked with 'stress-n' by user 1000

These are the first kernel messages to appear following loss of control
(responsiveness):

[ 6485.133492] kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
[ 6503.012190] kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 54s!
[ 6503.012206] kernel: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012213] kernel: BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 36s!
[ 6503.012219] kernel: BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012226] kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 38s!

So roughly 55 seconds passed between starting stress-ng (6448.19) and
the first BUG report from the kernel (6503.01), but loss of control
happened from the time stress-ng was run.
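
The timing in the excerpts above can be checked with a short script.
This is only a sketch: the log text is copied from this message with
the wrapped lines rejoined, the helper names (parse, summarize) are
made up for illustration, and it assumes that "stuck for Ns" means the
pool made no progress for the N seconds before the message was printed.

```python
import re

# Log excerpt copied from this message, with wrapped lines rejoined.
LOG = """\
[ 6448.192901] stress-ng[6238]: invoked with 'stress-n' by user 1000
[ 6485.133492] kernel: perf: interrupt took too long (2540 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
[ 6503.012190] kernel: BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 54s!
[ 6503.012206] kernel: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012213] kernel: BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 36s!
[ 6503.012219] kernel: BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 53s!
[ 6503.012226] kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 38s!
"""

# "[ 6503.012190] message" -> (timestamp, message)
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Capture the pool's cpus field (kept as text) and the stuck duration.
LOCKUP_RE = re.compile(
    r"BUG: workqueue lockup - pool cpus=(\S+) .* stuck for (\d+)s!"
)


def parse(log):
    """Yield (timestamp_seconds, message) for each timestamped line."""
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield float(m.group(1)), m.group(2)


def summarize(log):
    """Return (stress-ng start time, list of (ts, cpus, stuck_seconds))."""
    start = None
    lockups = []
    for ts, msg in parse(log):
        if start is None and msg.startswith("stress-ng") and "invoked" in msg:
            start = ts
        m = LOCKUP_RE.search(msg)
        if m:
            lockups.append((ts, m.group(1), int(m.group(2))))
    return start, lockups


start, lockups = summarize(LOG)

# Time from starting stress-ng to the first lockup report.
delay = lockups[0][0] - start

# Assumption: each pool stalled at (report time - stuck duration).
earliest_stall = min(ts - stuck for ts, _cpus, stuck in lockups)

print(f"stress-ng started at       {start:.2f}")
print(f"first lockup reported at   {lockups[0][0]:.2f} (+{delay:.2f}s)")
print(f"earliest estimated stall   {earliest_stall:.2f} "
      f"(+{earliest_stall - start:.2f}s after start)")
```

Under that assumption, the first report comes about 54.8 seconds after
stress-ng was invoked, and the pool on cpu 0 would have stalled within
about a second of the start, which matches the immediate loss of
responsiveness.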

Kernel config:
https://pastebin.com/QzvEy1sQ

-- 
Chris Murphy
