linux-kernel - Re: [cxl:for-7.0/cxl-init] [dax/hmem, e820, resource] bc62f5b308: BUG:soft_lockup-CPU##stuck

Message-ID: <69728632c464b_1d33100dd@dwillia2-mobl4.notmuch>
Date: Thu, 22 Jan 2026 12:18:58 -0800
From: <dan.j.williams@...el.com>
To: kernel test robot <oliver.sang@...el.com>, Dan Williams
	<dan.j.williams@...el.com>
CC: <oe-lkp@...ts.linux.dev>, <lkp@...el.com>, Alison Schofield
	<alison.schofield@...el.com>, Vishal Verma <vishal.l.verma@...el.com>, "Ira
 Weiny" <ira.weiny@...el.com>, Dan Williams <dan.j.williams@...el.com>,
	<linux-cxl@...r.kernel.org>, Dave Jiang <dave.jiang@...el.com>, "Smita
 Koralahalli" <Smita.KoralahalliChannabasappa@....com>,
	<linux-kernel@...r.kernel.org>, <nvdimm@...ts.linux.dev>,
	<oliver.sang@...el.com>
Subject: Re: [cxl:for-7.0/cxl-init] [dax/hmem, e820, resource] bc62f5b308:
 BUG:soft_lockup-CPU##stuck_for#s![kworker:#:#]

kernel test robot wrote:
> 
> 
> Hello,
> 
> FYI. we don't have enough knowledge to understand how the issues we found
> in the tests are related with the code. we just run the tests up to 200 times
> for both this commit and parent, noticed there are various random issues on
> this commit, but always clean on parent.
> 
> 
> =========================================================================================
> tbox_group/testcase/rootfs/kconfig/compiler/sleep:
>   vm-snb/boot/debian-11.1-i386-20220923.cgz/i386-randconfig-141-20260117/gcc-14/1
> 
> 29317f8dc6ed601e bc62f5b308cbdedf29132fe96e9
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :200          2%           5:200   dmesg.BUG:soft_lockup-CPU##stuck_for#s![kworker##:#]
>            :200          2%           5:200   dmesg.BUG:soft_lockup-CPU##stuck_for#s![kworker:#:#]
>            :200          8%          17:200   dmesg.BUG:soft_lockup-CPU##stuck_for#s![swapper:#]
>            :200          2%           4:200   dmesg.BUG:workqueue_lockup-pool
>            :200          0%           1:200   dmesg.EIP:__schedule
>            :200          0%           1:200   dmesg.EIP:_raw_spin_unlock_irq
>            :200          2%           4:200   dmesg.EIP:_raw_spin_unlock_irqrestore
>            :200          6%          11:200   dmesg.EIP:console_emit_next_record
>            :200          0%           1:200   dmesg.EIP:finish_task_switch
>            :200          3%           6:200   dmesg.EIP:lock_acquire
>            :200          1%           2:200   dmesg.EIP:lock_release
>            :200          1%           2:200   dmesg.EIP:queue_work_on
>            :200          0%           1:200   dmesg.EIP:rcu_preempt_deferred_qs_irqrestore
>            :200          1%           2:200   dmesg.EIP:timekeeping_notify
>            :200          0%           1:200   dmesg.INFO:rcu_preempt_detected_stalls_on_CPUs/tasks
>            :200          0%           1:200   dmesg.INFO:task_blocked_for_more_than#seconds
>            :200         14%          27:200   dmesg.Kernel_panic-not_syncing:softlockup:hung_tasks
> 
> below is full report.

This is good data, but I do not know what to do with it. The
RCU_STRICT_GRACE_PERIOD option is meant to make RCU usage bugs easier
to detect, at the risk of false positives. My concern is that this
patch perturbs 32-bit x86 builds just enough for the softlockup
detector to start complaining about the rcu_gp::strict_work_handler
work item.

Unless this causes actual boot failures, all I can assume is that this
is a false-positive report. Nothing in this patch touches workqueues or
object lifetimes, so my best guess is that this is a side effect of
instruction cache layout or something similar.