Message-Id: <20260211-wqstall_start-at-v1-0-bd9499a18c19@debian.org>
Date: Wed, 11 Feb 2026 04:29:14 -0800
From: Breno Leitao <leitao@...ian.org>
To: Tejun Heo <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, Omar Sandoval <osandov@...ndov.com>,
kernel-team@...a.com, Breno Leitao <leitao@...ian.org>
Subject: [PATCH 0/4] workqueue: Detect stalled in-flight workers
The workqueue watchdog detects pools that haven't made forward progress
by checking whether pending work items on the worklist have been waiting
too long. However, this approach has a blind spot: if a pool has only
one work item and that item has already been dequeued and is executing on
a worker, the worklist is empty and the watchdog skips the pool entirely.
This means a single hogged worker with no other pending work is invisible
to the stall detector.
The following work function demonstrates this blind spot:
static void stall_work_fn(struct work_struct *work)
{
	for (;;) {
		mdelay(1000);
		cond_resched();
	}
}
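For completeness, a work item like this could be queued from a small test module along these lines. This is only a sketch: the module boilerplate and the use of system_wq are my assumptions here, not necessarily the exact reproducer.

```c
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

static struct work_struct stall_work;

/* Never returns: hogs its worker while yielding the CPU politely,
 * so nothing else looks wrong except the stuck work item. */
static void stall_work_fn(struct work_struct *work)
{
	for (;;) {
		mdelay(1000);
		cond_resched();
	}
}

static int __init wq_stall_init(void)
{
	INIT_WORK(&stall_work, stall_work_fn);
	/* Once dequeued, the worklist is empty and the old watchdog
	 * never looks at this pool again. */
	queue_work(system_wq, &stall_work);
	return 0;
}
module_init(wq_stall_init);

MODULE_LICENSE("GPL");
```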
Additionally, when the watchdog does report stalled pools, the output
doesn't show how long each in-flight work item has been running, making
it harder to identify which specific worker is stuck.
This series addresses both issues:
Patch 1 fixes a minor semantic inconsistency where pool flags were
checked against a workqueue-level constant (WQ_BH instead of POOL_BH).
No behavioral change since both constants have the same value.
Patch 2 renames pool->watchdog_ts to pool->last_progress_ts to better
describe what the timestamp actually tracks.
Patch 3 adds a current_start timestamp to struct worker, recording when
a work item began executing. This is printed in show_pwq() as elapsed
wall-clock time (e.g., "in-flight: 165:stall_work_fn [wq_stall] for
100s"), giving immediate visibility into how long each worker has been
busy.
Patch 4 introduces pool_has_stalled_worker(), which scans all workers in
a pool's busy_hash for any whose current_start timestamp exceeds the
watchdog threshold. This is called unconditionally for every pool,
independent of worklist state, so a stuck worker is always detected. The
feature is gated behind a new CONFIG_WQ_WATCHDOG_WORKERS option
(disabled by default) under CONFIG_WQ_WATCHDOG.
An option is to drop CONFIG_WQ_WATCHDOG_WORKERS entirely. I've been
running this change on several hosts under workloads (mainly stress-ng)
and have not seen any false positives.
With this series applied, the stall produced by the example above is
reported like this:
BUG: workqueue lockup - worker365:stall_work_fn [wq_stall] stuck in pool cpus=9 node=0 flags=0x0 nice=0 for 2570s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x100
pwq 38: cpus=9 node=0 flags=0x0 nice=0 active=2 refcnt=3
workqueue stall_wq: flags=0x0
---
Breno Leitao (4):
workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
workqueue: Show in-flight work item duration in stall diagnostics
workqueue: Detect stalled in-flight work items with empty worklist
kernel/workqueue.c | 71 ++++++++++++++++++++++++++++++++++++++-------
kernel/workqueue_internal.h | 1 +
lib/Kconfig.debug | 12 ++++++++
3 files changed, 74 insertions(+), 10 deletions(-)
---
base-commit: 9cb8b0f289560728dbb8b88158e7a957e2e90a14
change-id: 20260210-wqstall_start-at-e7319a005ab4
Best regards,
--
Breno Leitao <leitao@...ian.org>