Date: Fri, 3 May 2024 18:01:49 -0700
From: John Stultz <jstultz@...gle.com>
To: Tejun Heo <tj@...nel.org>, Lai Jiangshan <jiangshanlai@...il.com>
Cc: Will Deacon <will@...nel.org>, Ingo Molnar <mingo@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>, LKML <linux-kernel@...r.kernel.org>, 
	kernel-team@...roid.com
Subject: WW_MUTEX_SELFTEST hangs w/ 6.9-rc workqueue changes

Hey All,
   While doing some local testing, I've started seeing boot stalls
with CONFIG_WW_MUTEX_SELFTEST enabled on 6.9-rc in a 64-CPU qemu
environment.

I've bisected the problem down to:
  5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
  for unbound workqueues")
plus the fix needed for that change:
  15930da42f89 ("workqueue: Don't call cpumask_test_cpu() with -1 CPU
  in wq_update_node_max_active()")

I've seen problems in the past with the ww_mutex selftest code, so
it's likely a problem in the test itself, but I wanted to raise the
issue so folks were aware and see if there were suggestions for a
solution.

It seems to get stuck in __test_cycle() after a few runs when it hits
flush_workqueue()
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/test-ww_mutex.c#n344

That seems to be because, once the various work functions are queued,
not all of them get a chance to run (they use a circular chain of
completions, so the 0th workfunc won't finish until after the last,
nrthreads-th, workfunc runs).
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/test-ww_mutex.c#n295

I'm noticing this happens when the test gets to nrthreads=9 (the test
usually goes up to NR_CPUS): we queue work items 0 through 8, but the
9th work function never seems to run.  Looking at __queue_work(), I do
see pwq_tryinc_nr_active() fail for that 9th work struct, and we end
up inserting the work as inactive.

I notice the change that uncovers this issue (5797b1c18919) both
tweaks pwq_tryinc_nr_active() and sets WQ_DFL_MIN_ACTIVE to 8, so
maybe that's a hint that the test is abusing the number of queued
work functions? Though that seems odd, because that's the min, not the
max (which seems to be 512).
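
To make the suspected failure mode concrete, here is a rough userspace
sketch of the arithmetic. It is a toy model, not kernel code: the
function name stuck_items() and the cap parameter are made up for
illustration. It assumes that an active work item keeps its nr_active
slot while it sleeps, and that (per the circular chain described above)
no work item can finish until every one of them has started. Whether
the effective limit really is 8 here is exactly the open question.

```c
/*
 * Toy model only (hypothetical helper, not taken from the kernel).
 *
 * Assumptions:
 *  - at most `cap` queued work items may be active at once;
 *  - an active item holds its slot until it finishes, even while
 *    it sleeps waiting on a completion;
 *  - no item can finish until all `nthreads` items have started.
 *
 * Returns how many queued items never get to start.
 */
static unsigned int stuck_items(unsigned int nthreads, unsigned int cap)
{
	unsigned int started = 0, active = 0;

	/* Admit queued items in order while slots are free. */
	while (started < nthreads && active < cap) {
		started++;
		active++;
	}

	/* Everyone started: the chain can resolve and the slots drain. */
	if (started == nthreads)
		return 0;

	/* Nobody can finish, so no slot is ever released. */
	return nthreads - started;
}
```

In this model stuck_items(8, 8) returns 0 and stuck_items(9, 8) returns
1, which lines up with the stall first showing up at nrthreads=9. It
says nothing about why the limit would be 8 rather than 512.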

Anyway, let me know if there's anything further I can share to help
debug this. I'll continue digging here as well.

thanks
-john
