Date:	Thu, 13 Feb 2014 15:41:02 -0500
From:	Tejun Heo <tj@...nel.org>
To:	"Jason J. Herne" <jjherne@...ux.vnet.ibm.com>
Cc:	Lai Jiangshan <laijs@...fujitsu.com>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: Subject: Warning in workqueue.c

Hello,

(cc'ing Ingo and Peter)

On Thu, Feb 13, 2014 at 12:58:10PM -0500, Jason J. Herne wrote:
> [ 5779.795687] ------------[ cut here ]------------
> [ 5779.795695] WARNING: at kernel/workqueue.c:2159
....
> [ 5779.795844] XXX: worker->flags=0x1 pool->flags=0x0 cpu=4 pool->cpu=5(1) rescue_wq=          (null)
> [ 5779.795848] XXX: last_unbind=-44 last_rebind=0 last_rebound_clear=0 nr_exected_after_rebound_clear=0
> [ 5779.795852] XXX: sleep=-39 wakeup=0
> [ 5779.795855] XXX: cpus_allowed=5
> [ 5779.795857] XXX: cpus_allowed_after_rebinding=5
> [ 5779.795861] XXX: after schedule(), cpu=4
> 
> You had asked about reproducing this. This is on the S390 platform,
> I'm not sure if that makes any difference.
> 
> The workload is:
> 2 processes onlining random cpus in a tight loop by using 'echo 1 > /sys/bus/cpu.../online'
> 2 processes offlining random cpus in a tight loop by using 'echo 0 > /sys/bus/cpu.../online'
> Otherwise, fairly idle system. load average: 5.82, 6.27, 6.27
> 
> The machine has 10 processors.
> The warning message some times hits within a few minutes on starting
> the workload. Other times it takes several hours.
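
[Editor's sketch, not part of the original report.] The workload quoted above can be approximated with a small script. This is only a minimal sketch under stated assumptions: the names write_online and hotplug_loop are hypothetical, and the sysfs "online" control files are passed in by the caller because the path in the report is elided. Writing to the real control files requires root.

```python
import random
import time


def write_online(control_file, value):
    """Write "0" or "1" to one CPU 'online' control file.

    Returns True on success, False if the write was refused.
    """
    try:
        with open(control_file, "w") as f:
            f.write(value)
        return True
    except OSError:
        # The kernel may refuse the request (for example for a CPU that
        # cannot be offlined), or the write may race with another writer.
        return False


def hotplug_loop(control_files, value, iterations, rng=None, delay=0.0):
    """Repeatedly write `value` to a randomly chosen control file.

    control_files: list of paths supplied by the caller (assumption: one
                   'online' file per CPU).
    Returns the number of successful writes.
    """
    rng = rng or random.Random()
    ok = 0
    for _ in range(iterations):
        if write_online(rng.choice(control_files), value):
            ok += 1
        if delay:
            time.sleep(delay)
    return ok
```

To mimic the report, one would presumably run four such loops concurrently as separate processes, two with value "1" and two with value "0", over the same list of control files.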

Ingo, Peter, Jason is reporting a workqueue warning that triggers
because a worker is running on the wrong CPU.  It reproduces fairly
reliably with the above workload on s390.  The weird thing is that
everything looks correct from the workqueue side: the worker has the
proper cpus_allowed set and the CPU it's supposed to run on is
online, yet the worker is on the wrong CPU.  Even an explicit
schedule() after detecting the condition doesn't change the
situation.  Any ideas?
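
[Editor's sketch, not part of the original message.] The invariant being violated here is "a task runs only on a CPU in its allowed mask" (the debug output shows cpu=4 with cpus_allowed=5). The following is a userspace analogy only, not the kernel-side check in workqueue.c. It assumes a Linux /proc layout in which field 39 of /proc/<pid>/stat is the CPU the task last ran on; the function names are hypothetical.

```python
import os


def last_cpu_from_stat(stat_text):
    """Return the 'processor' field (field 39) of a /proc/<pid>/stat line."""
    # Field 2 (comm) may contain spaces or parentheses, so split after the
    # last ')' rather than on whitespace from the start.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is rest[36].
    return int(rest[36])


def on_allowed_cpu(pid=0):
    """Check whether a task last ran on a CPU in its affinity mask.

    pid=0 means the calling process. Returns (ok, cpu, allowed).
    """
    path = "/proc/self/stat" if pid == 0 else "/proc/%d/stat" % pid
    with open(path) as f:
        cpu = last_cpu_from_stat(f.read())
    allowed = os.sched_getaffinity(pid)
    return cpu in allowed, cpu, allowed
```

On a healthy system on_allowed_cpu() is expected to report ok as True; the warning in this thread corresponds to the kernel-side equivalent of that check failing for a workqueue worker.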

Jason, I don't have many ideas from the workqueue side.  Have you been
running this test with older kernels too?  Can you confirm whether
this failure is something recent?  Bisection would be awesome, but just
confirming that, say, 3.12 doesn't have this issue would be very helpful.

Thanks!

-- 
tejun