linux-kernel - Re: Subject: Warning in workqueue.c


Message-ID: <20140213204102.GC17608@htj.dyndns.org>
Date:	Thu, 13 Feb 2014 15:41:02 -0500
From:	Tejun Heo <tj@...nel.org>
To:	"Jason J. Herne" <jjherne@...ux.vnet.ibm.com>
Cc:	Lai Jiangshan <laijs@...fujitsu.com>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: Subject: Warning in workqueue.c

Hello,

(cc'ing Ingo and Peter)

On Thu, Feb 13, 2014 at 12:58:10PM -0500, Jason J. Herne wrote:
> [ 5779.795687] ------------[ cut here ]------------
> [ 5779.795695] WARNING: at kernel/workqueue.c:2159
....
> [ 5779.795844] XXX: worker->flags=0x1 pool->flags=0x0 cpu=4 pool->cpu=5(1) rescue_wq=          (null)
> [ 5779.795848] XXX: last_unbind=-44 last_rebind=0 last_rebound_clear=0 nr_exected_after_rebound_clear=0
> [ 5779.795852] XXX: sleep=-39 wakeup=0
> [ 5779.795855] XXX: cpus_allowed=5
> [ 5779.795857] XXX: cpus_allowed_after_rebinding=5
> [ 5779.795861] XXX: after schedule(), cpu=4
> 
> You had asked about reproducing this. This is on the S390 platform,
> I'm not sure if that makes any difference.
> 
> The workload is:
> 2 processes onlining random cpus in a tight loop by using 'echo 1 > /sys/bus/cpu.../online'
> 2 processes offlining random cpus in a tight loop by using 'echo 0 > /sys/bus/cpu.../online'
> Otherwise, fairly idle system. load average: 5.82, 6.27, 6.27
> 
> The machine has 10 processors.
> The warning message some times hits within a few minutes on starting
> the workload. Other times it takes several hours.
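
For anyone wanting to reproduce this, the workload quoted above could be sketched roughly as follows. This is an illustration under assumptions, not the script Jason used: the sysfs path is elided in the report, so it is taken as a caller-supplied template, threads stand in for the four processes, and the names toggle_loop, run_workload and the write hook are hypothetical.

```python
import random
import threading


def toggle_loop(path_template, cpus, value, iterations, write=None, seed=None):
    """Write `value` ("1" = online, "0" = offline) to randomly chosen CPUs.

    path_template is an assumption: a string with a "{cpu}" placeholder that
    the caller fills in with the real per-CPU "online" file on their system.
    """
    rng = random.Random(seed)
    if write is None:
        def write(path, data):
            with open(path, "w") as f:
                f.write(data)
    done = 0
    for _ in range(iterations):
        cpu = rng.choice(cpus)
        try:
            write(path_template.format(cpu=cpu), value)
            done += 1
        except OSError:
            # The loops race with each other, so a failed write is expected
            # now and then and is simply skipped.
            pass
    return done


def run_workload(path_template, cpus, iterations, write=None):
    """Run 2 onlining and 2 offlining loops concurrently, as in the report."""
    threads = [
        threading.Thread(
            target=toggle_loop,
            args=(path_template, cpus, value, iterations, write),
        )
        for value in ("1", "1", "0", "0")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Passing a recording function as write gives a dry run that touches nothing; leaving it out performs real writes, which needs root and takes CPUs offline on the machine it runs on.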

Ingo, Peter, Jason is reporting a workqueue warning triggered by a
worker running on the wrong CPU, which reproduces fairly reliably
with the above workload on s390.  The weird thing is that everything
looks correct from the workqueue side.  The worker has the proper
cpus_allowed set and the CPU it's supposed to run on is online, and
yet the worker is on the wrong CPU.  Even an explicit schedule()
after detecting the condition doesn't change the situation.  Any
ideas?
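
As a rough userspace analogue of the condition being warned about (a task found on a CPU outside its allowed set), the check could be sketched as below. This is not the workqueue code: it only illustrates the comparison, using the Linux-only os.sched_getaffinity() call and the "processor" field (field 39) of /proc/self/stat. The helper names are hypothetical, and the example values mirror the report on the assumption that cpus_allowed=5 in the debug output means CPU 5 rather than a bitmask.

```python
import os


def parse_last_cpu(stat_line):
    """Return field 39 ("processor") from a /proc/<pid>/stat line.

    The command name (field 2) may itself contain spaces or parentheses,
    so everything up to the last ")" is dropped before splitting.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 sits at index 39 - 3.
    return int(rest[36])


def is_on_wrong_cpu(cpu, allowed):
    """True if `cpu` is not in the set of CPUs the task may run on."""
    return cpu not in allowed


def current_cpu():
    """CPU the calling process last ran on (Linux only)."""
    with open("/proc/self/stat") as f:
        return parse_last_cpu(f.read())


if __name__ == "__main__":
    # Values as read from the report: running on CPU 4, allowed on CPU 5.
    print(is_on_wrong_cpu(4, {5}))
    # Live check for this process; expected to print False on a sane system.
    print(is_on_wrong_cpu(current_cpu(), os.sched_getaffinity(0)))
```

In the failing case from the report the first comparison is true and, per the debug output, stays true after schedule().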

Jason, I don't have many ideas from the workqueue side.  Have you been
running this test with older kernels too?  Can you confirm whether
this failure is something recent?  Bisection would be awesome, but
just confirming that, say, 3.12 doesn't have this issue would be very
helpful.

Thanks!

-- 
tejun