Date:	Fri, 16 May 2014 11:50:42 +0800
From:	Lai Jiangshan <laijs@...fujitsu.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	<jjherne@...ux.vnet.ibm.com>, Sasha Levin <sasha.levin@...cle.com>,
	Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	Dave Jones <davej@...hat.com>, Ingo Molnar <mingo@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: workqueue: WARN at at kernel/workqueue.c:2176

On 05/15/2014 12:52 AM, Jason J. Herne wrote:
> On 05/12/2014 10:17 PM, Sasha Levin wrote:
>> I don't have an easy way to reproduce it as I only saw the bug once, but
>> it happened when I started stressing the CPU hotplug paths by adding and removing
>> CPUs often. Maybe it has something to do with that?
> 
> As per the original report (http://article.gmane.org/gmane.linux.kernel/1643027)
> I am able to reproduce the problem.
> 
> The workload is (on S390 architecture):
> 2 processes onlining random cpus in a tight loop by using 'echo 1 >
> /sys/bus/cpu.../online'
> 2 processes offlining random cpus in a tight loop by using 'echo 0 >
> /sys/bus/cpu.../online'
> Otherwise, fairly idle system. load average: 5.82, 6.27, 6.27
> 
> The machine has 10 processors.
> The warning message sometimes hits within a few minutes of starting the
> workload. Other times it takes several hours.
> 
> 
> -- Jason J. Herne (jjherne@...ux.vnet.ibm.com)
> 
> 


Hi, Peter and other scheduler gurus,

While testing wq-VS-hotplug, I consistently hit a problem in the scheduler,
with the following WARNING:

[   74.765519] WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
[   74.765520] Modules linked in: wq_hotplug(O) fuse cpufreq_ondemand ipv6 kvm_intel kvm uinput snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi e1000e snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer ptp iTCO_wdt iTCO_vendor_support lpc_ich snd mfd_core pps_core soundcore acpi_cpufreq i2c_i801 microcode wmi radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
[   74.765545] CPU: 1 PID: 13 Comm: migration/1 Tainted: G           O  3.15.0-rc3+ #153
[   74.765546] Hardware name: LENOVO ThinkCentre M8200T/  , BIOS 5JKT51AUS 11/02/2010
[   74.765547]  000000000000007c ffff880236199c88 ffffffff814d7d2c 0000000000000000
[   74.765550]  0000000000000000 ffff880236199cc8 ffffffff8103add4 ffff880236199cb8
[   74.765552]  ffffffff81023e1b ffff8802361861c0 0000000000000001 ffff88023fd92b40
[   74.765555] Call Trace:
[   74.765559]  [<ffffffff814d7d2c>] dump_stack+0x51/0x75
[   74.765562]  [<ffffffff8103add4>] warn_slowpath_common+0x81/0x9b
[   74.765564]  [<ffffffff81023e1b>] ? native_smp_send_reschedule+0x2d/0x4b
[   74.765566]  [<ffffffff8103ae08>] warn_slowpath_null+0x1a/0x1c
[   74.765568]  [<ffffffff81023e1b>] native_smp_send_reschedule+0x2d/0x4b
[   74.765571]  [<ffffffff8105c2ea>] smp_send_reschedule+0xa/0xc
[   74.765574]  [<ffffffff8105fe46>] resched_task+0x5e/0x62
[   74.765576]  [<ffffffff81060238>] check_preempt_curr+0x43/0x77
[   74.765578]  [<ffffffff81060680>] __migrate_task+0xda/0x100
[   74.765580]  [<ffffffff810606a6>] ? __migrate_task+0x100/0x100
[   74.765582]  [<ffffffff810606c3>] migration_cpu_stop+0x1d/0x22
[   74.765585]  [<ffffffff810a33c6>] cpu_stopper_thread+0x84/0x116
[   74.765587]  [<ffffffff814d8642>] ? __schedule+0x559/0x581
[   74.765590]  [<ffffffff814dae3c>] ? _raw_spin_lock_irqsave+0x12/0x3c
[   74.765592]  [<ffffffff8105bd75>] ? __smpboot_create_thread+0x109/0x109
[   74.765594]  [<ffffffff8105bf46>] smpboot_thread_fn+0x1d1/0x1d6
[   74.765598]  [<ffffffff81056665>] kthread+0xad/0xb5
[   74.765600]  [<ffffffff810565b8>] ? kthread_freezable_should_stop+0x41/0x41
[   74.765603]  [<ffffffff814e0e2c>] ret_from_fork+0x7c/0xb0
[   74.765605]  [<ffffffff810565b8>] ? kthread_freezable_should_stop+0x41/0x41
[   74.765607] ---[ end trace 662efb362b4e8ed0 ]---
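
The check that fires here is the offline-CPU test at the top of
native_smp_send_reschedule(); roughly the following shape (paraphrased from
arch/x86/kernel/smp.c of this era, so the exact code may differ):

static void native_smp_send_reschedule(int cpu)
{
	/* Refuse to IPI a CPU that is not marked online; this is the WARN above. */
	if (unlikely(cpu_is_offline(cpu))) {
		WARN_ON(1);
		return;
	}
	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
}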

After debugging, I found that the hotplugged-in CPU is active but !online in this case.
The problem was introduced by commit 5fbd036b.
Some code assumes that any CPU in cpu_active_mask is also online, but 5fbd036b breaks
this assumption, so the code relying on it needs to be updated as well.
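
In other words, during the online path there is now a window where a CPU
satisfies cpu_active() but not cpu_online(). A minimal sketch of the violated
invariant (illustration only, not part of any patch):

/*
 * Illustration only: the assumption that 5fbd036b breaks.  With that commit,
 * a CPU being brought up can be in cpu_active_mask before it is in
 * cpu_online_mask, so this check would fire during the hotplug window.
 */
static inline void assert_active_implies_online(unsigned int cpu)
{
	WARN_ON_ONCE(cpu_active(cpu) && !cpu_online(cpu));
}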


Hi, Jason J. Herne and Sasha Levin,

Thank you for testing wq-VS-hotplug.

The following patch is just a workaround. With it applied, the above WARNING
is gone, but I still can't hit the wq problem that you found.

You can use this workaround patch to test wq-VS-hotplug again, or just
wait for the scheduler guys to provide a proper fix.
(An interesting detail: 5fbd036b also touches arch/s390.)
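
If you want to rerun the original stress workload from userspace, a rough
sketch of one worker process is below. The sysfs path and CPU count are my
assumptions based on the report above; adjust them for your machine and run
a couple of instances with argument 1 (online) and a couple with 0 (offline):

/* stress_hotplug.c: toggle random CPUs online/offline in a tight loop. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NR_CPUS 10	/* the reported machine had 10 processors */

int main(int argc, char **argv)
{
	int online = (argc > 1) ? atoi(argv[1]) : 1;	/* 1 = online, 0 = offline */
	char path[64];

	srand(time(NULL));
	for (;;) {
		/* CPU 0 usually cannot be offlined, so pick from 1..NR_CPUS-1. */
		int cpu = 1 + rand() % (NR_CPUS - 1);
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/online", cpu);
		f = fopen(path, "w");
		if (!f)
			continue;
		fprintf(f, "%d\n", online);
		fclose(f);
	}
	return 0;
}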

Thanks,
Lai
---
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a9e710e..253a129 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -726,9 +726,10 @@ void set_cpu_present(unsigned int cpu, bool present)
 
 void set_cpu_online(unsigned int cpu, bool online)
 {
-	if (online)
+	if (online) {
 		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
-	else
+		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
+	} else
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..c1a712d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5043,7 +5043,6 @@ static int sched_cpu_active(struct notifier_block *nfb,
 				      unsigned long action, void *hcpu)
 {
 	switch (action & ~CPU_TASKS_FROZEN) {
-	case CPU_STARTING:
 	case CPU_DOWN_FAILED:
 		set_cpu_active((long)hcpu, true);
 		return NOTIFY_OK;
--
