Message-ID: <20140225103726.GJ9987@twins.programming.kicks-ass.net>
Date:	Tue, 25 Feb 2014 11:37:26 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	Tejun Heo <tj@...nel.org>
Cc:	"Jason J. Herne" <jjherne@...ux.vnet.ibm.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>
Subject: Re: Subject: Warning in workqueue.c

On Mon, Feb 24, 2014 at 01:35:01PM -0500, Tejun Heo wrote:

> That's a bummer but it at least isn't a very new regression.  Peter,
> any ideas on debugging this?  I can make workqueue play a block /
> unblock dance to try to work around the issue, but that'd be very
> yucky.  It'd be great to root-cause where the cpu selection anomaly is
> coming from.

I'm assuming you're using set_cpus_allowed_ptr() to flip them between
CPUs; the patch below adds some error paths to that code. In particular,
we propagate a __migrate_task() failure (it returns the number of tasks
migrated, so 0 means nothing moved) through stop_one_cpu() into
set_cpus_allowed_ptr().

This way we can see if there was a problem with the migration.

You should now be able to reliably use the return value of
set_cpus_allowed_ptr() to tell whether the task is running on a CPU in
its allowed mask.
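
For illustration only (not part of the patch; the helper name below is
made up), a caller could then do something along these lines:

#include <linux/cpumask.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Sketch only: check whether a rebind actually took effect. */
static int try_rebind(struct task_struct *p, const struct cpumask *mask)
{
	int ret = set_cpus_allowed_ptr(p, mask);

	if (ret)
		/* Migration failed; p may still be running outside 'mask'. */
		pr_warn("rebind of %s failed with %d\n", p->comm, ret);

	return ret;
}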

I've also included an #if 0 retry loop for the failure case, but I
suspect it might end up deadlocking your machine if you hit it just
wrong: something like the waking CPU endlessly trying to migrate the
task over while the wakee CPU is waiting for the completion of something
from the waking CPU.

But it's worth a prod, I suppose.
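
If we do want to enable that retry, an untested variant that bounds the
number of attempts (the 'tries' counter and the limit of 3 below are
made up) should at least avoid spinning forever:

		/*
		 * Sketch only: a drop-in for the #if 0 block in the
		 * patch below; 'tries' would need to be declared at the
		 * top of set_cpus_allowed_ptr().
		 */
		if (ret && ++tries < 3) {
			rq = task_rq_lock(p, &flags);
			goto again;
		}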


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84b23cec0aeb..4c384efac8b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4554,18 +4554,28 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 
 	do_set_cpus_allowed(p, new_mask);
 
+again: __maybe_unused
+
 	/* Can the task run on the task's current CPU? If so, we're done */
-	if (cpumask_test_cpu(task_cpu(p), new_mask))
+	if (cpumask_test_cpu(task_cpu(p), tsk_cpus_allowed(p)))
 		goto out;
 
-	dest_cpu = cpumask_any_and(cpu_active_mask, new_mask);
+	dest_cpu = cpumask_any_and(cpu_active_mask, tsk_cpus_allowed(p));
 	if (p->on_rq) {
 		struct migration_arg arg = { p, dest_cpu };
+
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, p, &flags);
-		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		ret = stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+#if 0
+		if (ret) {
+			rq = task_rq_lock(p, &flags);
+			goto again;
+		}
+#endif
 		tlb_migrate_finish(p->mm);
-		return 0;
+
+		return ret;
 	}
 out:
 	task_rq_unlock(rq, p, &flags);
@@ -4679,15 +4689,18 @@ void sched_setnuma(struct task_struct *p, int nid)
 static int migration_cpu_stop(void *data)
 {
 	struct migration_arg *arg = data;
+	int ret = 0;
 
 	/*
 	 * The original target cpu might have gone down and we might
 	 * be on another cpu but it doesn't matter.
 	 */
 	local_irq_disable();
-	__migrate_task(arg->task, raw_smp_processor_id(), arg->dest_cpu);
+	if (!__migrate_task(arg->task, raw_smp_processor_id(), arg->dest_cpu))
+		ret = -EAGAIN;
 	local_irq_enable();
-	return 0;
+
+	return ret;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
--