linux-kernel - Re: [PATCH 2/2] futex: lockless wakeups

Message-ID: <1429538139.2042.10.camel@stgolabs.net>
Date:	Mon, 20 Apr 2015 06:55:39 -0700
From:	Davidlohr Bueso <dave@...olabs.net>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Chris Mason <clm@...com>, Steven Rostedt <rostedt@...dmis.org>,
	fredrik.markstrom@...driver.com, linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/2] futex: lockless wakeups

On Mon, 2015-04-20 at 08:18 +0200, Ingo Molnar wrote:
> Please write a small description we can cite to driver authors once 
> the (inevitable) breakages appear, outlining this new behavior and its 
> implications, so that we can fix any remaining bugs ASAP.

I wouldn't call this new behavior, simply because changing a critical
region should not be labeled as such imho. Other than asking driver
authors to put their schedule() in a loop to confirm that the expected
condition has in fact occurred, I'm not sure what else we can ask them
to do -- as you know, this is not just about futexes.
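
As a rough illustration of the "schedule() in a loop" advice, here is a
small userspace model, not kernel code. Everything in it (struct waiter,
fake_schedule(), wait_once(), wait_in_loop()) is a hypothetical name made
up for this sketch; fake_schedule() merely stands in for a schedule() call
that may return before the awaited condition holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sleeper; not a kernel structure. */
struct waiter {
	bool condition;          /* the event the sleeper actually waits for */
	int  wakeups_until_true; /* simulation: condition holds after N wakeups */
	int  wakeups_seen;       /* number of times the sleeper was woken */
};

/*
 * Simulated schedule(): returns after *a* wakeup. Every wakeup before the
 * N-th one is spurious, i.e. the condition is still false on return.
 */
void fake_schedule(struct waiter *w)
{
	assert(w != NULL);
	w->wakeups_seen++;
	if (w->wakeups_seen >= w->wakeups_until_true)
		w->condition = true;
}

/* Fragile pattern: assumes a single wakeup implies the condition. */
bool wait_once(struct waiter *w)
{
	if (!w->condition)
		fake_schedule(w);
	return w->condition; /* may still be false after a spurious wakeup */
}

/* Robust pattern: re-check the condition after every wakeup. */
bool wait_in_loop(struct waiter *w)
{
	while (!w->condition)
		fake_schedule(w);
	return w->condition; /* always true here */
}
```

With wakeups_until_true set to 3, wait_once() returns with the condition
still false, while wait_in_loop() absorbs the two spurious wakeups and
returns only once the condition holds. In real kernel code the loop would
additionally set the task state before each check of the condition.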

> I'll also let this pending a bit longer than other changes, to make 
> sure we shake out any bugs/regressions triggered by this change.
> 
> Third, it might make sense to add a new 'spurious wakeup injection 
> debug mechanism' that, if enabled in the .config, automatically and 
> continuously inserts spurious wakeups at a given, slightly randomized 
> rate - which would ensure that all kernel facilities can robustly 
> handle spurious wakeups.

I have been testing against this patch from Peter:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..fdf1f68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -214,9 +214,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
 #define TASK_WAKEKILL		128
 #define TASK_WAKING		256
 #define TASK_PARKED		512
-#define TASK_STATE_MAX		1024
+#define TASK_YIELD		1024
+#define TASK_STATE_MAX		2048
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPY"
 
 extern char ___assert_task_state[1 - 2*!!(
 		sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..2c938ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1005,7 +1005,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * ttwu() will sort out the placement.
 	 */
 	WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
-			!p->on_rq);
+			!p->on_rq && !(p->state & TASK_YIELD));
 
 #ifdef CONFIG_LOCKDEP
 	/*
@@ -2743,6 +2743,23 @@ static void __sched __schedule(void)
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+
+			/*
+			 * Provide an auto-yield feature on schedule().
+			 *
+			 * The thought is to avoid a sleep+wakeup cycle
+			 * if simply yielding the cpu will suffice to
+			 * satisfy the required condition.
+			 *
+			 * Assumes the calling schedule() site can deal
+			 * with spurious wakeups.
+			 */
+			if (prev->state & TASK_YIELD) {
+				prev->state &= ~TASK_YIELD;
+				if (rq->nr_running > 1)
+					goto no_deactivate;
+			}
+
 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
 			prev->on_rq = 0;
 
@@ -2759,6 +2776,7 @@ static void __sched __schedule(void)
 					try_to_wake_up_local(to_wakeup);
 			}
 		}
+	no_deactivate:
 		switch_count = &prev->nvcsw;
 	}
 

> My guess would be that most remaining fragilities against spurious 
> wakeups ought to be in the boot/init phase, so I'd keep an eye out for 
> suspend/resume regressions.

Correct, which is why I'm not that concerned anymore about spurious
wakeups. In fact, the code above now boots and runs correctly on
rather large systems.

> 
> > [...] However there is core code that cannot handle them afaict, and 
> > furthermore tglx does have the point that other events can already 
> > trigger them anyway.
> 
> s/there is core code/there is no core code

heh yes.

Thanks,
Davidlohr
