Message-ID: <alpine.LFD.0.98.0706180857120.14121@woody.linux-foundation.org>
Date:	Mon, 18 Jun 2007 09:34:40 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Ingo Molnar <mingo@...e.hu>
cc:	Miklos Szeredi <miklos@...redi.hu>, cebbert@...hat.com,
	chris@...ee.ca, linux-kernel@...r.kernel.org, tglx@...utronix.de,
	akpm@...ux-foundation.org
Subject: Re: [BUG] long freezes on thinkpad t60



On Mon, 18 Jun 2007, Ingo Molnar wrote:
> 
> To test this theory, could you try the patch below, does this fix your 
> hangs too?

I really think this is the wrong approach, although *testing* it makes
sense.

I think we need to handle loops that take, release, and then immediately 
re-take differently.

Such loops are _usually_ of the form where they really just release the 
lock for some latency reason, but in this case I think it's actually just 
a bug.
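
(For contrast, the usual harmless form of such a loop looks roughly like
the sketch below - the helpers are made up, it's only meant to show the
shape of a latency-motivated lock break, where real progress is made
between lock cycles:)

	spin_lock(&lock);
	while (more_work_to_do()) {
		do_one_chunk();
		/*
		 * Drop the lock briefly so that anybody else spinning
		 * on it gets a chance to take it, then re-take it and
		 * keep going.
		 */
		spin_unlock(&lock);
		cpu_relax();
		spin_lock(&lock);
	}
	spin_unlock(&lock);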

The wait_task_inactive() code does:

	if (unlikely(p->array || task_running(rq, p))) {

to decide if it needs to just unlock and repeat, but then to decide if it 
needs to *yield*, it only uses *one* of those tests, namely

	preempted = !task_running(rq, p);
	..
	if (preempted)
		yield();

and I think that's just broken. It basically says:

 - if the task is running, I will busy-loop on getting/releasing the 
   task_rq_lock

and that is the _real_ bug here.

Trying to make the spinlocks do something other than what they do is just 
papering over the real bug. The real bug is that anybody who just 
busy-loops getting a lock is wasting resources so much that we should not 
be at all surprised that some multi-core or NUMA situations will get 
starvation.

Blaming some random Core 2 hardware implementation issue that just makes 
it show up is wrong. It's a software bug, plain and simple.
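
To make that concrete: whenever the target task keeps running, the old
loop degenerates into roughly this (a simplified sketch of the behaviour,
not the literal code):

	repeat:
		rq = task_rq_lock(p, &flags);	/* take the rq lock.. */
		running = task_running(rq, p);
		task_rq_unlock(rq, &flags);	/* ..and drop it immediately */
		if (running) {
			cpu_relax();
			goto repeat;		/* hammer the same lock again */
		}

Because the task keeps running, "preempted" never gets set, so the old
code never yields - it just cycles the lock as fast as it can, and every
iteration bounces that lock's cacheline around.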

So how about this diff? The diff looks big, but the *code* is actually 
simpler and shorter; I just added tons of comments, which is what blows it 
up.

The new *code* looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
	        rq = task_rq(p);
	        while (task_running(rq, p))
	                cpu_relax();

		/* Get the *real* values */
	        rq = task_rq_lock(p, &flags);
	        running = task_running(rq, p);
	        array = p->array;
	        task_rq_unlock(rq, &flags);

		/* Check them.. */
	        if (unlikely(running)) {
	                cpu_relax();
	                goto repeat;
	        }

	        if (unlikely(array)) {
	                yield();
	                goto repeat;
	        }

and while I haven't tested it, it looks fairly obviously correct, even if 
it's a bit subtle.

Basically, that first "while()" loop is done entirely without any locking 
at all, and so it's possibly "incorrect", but we don't care. Both the 
runqueue we use and the "task_running()" check may be looking at stale 
data, but they won't oops - they just mean that we might get the wrong 
result due to the lack of locking.

So then we get the proper (and careful) rq lock, and check the 
running/runnable state _safely_. And if it turns out that our 
quick-and-dirty and unsafe loop was wrong after all, we just go back and 
try it all again.
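
Stripped of the scheduler details, the shape of the fix is just this (a
self-contained sketch - the struct and the names are made up purely for
illustration):

	struct thing {
		spinlock_t lock;
		int active;	/* analogue of "currently running" */
		int queued;	/* analogue of "still on the runqueue" */
	};

	static void wait_thing_inactive(struct thing *t)
	{
		int active, queued;

		for (;;) {
			/* Unlocked, optimistic polling - may see stale values */
			while (t->active)
				cpu_relax();

			/* Authoritative snapshot, taken under the lock */
			spin_lock(&t->lock);
			active = t->active;
			queued = t->queued;
			spin_unlock(&t->lock);

			if (active) {	/* the optimistic poll was wrong - retry */
				cpu_relax();
				continue;
			}
			if (queued) {	/* could still become active - be nice */
				yield();
				continue;
			}
			return;		/* neither active nor queued: we're done */
		}
	}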

Safe, simple, efficient. And we don't ever hold the lock for long times, 
and we will never *loop* by taking, releasing and re-taking the lock. 

Hmm? Untested, I know. Maybe I overlooked something. But even the 
generated assembly code looks fine (much better than it looked before!).

		Linus

----
 kernel/sched.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 13cdab3..66445e1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1159,21 +1159,72 @@ void wait_task_inactive(struct task_struct *p)
 {
 	unsigned long flags;
 	struct rq *rq;
-	int preempted;
+	struct prio_array *array;
+	int running;
 
 repeat:
+	/*
+	 * We do the initial early heuristics without holding
+	 * any task-queue locks at all. We'll only try to get
+	 * the runqueue lock when things look like they will
+	 * work out!
+	 */
+	rq = task_rq(p);
+
+	/*
+	 * If the task is actively running on another CPU
+	 * still, just relax and busy-wait without holding
+	 * any locks.
+	 *
+	 * NOTE! Since we don't hold any locks, it's not
+	 * even sure that "rq" stays as the right runqueue!
+	 * But we don't care, since "task_running()" will
+	 * return false if the runqueue has changed and p
+	 * is actually now running somewhere else!
+	 */
+	while (task_running(rq, p))
+		cpu_relax();
+
+	/*
+	 * Ok, time to look more closely! We need the rq
+	 * lock now, to be *sure*. If we're wrong, we'll
+	 * just go back and repeat.
+	 */
 	rq = task_rq_lock(p, &flags);
-	/* Must be off runqueue entirely, not preempted. */
-	if (unlikely(p->array || task_running(rq, p))) {
-		/* If it's preempted, we yield.  It could be a while. */
-		preempted = !task_running(rq, p);
-		task_rq_unlock(rq, &flags);
+	running = task_running(rq, p);
+	array = p->array;
+	task_rq_unlock(rq, &flags);
+
+	/*
+	 * Was it really running after all now that we
+	 * checked with the proper locks actually held?
+	 *
+	 * Oops. Go back and try again..
+	 */
+	if (unlikely(running)) {
 		cpu_relax();
-		if (preempted)
-			yield();
 		goto repeat;
 	}
-	task_rq_unlock(rq, &flags);
+
+	/*
+	 * It's not enough that it's not actively running,
+	 * it must be off the runqueue _entirely_, and not
+	 * preempted!
+	 *
+	 * So if it was still runnable (but just not actively
+	 * running right now), it's preempted, and we should
+	 * yield - it could be a while.
+	 */
+	if (unlikely(array)) {
+		yield();
+		goto repeat;
+	}
+
+	/*
+	 * Ahh, all good. It wasn't running, and it wasn't
+	 * runnable, which means that it will never become
+	 * running in the future either. We're all done!
+	 */
 }
 
 /***