[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080508120130.GA2860@elte.hu>
Date: Thu, 8 May 2008 14:01:30 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>,
Andi Kleen <andi@...stfloor.org>,
Matthew Wilcox <matthew@....cx>,
LKML <linux-kernel@...r.kernel.org>,
Alexander Viro <viro@....linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>
Subject: [patch] speed up / fix the new generic semaphore code (fix AIM7
40% regression with 2.6.26-rc1)
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > > That said, "idle 0%" is easy when you spin. Do you also have
> > > actual performance numbers?
> >
> > Yes. My conclusion is based on the actual number. cpu idle 0% is
> > just a behavior it should be.
>
> Thanks, that's all I wanted to verify.
>
> I'll leave this overnight, and see if somebody has come up with some
> smart and wonderful patch. And if not, I think I'll apply mine as
> "known to fix a regression", and we can perhaps then improve on things
> further from there.
hey, i happen to have such a smart and wonderful patch =B-)
i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.
Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.
The problem comes from the following code. The new semaphore code does
this on down():
spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);
and this on up():
spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
where __up() does:
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
and where __down() does this in essence:
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}
the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.
That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of contention!
The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet.
So if the old owner, even if just a few instructions later, does a
down() [lock_kernel()] again, it will be blocked and will have to wait
on the new owner to eventually be scheduled (possibly on another CPU)!
Or if another other task gets to lock_kernel() sooner than the "new
owner" scheduled, it will be blocked unnecessarily and for a very long
time when there are 2000 tasks running.
I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler from
picking task order intelligently.
This kind of scheduling, with 2000 AIM7 processes running, creates awful
cross-scheduling between those 2000 tasks, causes reduced parallelism, a
throttled runqueue length and a lot of idle time. With increasing number
of CPUs it causes an exponentially worse behavior in AIM7, as the chance
for a newly woken new-owner task to actually run anytime soon is less
and less likely.
Note that it takes just a tiny bit of contention for the 'new-semaphore
catastrophy' to happen: the wakeup latencies get added to whatever small
contention there is, and quickly snowball out of control!
I believe Yanmin's findings and numbers support this analysis too.
The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".
The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:
# v2.6.25 vanilla:
..................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 56096.4 91 207.5 789.7 0.4675
2000 55894.4 94 208.2 792.7 0.4658
# v2.6.26-rc1-166-gc0a1811 vanilla:
...................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 33230.6 83 350.3 784.5 0.2769
2000 31778.1 86 366.3 783.6 0.2648
# v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
...............................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55707.1 92 209.0 795.6 0.4642
2000 55704.4 96 209.0 796.0 0.4642
i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as expected.
Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.
I also ran Linus's spinlock-BKL patch as well:
# v2.6.26-rc1-166-gc0a1811 + Linus-BKL-spinlock-patch:
......................................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55889.0 92 208.3 793.3 0.4657
2000 55891.7 96 208.3 793.3 0.4658
it is about 0.3% faster - but note that that is within the general noise
levels of this test. I'd expect Linus's spinlock-BKL patch to give some
small speedup because the BKL acquire times are short and 2000 tasks
running all at once really increases the context-switching cost and most
BKL contentions are within the cost of context-switching.
But i believe the better solution for that is to remove BKL use from all
hotpaths, not to hide some of its costs by reintroducing it as a
spinlock. Reintroducing the spinlock based BKL would have other
disadvantages as well: it could reintroduce per-CPU-ness assumptions in
BKL-using code and other complications as well. It's also not a very
realistic workload - with 2000 tasks running the system was barely
serviceable.
I'd much rather make BKL costs more apparent and more visible - but 50%
regression was of course too much. But 0.3% for a 2000-tasks workload,
which is near the noise level ... is acceptable i think - especially as
this discussion has now reinvigorated the remove-the-BKL discussions and
patches.
Linus, we can do your spinlock-BKL patch too if you feel strongly about
it, but i'd rather not - we fought so hard for the preemptible BKL :-)
The spinlock-based-BKL patch only worked around the real problem i
believe, because it eliminated the use of the suboptimal new semaphore
code: with spinlocks there's no scheduling at all, so the wakeup/locking
bug of the new semaphore code did not apply. It was not about any
fastpath overhead AFAICS. [we'd have seen that with the
CONFIG_PREEMPT_BKL=y code as well, which has been the default setting
since v2.6.8.]
There's another nice side-effect of this speedup patch, the new generic
semaphore code got even smaller:
text data bss dec hex filename
1241 0 0 1241 4d9 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.after
(because the waiter.up complication got removed.)
Longer-term we should look into using the mutex code for the generic
semaphore code as well - but i's not easy due to legacies and it's
outside of the scope of v2.6.26 and outside the scope of this patch as
well.
Hm?
Ingo
Signed-off-by: Ingo Molnar <mingo@...e.hu>
---
kernel/semaphore.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)
Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
unsigned long flags;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
__down(sem);
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore
int result = 0;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
result = __down_interruptible(sem);
+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
int result = 0;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
result = __down_killable(sem);
+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem,
int result = 0;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
result = __down_timeout(sem, jiffies);
+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
unsigned long flags;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(list_empty(&sem->wait_list)))
- sem->count++;
- else
+ sem->count++;
+ if (unlikely(!list_empty(&sem->wait_list)))
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
}
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
struct semaphore_waiter {
struct list_head list;
struct task_struct *task;
- int up;
};
/*
@@ -206,11 +203,11 @@ static inline int __sched __down_common(
struct task_struct *task = current;
struct semaphore_waiter waiter;
- list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
- waiter.up = 0;
for (;;) {
+ list_add_tail(&waiter.list, &sem->wait_list);
+
if (state == TASK_INTERRUPTIBLE && signal_pending(task))
goto interrupted;
if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -221,7 +218,7 @@ static inline int __sched __down_common(
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
- if (waiter.up)
+ if (sem->count > 0)
return 0;
}
@@ -259,6 +256,5 @@ static noinline void __sched __up(struct
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
list_del(&waiter->list);
- waiter->up = 1;
wake_up_process(waiter->task);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists