Message-ID: <20080513210513.GA17492@elte.hu>
Date: Tue, 13 May 2008 23:05:13 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Matthew Wilcox <matthew@....cx>,
Sven Wegener <sven.wegener@...aler.net>,
"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>,
Andi Kleen <andi@...stfloor.org>,
LKML <linux-kernel@...r.kernel.org>,
Alexander Viro <viro@....linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [git pull] scheduler fixes
* Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> > Why does it signal to its waiters that "resource is available", when
> > in reality that resource is not available but immediately serialized
> > via a lock? (even if the lock might technically be some _other_
> > object)
>
> So you have 'n' resources available, and you use a counting semaphore
> for that resource counting.
>
> But you'd still need a spinlock to actually protect the list of
> resources themselves. The spinlock protects a different thing than the
> semaphore. The semaphore isn't about mutual exclusion - it's about
> counting resources and waiting when there are too many things in
> flight.
>
> And you seem to think that a counting semaphore is about mutual
> exclusion. It has nothing what-so-ever to do with that.
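[ to make the pattern above concrete, here is a minimal, untested
  sketch of "counting semaphore for the resource count, spinlock for
  the list itself" - the names (resource_sem, resource_lock,
  free_resources, struct resource_item) are made up for illustration: ]

  #include <linux/semaphore.h>
  #include <linux/spinlock.h>
  #include <linux/list.h>

  struct resource_item {
          struct list_head list;
          /* ... whatever the resource is ... */
  };

  static struct semaphore resource_sem;  /* sema_init(&resource_sem, NR_RESOURCES) */
  static DEFINE_SPINLOCK(resource_lock); /* protects free_resources */
  static LIST_HEAD(free_resources);

  static struct resource_item *get_resource(void)
  {
          struct resource_item *res;

          /* counting part: sleep until at least one resource is free */
          down(&resource_sem);

          /* mutual exclusion part: a short list op under the spinlock */
          spin_lock(&resource_lock);
          res = list_first_entry(&free_resources, struct resource_item, list);
          list_del(&res->list);
          spin_unlock(&resource_lock);

          return res;
  }

  static void put_resource(struct resource_item *res)
  {
          spin_lock(&resource_lock);
          list_add(&res->list, &free_resources);
          spin_unlock(&resource_lock);

          /* one more resource available - wakes a waiter if there is one */
          up(&resource_sem);
  }
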
i was reacting to the point that Matthew made:
" The first thing that each task does is grab a spinlock, so if you
put as much in flight as early as possible, you end up with horrible
contention on that spinlock. "
We were talking about the case of double, parallel up()s. My point was
that the best guess is to put two tasks in flight in the synchronization
primitive. Matthew's point was that the wakeups of the two tasks should
be chained: one task gets woken first, which then wakes the second task
in a chain. [ Matthew, i'm paraphrasing your opinion so please correct
me if i'm misinterpreting your point. ]
My argument is that chaining like that in the synchronization primitive
is bad for parallelism in general, because wakeup latencies are simply
too long, and any action required in the context of the "first waker"
throttles parallelism artificially and introduces artificial workload
delays.
Even in the simple list+lock case you are talking about, it's
beneficial to keep as many woken tasks in flight as possible. The
reason is that even with the worst-possible cacheline bouncing
imaginable, it's much better to spin a bit on a spinlock (which is
"horrible bouncing" only on paper - in practice it's nicely ordered
waiting on a FIFO spinlock, with a list op every 100 nsecs or so) than
to implement "soft bouncing" of tasks via an artificial chain of
wakeups.
That artificial chain of wakeups has a latency of a few microseconds
even in the best case - and in the worst case the delay can be
milliseconds - throttling not just the current parallelism of the
workload but also hiding potential future parallelism. The hardware is
so much faster at messaging. [*] [**]
Ingo
[*] Not without qualifications - maybe not so on 1024 CPU systems, but
certainly so on anything sane up to 16-way or so. But if we worry about
1024 CPU systems, i'd suggest first taking a look at the current
practice in kernel/semaphore.c of taking the semaphore's internal
spinlock again _after_ a task has woken up, just to remove itself from
the list of waiters. That is rather unnecessary serialization - at up()
time we are already holding the lock, so we should remove the entry
there. That's what the mutex code does too.
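[ a rough, untested sketch of what that could look like - this assumes
  a semaphore_waiter that carries an 'up' flag the wakee can check
  instead of touching the wait list again: ]

  /* called at up() time, with sem->lock already held: */
  static void __up(struct semaphore *sem)
  {
          struct semaphore_waiter *waiter =
                  list_first_entry(&sem->wait_list,
                                   struct semaphore_waiter, list);

          list_del(&waiter->list);       /* unlink here, under sem->lock,   */
          waiter->up = 1;                /* so the wakee only checks a flag */
          wake_up_process(waiter->task); /* and never re-takes sem->lock    */
  }
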
[**] The only place where i can see staggered/chained wakeups help is in
the specific case when the wakee runs into a heavy lock _right
after_ wakeup. XFS might be running into that and my reply above
talks about that hypothetical scenario.
If it is so then it is handled incorrectly, because in that case we
don't have any true parallelism at the point of up(), and we know it
right there. There is no reason to pretend at that point that we have
more parallelism, when all we do is block on a heavy lock right after
wakeup.
Instead, the correct implementation in that case would be to have a
wait queue for that heavy lock (which can in other words be thought of
as a virtual single resource that is either 'available' or
'unavailable'), and _then_ after that to use a counting semaphore for
the rest of the program flow, which is hopefully more parallel. I.e.
precisely map the true parallelism of the code via the synchronization
primitives: do not over-synchronize it and do not under-synchronize it.
And if there's any doubt which way to err, under-synchronize - because
while the scheduler is rather good at dealing with too many runnable
tasks, it cannot conjure runnable tasks out of thin air.
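[ in code, the structure i mean would be roughly this - untested, and
  heavy_lock, io_slots and do_parallel_work() are hypothetical names
  standing in for the real workload: ]

  #include <linux/mutex.h>
  #include <linux/semaphore.h>

  static DEFINE_MUTEX(heavy_lock);   /* the inherently serial resource */
  static struct semaphore io_slots;  /* sema_init(&io_slots, NR_SLOTS) */

  static void do_parallel_work(void *req);  /* the truly parallel part */

  static void process_request(void *req)
  {
          /* single-resource part: the mutex's own wait queue handles it */
          mutex_lock(&heavy_lock);
          /* ... short serial section ... */
          mutex_unlock(&heavy_lock);

          /* counting part: throttle only the genuinely parallel work */
          down(&io_slots);
          do_parallel_work(req);
          up(&io_slots);
  }
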
Btw., the AIM7 regression was exactly that: in the 50% regressed
workload the semaphore code hid the true parallelism of the workload,
we only had 5 tasks on the runqueues and the scheduler had no chance to
saturate all CPUs. In the "good" case (be that the spinlock-based or
the proper-semaphore-based BKL) there were 2000 runnable tasks on the
runqueues, and the scheduler sorted them out and batched the workload
just fine!