Message-ID: <48AD5EE0.8070407@novell.com>
Date: Thu, 21 Aug 2008 08:26:08 -0400
From: Gregory Haskins <ghaskins@...ell.com>
To: Ingo Molnar <mingo@...e.hu>
CC: Peter Zijlstra <peterz@...radead.org>,
Nick Piggin <nickpiggin@...oo.com.au>,
vatsa <vatsa@...ibm.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
"D. Bahi" <dbahi@...ell.com>
Subject: Re: [PATCH] sched: properly account IRQ and RT load in SCHED_OTHER
load balancing
Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@...ell.com> wrote:
>
>
>> I haven't had a chance to review the code thoroughly yet, but I had
>> been working on a similar fix and know that this is sorely needed.
>> So...
>>
>
> btw., why exactly does this patch speed up certain workloads? I'm not
> quite sure about the exact reasons of that.
>
> Ingo
>
I used to have a great demo for the prototype I was working on, but I'd
have to dig it up. The gist of it is that the pre-patched scheduler
basically gets thrown for a complete loop in the presence of a mixed
CFS/RT environment. This isn't a PREEMPT_RT specific problem per se,
though PREEMPT_RT does bring the problem to the forefront since it has
so many active RT tasks by default (for the IRQs, etc) which make it
more evident.
Since an RT task's previous way of declaring "load" did not actually
express the true nature of the RQ load, CFS tasks would have a few
really nasty things happen to them while trying to run on the system
simultaneously. One of them was that you could starve out CFS tasks
from certain cores (even though there was plenty of CPU bandwidth
available elsewhere) and the load-balancer would think everything is
fine and thus fail to make adjustments.
Say you have a 4 core system. You could, for instance, get into a
situation where the softirq-net-rx thread was consuming 80% of core 0,
yet the load balancer would still spread, say, a 40 thread CFS load
evenly across all cores (approximately 10 per core, though you would
account for the "load" that the softirq thread contributed too). The
threads on the other cores would of course enjoy 100% bandwidth, while
the ~10 threads on core 0 would only see 1/5th of that bandwidth.
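To put numbers on that, here is a rough back-of-the-envelope sketch in Python. It is purely illustrative and not scheduler code: the helper name per_thread_share is made up, and it assumes the CFS threads on a core split whatever the RT/IRQ work leaves behind evenly.

```python
def per_thread_share(threads_on_core, rt_usage):
    """Fraction of one core each CFS thread gets.

    Assumption (a simplification): the CFS threads on a core evenly split
    whatever bandwidth the RT/IRQ work on that core leaves behind.
    """
    return (1.0 - rt_usage) / threads_on_core


# 4 cores, 40 CFS threads spread evenly, softirq-net-rx eating 80% of core 0.
rt_usage = [0.80, 0.0, 0.0, 0.0]
threads = [10, 10, 10, 10]

shares = [per_thread_share(n, rt) for n, rt in zip(threads, rt_usage)]
for core, share in enumerate(shares):
    print(f"core {core}: {share:.1%} of a core per thread")
```

Under those assumptions a thread on core 0 ends up with about 2% of a core while its peers on the other cores get about 10%, which is the 1/5th ratio described above.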
What it comes down to is that the CFS load should have been evenly
distributed across the available bandwidth of 3*100% + 1*20%, not 4*100%
as it does today. The net result is that the application performs in a
very lopsided manner, with some threads getting significantly less (or
sometimes zero!) cpu time compared to their peers. You can make this
more obvious by nice'ing the CFS load up as high as it will go, which
will approximate 1/2 of the load of the softirq (since RT tasks
previously enjoyed a 2*MAX_SCHED_OTHER_LOAD rating).
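As a sketch of what "evenly distributed across the available bandwidth" would mean for the 40-thread example, the following Python splits threads in proportion to each core's leftover capacity. It is only arithmetic under stated assumptions, not how the patch is implemented: spread_by_capacity is a hypothetical helper, capacities are given as integer percentages, and the rounding scheme (largest remainder) is my own choice.

```python
def spread_by_capacity(n_threads, capacity_pct):
    """Split n_threads across cores in proportion to leftover capacity.

    capacity_pct holds the percentage of each core left for CFS tasks.
    Integer math plus largest-remainder rounding keeps the total exact.
    """
    total = sum(capacity_pct)
    counts = [n_threads * c // total for c in capacity_pct]
    remainders = [n_threads * c % total for c in capacity_pct]
    # Hand out the threads lost to rounding down, largest remainder first.
    leftover = n_threads - sum(counts)
    order = sorted(range(len(capacity_pct)),
                   key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# 1*20% + 3*100% of available bandwidth, 40 CFS threads.
capacity_pct = [20, 100, 100, 100]
counts = spread_by_capacity(40, capacity_pct)
for core, (n, cap) in enumerate(zip(counts, capacity_pct)):
    print(f"core {core}: {n} threads, {cap / n:.1f}% of a core per thread")
```

The ideal split here is 2.5 threads on core 0 and 12.5 on each of the others, so after rounding core 0 carries 2 or 3 threads instead of 10, and every thread sees roughly the same slice of CPU.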
I have observed this phenomenon (and its fix) while looking at things
like network intensive workloads. I'm sure there are plenty of others
that could cause similar ripples.
The fact is, the scheduler treats "load" to mean certain things which
simply did not apply to RT tasks. As you know very well, I'm sure ;),
"load" is a metric which expresses the share of the cpu that will be
consumed and this is used by the load balancer to make its decisions.
However, you can put whatever rating you want on an RT task and it would
always be irrelevant. RT tasks run as frequently and as long as they
want (w.r.t. SCHED_OTHER) independent of what their load rating implies
to the balancer, so you cannot make an accurate assessment of the true
"available shares". This is why the load-balancer would become confused
and fail to see true imbalance in a mixed environment. Fixing this, as
Peter has attempted to do, will result in a much better distribution of
SCHED_OTHER tasks across the true available bandwidth, and thus improve
overall performance.
In previous discussions with people, I had always used a metaphor of a
stream. A system running SCHED_OTHER tasks is like a smooth running
stream, but dispatching an RT task (or an IRQ, even) is like throwing a
boulder into the water. It makes a big disruptive splash and causes
turbulent white water behind it. And the stream has no influence over
the size of the boulder, its placement in the stream, nor how long it
will be staying.
This fix (at least in concept) allows it to become more like gently
slipping a streamlined aerodynamic object into the water. The stream
still cannot do anything about the size or placement of the object, but
it can at least flow around it and smoothly adapt to the reduced volume
of water that the stream can carry. :)
HTH
-Greg