Message-ID: <48AD5EE0.8070407@novell.com>
Date:	Thu, 21 Aug 2008 08:26:08 -0400
From:	Gregory Haskins <ghaskins@...ell.com>
To:	Ingo Molnar <mingo@...e.hu>
CC:	Peter Zijlstra <peterz@...radead.org>,
	Nick Piggin <nickpiggin@...oo.com.au>,
	vatsa <vatsa@...ibm.com>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	"D. Bahi" <dbahi@...ell.com>
Subject: Re: [PATCH] sched: properly account IRQ and RT load in SCHED_OTHER
 load balancing

Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@...ell.com> wrote:
>
>   
>> I haven't had a chance to review the code thoroughly yet, but I had 
>> been working on a similar fix and know that this is sorely needed.  
>> So...
>>     
>
> btw., why exactly does this patch speed up certain workloads? I'm not 
> quite sure about the exact reasons of that.
>
> 	Ingo
>   

I used to have a great demo for the prototype I was working on, but I'd 
have to dig it up.  The gist of it is that the pre-patched scheduler 
basically gets thrown for a complete loop in the presence of a mixed 
CFS/RT environment.  This isn't a PREEMPT_RT specific problem per se, 
though PREEMPT_RT does bring the problem to the forefront since it has 
so many active RT tasks by default (for the IRQs, etc) which make it 
more evident.

Since an RT task's previous way of declaring "load" did not actually 
express the true nature of the RQ load, CFS tasks would have a few 
really nasty things happen to them while trying to run on the system 
simultaneously.  One of them was that you could starve out CFS tasks 
from certain cores (even though there was plenty of CPU bandwidth 
available elsewhere) and the load-balancer would think everything is 
fine and thus fail to make adjustments.

Say you have a 4 core system.  You could, for instance, get into a 
situation where the softirq-net-rx thread was consuming 80% of core 0, 
yet the load balancer would still spread, say, a 40 thread CFS load 
evenly across all cores (approximately 10 per core, though you would 
account for the "load" that the softirq thread contributed too).  The 
threads on the other cores would of course enjoy 100% bandwidth, while 
the ~10 threads on core 0 would only see 1/5th of that bandwidth.
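
To put numbers on the example above, here is a minimal sketch in Python.
It is illustrative arithmetic only, not kernel code: the function and
variable names are made up, and it ignores the small shift in thread
placement that the softirq thread's own "load" would cause.

```python
# Illustrative arithmetic only, not kernel code. All names are hypothetical.
CORES = 4
CFS_THREADS = 40
# Fraction of each core consumed by RT work; the softirq-net-rx thread
# from the example eats 80% of core 0.
RT_USAGE = [0.80, 0.0, 0.0, 0.0]


def per_thread_share_even_spread(cores, threads, rt_usage):
    """CPU fraction each CFS thread gets when threads are spread evenly."""
    threads_per_core = threads / cores
    return [(1.0 - rt) / threads_per_core for rt in rt_usage]


shares = per_thread_share_even_spread(CORES, CFS_THREADS, RT_USAGE)
for cpu, share in enumerate(shares):
    print(f"core {cpu}: {share:.1%} of a CPU per CFS thread")
```

Under these assumptions a thread on core 0 ends up with about one fifth
of what its peers on the other cores get, which is the lopsidedness
described above.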

What it comes down to is that the CFS load should have been evenly 
distributed across the available bandwidth of 3*100% + 1*20%, not 4*100% 
as it does today.  The net result is that the application performs in a 
very lopsided manner, with some threads getting significantly less (or 
sometimes zero!) cpu time compared to their peers.  You can make this 
more obvious by nice'ing the CFS load up as high as it will go, which 
will approximate 1/2 of the load of the softirq (since RT tasks 
previously enjoyed a 2*MAX_SCHED_OTHER_LOAD rating).
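
As a rough illustration of what "evenly distributed across the available
bandwidth" would mean for the 4 core example, the sketch below spreads
the 40 threads in proportion to what each core has left over.  It is a
toy model with hypothetical names, assuming equally weighted threads and
allowing fractional thread counts; it is not how the balancer is coded.

```python
# Toy model, not kernel code. All names are hypothetical.
def proportional_spread(threads, rt_usage):
    """Split threads across cores in proportion to leftover bandwidth."""
    available = [1.0 - rt for rt in rt_usage]
    total = sum(available)  # 3*100% + 1*20% in the example
    return [threads * a / total for a in available]


rt_usage = [0.80, 0.0, 0.0, 0.0]
ideal = proportional_spread(40, rt_usage)

# Bandwidth each thread would see under that placement.
per_thread = [(1.0 - rt) / n for rt, n in zip(rt_usage, ideal)]

for cpu, (n, share) in enumerate(zip(ideal, per_thread)):
    print(f"core {cpu}: {n:.1f} threads, {share:.1%} of a CPU each")
```

With this placement core 0 carries far fewer threads than the others,
and every thread sees roughly the same slice of CPU time regardless of
which core it landed on.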

I have observed this phenomenon (and its fix) while looking at things 
like network intensive workloads.  I'm sure there are plenty of others 
that could cause similar ripples.

The fact is, the scheduler treats "load" to mean certain things which 
simply did not apply to RT tasks.  As you know very well, I'm sure ;), 
"load" is a metric which expresses the share of the cpu that will be 
consumed and this is used by the load balancer to make its decisions.  
However, you can put whatever rating you want on an RT task and it would 
always be irrelevant.  RT tasks run as frequently and as long as they 
want (w.r.t. SCHED_OTHER) independent of what their load rating implies 
to the balancer, so you cannot make an accurate assessment of the true 
"available shares".  This is why the load-balancer would become confused 
and fail to see true imbalance in a mixed environment.  Fixing this, as 
Peter has attempted to do, will result in a much better distribution of 
SCHED_OTHER tasks across the true available bandwidth, and thus improve 
overall performance.
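
One way to picture why the balancer misses the imbalance is to compare
the per-core load as it is counted with the same load scaled by what each
core can actually give to SCHED_OTHER.  The sketch below is a toy model
under that assumption; the names are hypothetical and it is not a
description of how Peter's patch is implemented.

```python
# Toy model of the idea, not the patch. All names are hypothetical.
def imbalance(loads, capacities):
    """Spread between the most and least loaded core, per unit of capacity."""
    scaled = [load / cap for load, cap in zip(loads, capacities)]
    return max(scaled) - min(scaled)


# 10 equally weighted CFS threads on each of 4 cores.
cfs_load = [10, 10, 10, 10]

# Old view: every core is assumed to offer 100% to SCHED_OTHER.
naive = imbalance(cfs_load, [1.0, 1.0, 1.0, 1.0])

# Capacity-aware view: core 0 only has 20% left after RT/IRQ work.
aware = imbalance(cfs_load, [0.2, 1.0, 1.0, 1.0])

print(f"naive imbalance: {naive}")
print(f"capacity-aware imbalance: {aware}")
```

In the naive view the cores look perfectly balanced, so nothing gets
moved.  Once the reduced capacity of core 0 is factored in, the same
placement shows up as a large imbalance that the balancer can act on.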

In previous discussions with people, I had always used a metaphor of a 
stream.  A system running SCHED_OTHER tasks is like a smooth running 
stream, but dispatching an RT task (or an IRQ, even) is like throwing a 
boulder into the water.  It makes a big disruptive splash and causes 
turbulent white water behind it.  And the stream has no influence over 
the size of the boulder, its placement in the stream, nor how long it 
will be staying.

This fix (at least in concept) allows it to become more like gently 
slipping a streamlined aerodynamic object into the water.  The stream 
still cannot do anything about the size or placement of the object, but 
it can at least flow around it and smoothly adapt to the reduced volume 
of water that the stream can carry. :)

HTH
-Greg
