linux-kernel - Re: weakness of runnable load tracking?

Message-ID: <50C053D7.9050505@intel.com>
Date:	Thu, 06 Dec 2012 16:14:15 +0800
From:	Alex Shi <alex.shi@...el.com>
To:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
CC:	Alex Shi <lkml.alex@...il.com>, Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Paul Turner <pjt@...gle.com>,
	lkml <linux-kernel@...r.kernel.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Tejun Heo <tj@...nel.org>
Subject: Re: weakness of runnable load tracking?

On 12/06/2012 02:52 PM, Preeti U Murthy wrote:
> Hi Alex,
>> Hi Paul & Ingo:
>>
>> In short: burst forking/waking tasks have no time to accumulate a load
>> contribution, so their runnable load is taken as zero.
> 
> On performing certain experiments on the way PJT's metric calculates the
> load, I observed a few things. Based on these observations, let me see if I
> can address the issue of why PJT's metric is calculating the load of
> bursty tasks as 0.
> 
> When we speak about a burst-waking task (I will not go into forking
> here), we should also speak about its duty cycle: it burst-wakes for 1ms
> in a 10ms duty cycle, or for 1s in a 10s duty cycle, both
> being 10% tasks wrt their duty cycles. Let's see how load is calculated by
> PJT's metric in each of the above cases.
>            --
>           |  |
>           |  |
> __________|  |
>           A  B
>           1ms
>           <->
>       10ms
> <------------>
>           Example 1
> 
> When the task wakes up at A, it is not yet runnable, and an update of the
> task load takes place. Its runtime so far is 0, and its existing time is
> 10ms. Hence the load is 0/10*1024. Since a scheduler tick happens at B (a
> scheduler tick happens every 1ms, 4ms or 10ms; let us assume 1ms), an
> update of the load takes place. PJT's metric divides the time elapsed
> into 1ms windows. There is just one 1ms window, and hence the runtime is
> 1ms and the load is 1ms/10ms*1024.
> 
> *If the time elapsed between A and B were < 1ms, then PJT's metric
> would not capture it*.

A nice description of this issue. :)
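
The accounting described in Example 1 can be sketched as below. This is a
simplified model of the description quoted above, not the kernel code: the
1ms window, the 1024 scale and the function name windowed_load are
assumptions taken from (or invented for) this illustration.

```python
WINDOW_MS = 1.0   # assumed accounting window, as in the description above
SCALE = 1024      # load of a task that is runnable 100% of the time


def windowed_load(run_ms, period_ms, window_ms=WINDOW_MS):
    """Load of a task that runs run_ms out of every period_ms.

    Only whole windows of runtime are counted, so a burst shorter
    than one window contributes nothing.
    """
    whole_windows = int(run_ms // window_ms)
    counted_ms = whole_windows * window_ms
    return counted_ms / period_ms * SCALE


if __name__ == "__main__":
    print(windowed_load(1.0, 10.0))   # one full window is counted
    print(windowed_load(0.5, 10.0))   # sub-window burst: counted as zero
```

Under this model a 1ms burst in a 10ms cycle yields 1/10 of the scale,
while a 0.5ms burst in the same cycle yields zero.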
> 
> And under these circumstances the load remains 0/10ms*1024=0. This is the
> situation you are pointing out. Let us assume that this cycle continues
> throughout the lifetime of the task; then the load remains at 0. The
> question is whether it is OK for tasks that run for periods < 1ms to be
> termed 0 workloads. If it is fine, then what PJT's metric is doing is
> right. Maybe we should ignore such workloads because they hardly
> contribute to the load. Otherwise we will need to reduce the window of
> load update to < 1ms to capture such loads.
> 
> 
> Just for some additional info, so that we know what happens to different
> kinds of loads with PJT's metric, consider the situation below:
>                              ------
>                             |      |
>                             |      |
> ____________________________|      |
>                             A      B
>                                1s
>                             <------>
>                  10s
> <----------------------------------->
>                            Example 2
> 
> Here at A, the task wakes, just like in Example 1, and the load is termed 0.
> Between A and B, if we consider the load to get updated at every scheduler
> tick, then the load slowly increases from 0 to 1024 at B. It is
> 1024 here, although this is also a 10% task, whereas in Example 1 the load
> is 102.4 for a 10% task. So what is fishy?
> 
> In my opinion, PJT's metric gives tasks some time to prove their
> activeness after they wake up. In Example 2 the task has stayed awake too
> long (1s), irrespective of what % of the total run time that is. Therefore
> it calculates the load to be big enough to balance.
> 
> In the example that you have quoted, the tasks may not have run long
> enough to be considered candidates for load balance.
> 
> So, essentially, what PJT's metric is doing is characterising a task by
> the amount it has run so far.
> 
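
The difference between the two examples can be reproduced with a small
model. It is only a sketch and assumes something the message above does
not spell out: that history is decayed geometrically per ~1ms period, with
a contribution halving after 32 periods (the decay used by per-entity load
tracking). The name tracked_load and the boolean-per-period input are
hypothetical.

```python
Y = 0.5 ** (1 / 32)   # assumed per-period decay: halves after 32 periods
SCALE = 1024          # load of a task that is always runnable


def tracked_load(pattern):
    """pattern: one boolean per ~1ms period, oldest first (True = runnable).

    Returns the decayed runnable fraction, scaled to 0..1024, as seen
    at the end of the pattern.
    """
    runnable_sum = 0.0
    period_sum = 0.0
    for runnable in pattern:
        # Age the history by one period, then add the newest period.
        runnable_sum = runnable_sum * Y + (1.0 if runnable else 0.0)
        period_sum = period_sum * Y + 1.0
    return SCALE * runnable_sum / period_sum if period_sum else 0.0


# Example 1 shape: 9ms asleep, then 1ms runnable, sampled at B.
example1 = [False] * 9 + [True] * 1
# Example 2 shape: 9s asleep, then 1s runnable, sampled at B.
example2 = [False] * 9000 + [True] * 1000

if __name__ == "__main__":
    print(round(tracked_load(example1), 1))
    print(round(tracked_load(example2), 1))
```

With these assumptions the 1ms burst ends up near 10% of the scale, while
the 1s burst ends up close to the full 1024, because the 9s of sleep has
almost entirely decayed out of the history by the time B is reached.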
> 
>> That makes select_task_rq make a wrong decision about which group is
>> idlest.
>>
>> There are 3 kinds of solution that may help with this issue.
>>
>> a, set a nonzero minimum value for long-sleeping tasks. But that
>> seems unfair to other tasks that sleep just a short while.
>>
>> b, use the runnable load contribution only in load balance, and keep
>> using nr_running to judge the idlest group in select_task_rq_fair. But
>> that may cause a few more migrations in later load balancing.
>>
>> c, consider both runnable load and nr_running in the group: if, in the
>> searching domain, nr_running increases by a certain number, like
>> double the domain span, within a certain time, we take it that a burst
>> of forking/waking happened, and then just use nr_running as the idlest
>> group criterion.
>>
>> IMHO, I like the 3rd one a bit more. As to the certain time for judging
>> whether a burst happened: since we calculate the runnable avg at every
>> tick, if the increase in nr_running is beyond sd->span_weight within 2
>> ticks, it means a burst is happening. What's your opinion of this?
>>
>> Any comments are appreciated!
> 
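
Option (c) quoted above can be sketched roughly as follows. This is an
illustration only, not scheduler code: is_burst, idlest_group, the
per-tick sample list and the dictionary fields are all hypothetical, and
the threshold follows the "beyond sd->span_weight in 2 ticks" wording.

```python
def is_burst(nr_running_samples, span_weight, ticks=2):
    """nr_running_samples: nr_running of the domain at each tick, oldest first.

    A burst is assumed when nr_running grew by more than span_weight
    over the last `ticks` ticks.
    """
    if len(nr_running_samples) <= ticks:
        return False
    increase = nr_running_samples[-1] - nr_running_samples[-1 - ticks]
    return increase > span_weight


def idlest_group(groups, burst):
    """Pick the idlest group: by nr_running during a burst, else by load."""
    key = "nr_running" if burst else "runnable_load"
    return min(groups, key=lambda g: g[key])


if __name__ == "__main__":
    groups = [
        # Many freshly woken tasks whose tracked load is still zero.
        {"name": "A", "nr_running": 6, "runnable_load": 0},
        {"name": "B", "nr_running": 1, "runnable_load": 50},
    ]
    burst = is_burst([2, 3, 12], span_weight=8)
    print(burst, idlest_group(groups, burst)["name"])
```

In this made-up case, judging by runnable load alone would pick group A
even though it already holds six tasks; switching to nr_running while a
burst is detected picks group B instead.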
> 
> So PJT's metric rightly seems to be capturing the load of these bursty
> tasks, but you are right in pointing out that when too many such loads
> queue up on the cpu, this metric will consider the load on the cpu as
> 0, which might not be such a good idea.
> 
> It is true that we need to bring in nr_running somewhere. Let me now go
> through your suggestions on where to include nr_running and get back on
> this. I had planned on including nr_running while selecting the busy
> group in update_sd_lb_stats, but select_task_rq_fair is yet another place
> to do this, that's right. Good that this issue was brought up :)

Do you have details on enabling this in update_sd_lb_stats? As I picture
it, in load balance we can let time smooth out the load variation.
> 
>> Regards!
>> Alex
>>>
>>>
>>
> 
> Regards
> Preeti U Murthy
> 
