linux-kernel - Re: [PATCH] sched: fix unfairness when upgrade weight

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <486D8D48.6050701@cn.fujitsu.com>
Date:	Fri, 04 Jul 2008 10:39:04 +0800
From:	Lai Jiangshan <laijs@...fujitsu.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	jha@...il.unc.edu, Ingo Molnar <mingo@...e.hu>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"James H. Anderson" <anderson@...unc.edu>, jeffay@...unc.edu
Subject: Re: [PATCH] sched: fix unfairness when upgrade weight

Peter Zijlstra wrote:
> Hi Lai,
> 
> On Mon, 2008-06-30 at 14:27 +0800, Lai Jiangshan wrote:
>> When two or more process upgrade their priority,
>> unfairness will happen, several of them may get all cpu-usage,
>> and the other cannot be scheduled to run for a long time.
>>
>> example:
>> # (create 2 processes and set affinity to cpu#0)
>> # renice 19 pid1 pid2
>> # renice -19 pid1 pid2
>>
>> step3 upgrade the 2 processes' weight, these 2 processes should
>> share the cpu#0 as soon as possible after step3, and any of them
>> should get 50% cpu-usage. But sometimes one of them gets all cpu-usage
>> for tens of seconds before they share the cpu#0.
>>
>> fair-group example:
>> # mkdir 1 2 (create 2 fair-groups)
>> # (create 2 processes and set affinity to cpu#0)
>> # echo pid1 > 1/tasks ; echo pid2 > 2/tasks
>> # echo 2 > 1/cpu.shares ; echo 2 > 2/cpu.shares
>> # echo $((2**18)) > 1/cpu.shares ; echo $((2**18)) > 2/cpu.shares
>>
>> The reason why such unfairness happened:
>>
>> When a sched_entity is running, if its weight is low, its vruntime
>> increases by a large value every time and if its weight
>> is high, its vruntime increases by a small value.
>>
>> So when the two sched_entity's weight is low, they will still 
>> fairness even if difference of their vruntime is large, but if
>> their weight are upgraded, this large difference of vruntime
>> will bring unfairness.
>>
>> example:
>> se1's vruntime         se2's vruntime
>>     1000M               (R) 1020M
>> 	(assume vruntime is increases by about 50M every time)
>> (R) 1050M                   1020M 
>>     1050M               (R) 1070M
>> (R) 1100M                   1070M
>>     1100M               (R) 1120M 
>> 	(fairness, even if difference of their vruntime is large)
>> 	(upgrade their weight, vruntime is increases by about 10K)
>> (R) 1100M+10K               1120M
>> (R) 1100M+20K               1120M
>> (R) 1100M+30K               1120M
>> (R) 1100M+40K               1120M
>> (R) 1100M+50K               1120M
>> 	(se1 gets all cpu-usage for long time (mybe about tens
>> of seconds))
>> 	(unfairness, difference=20M is too large for new weight)
> 
> My initial response to this email was: sure, that's because you cannot
> renice two tasks atomically - we'll just have to live with that.
> 
> However after a bit more thought it occurred to me this is because we're
> changing the weight of a task with non-zero lag.
> 


IMO, that's because the next runtime of a se is *predetermined*(use
se->vruntime to determine its next runtime).

So the solution of this problem is that the next runtime must be
redetermined when weight is changed.

And my solution use MIN(cfs_rq->min_vruntime, 
cfs_rq->curr->vruntime, __pick_next_entity(cfs_rq)->vruntime)
as "current vruntime", and I suppose that pre_delta_exec < TICK_NSEC
(I suppose that local irq is disabled too long is a rare phenomena)

(And I suppose the value wakeup_gran is very small value too,
"current vruntime" is not so accurate)

But my solution don't redetermined se's next runtime when weight
is degraded. It's a big weakness.


> I think the proper solution to this problem is to scale the lag
> according to the change in weights. But lets ask James, who is an expert
> in this area.
> 
> 
> So while I think you're right in that we have an issue, I don't like
> your solution.
> 
> How about something like these patches (compile tested only).
> 


How your solution fix this:
se1's weight is biger than se2's weight at first

and then we upgrade se2's weight:
se1 = cfs_rq->curr(not in rbtree), se2 is *the only* node in the rbtree,
and se1's vruntime=1000M,  se2's vruntime = 1020M

But se2's vruntime still is 1020M after se2's weight is upgraded in your solution.

Unfairness still happen(difference=20M is too large for biger weight in).

Some times (cfs_rq->min_vruntime - cfq_rq->curr->vruntime) is a large
difference for huge weight.

>  
> +static void prio_changed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +		unsigned long old_weight, unsigned long new_weight)
> +{
> +	u64 avg;
> +	s64 lag;
> +
> +	if (old_weight == new_weight)
> +		return;
> +
> +	dequeue_entity(cfs_rq, se, 0);
> +
> +	avg = avg_vruntime(cfs_rq);
> +	lag = (s64)(se->vruntime - avg);
> +
> +	lag *= new_weight;

why  lag = lag * new_weight; ?

> +	lag = div_s64(lag, old_weight);
> +
> +	se->vruntime = avg + lag;

how about (s64)(se->vruntime - avg) > 0?
and how about (s64)(se->vruntime - avg) < 0?

> +
> +	enqueue_entity(cfs_rq, se, 0);
> +}
> +


----------
I suggest that combine your solution and mine:
use cfs_rq->curr_vruntime to track carefully the "current vruntime"
of this cfs_rq and:

static void prio_changed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
		unsigned long old_weight, unsigned long new_weight)
{
	u64 cfs_vruntime;
	u64 vdelta_exec;

	if (old_weight == new_weight)
		return;

	dequeue_entity(cfs_rq, se, 0);

	cfs_vruntime = cfs_rq->curr_vruntime;
	vdelta_exec = (s64)(se->vruntime - cfs_vruntime);

	if (likely(((s64)vdelta_exec) > 0)) {
		vdelta_exec *= old_weight;
		vdelta_exec = div_u64(vdelta_exec, new_weight);
	}

	se->vruntime = cfs_vruntime + vdelta_exec;

	enqueue_entity(cfs_rq, se, 0);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/