linux-kernel - Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <37c7fe49-1d1f-2930-c90f-88645db398ce@arm.com>
Date:   Mon, 9 Oct 2017 13:15:04 +0100
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...nel.org, linux-kernel@...r.kernel.org, tj@...nel.org,
        josef@...icpanda.com, torvalds@...ux-foundation.org,
        vincent.guittot@...aro.org, efault@....de, pjt@...gle.com,
        clm@...com, morten.rasmussen@....com, bsegall@...gle.com,
        yuyang.du@...el.com
Subject: Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq
 and its se

On 06/10/17 14:02, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 08:27:01PM +0100, Dietmar Eggemann wrote:
>> On 01/09/17 14:21, Peter Zijlstra wrote:
>>> The PELT _sum values are a saw-tooth function, dropping on the decay
>>> edge and then growing back up again during the window.
>>>
>>> When these window-edges are not aligned between cfs_rq and se, we can
>>> have the situation where, for example, on dequeue, the se decays
>>> first.
>>>
>>> Its _sum values will be small(er), while the cfs_rq _sum values will
>>> still be on their way up. Because of this, the subtraction:
>>> cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
>>> will then, once the cfs_rq reaches an edge, translate into its _avg
>>> value jumping up.
>>>
>>> This is especially visible with the runnable_load bits, since they get
>>> added/subtracted a lot.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>> ---
>>>  kernel/sched/fair.c |   45 +++++++++++++++++++++++++++++++--------------
>>>  1 file changed, 31 insertions(+), 14 deletions(-)
>>
>> [...]
>>
>>> @@ -3644,7 +3634,34 @@ update_cfs_rq_load_avg(u64 now, struct c
>>>   */
>>>  static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>  {
>>> +	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>>> +
>>> +	/*
>>> +	 * When we attach the @se to the @cfs_rq, we must align the decay
>>> +	 * window because without that, really weird and wonderful things can
>>> +	 * happen.
>>> +	 *
>>> +	 * XXX illustrate
>>> +	 */
>>>  	se->avg.last_update_time = cfs_rq->avg.last_update_time;
>>> +	se->avg.period_contrib = cfs_rq->avg.period_contrib;
>>> +
>>> +	/*
>>> +	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
>>> +	 * period_contrib. This isn't strictly correct, but since we're
>>> +	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
>>> +	 * _sum a little.
>>> +	 */
>>> +	se->avg.util_sum = se->avg.util_avg * divider;
>>> +
>>> +	se->avg.load_sum = divider;
>>> +	if (se_weight(se)) {
>>> +		se->avg.load_sum =
>>> +			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
>>> +	}
>>
>> Can scale_load_down(se->load.weight) ever become 0 here?
> 
> Yeah, don't see why not.

Tasks should be safe since for them scale_load_down(se->load.weight)
should be >= 15 when they get attached.

task groups get attached via attach_entity_cfs_rq()

# mkdir /sys/fs/cgroup/foo

[   63.885333] [<ffff00000811a558>] attach_entity_load_avg+0x138/0x558
[   63.885345] [<ffff0000081208e0>] attach_entity_cfs_rq+0x298/0x998
[   63.885357] [<ffff00000812d660>] online_fair_sched_group+0x70/0xb0
[   63.885369] [<ffff0000081182d4>] sched_online_group+0x94/0xb0
[   63.885381] [<ffff000008118318>] cpu_cgroup_css_online+0x28/0x38
[   63.885393] [<ffff0000081a5e80>] online_css+0x38/0xd0
[   63.885406] [<ffff0000081acae0>] cgroup_apply_control_enable+0x260
[   63.885418] [<ffff0000081afc54>] cgroup_mkdir+0x314/0x4e8

mkdir-2501  [004]    63.689455: bprint:
attach_entity_load_avg: cpu=1 se=0xffff800072fffc00 load.weight=1048576

If I apply the smallest shares possible to that task group, it is
already attched.

# echo 2 > /sys/fs/cgroup/foo/cpu.shares

create_cpu_cgro-2500  [003]    64.591878: bprint:
update_cfs_group: cpu=1 se=0xffff800072fffc00 tg->shares=2048 shares=0
load=0 tg_weight=0
 create_cpu_cgro-2500  [003]    64.591904: bprint:
reweight_entity: cpu=1 se=0xffff800072fffc00 se->load.weight=2

I can't see right now how a task group can get attached with
se->load.weight < 1024 (32 bit)/1048576 (64 bit)? Do I miss something?