Message-ID: <cda9bff0-1fcb-4736-93e7-19659cac9a01@arm.com>
Date: Thu, 9 Nov 2023 16:05:29 +0000
From: Hongyan Xia <hongyan.xia2@....com>
To: Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>
Cc: Qais Yousef <qyousef@...alina.io>,
Morten Rasmussen <morten.rasmussen@....com>,
Lukasz Luba <lukasz.luba@....com>,
Christian Loehle <christian.loehle@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/6] sched/uclamp: Track uclamped util_avg in
sched_avg
On 31/10/2023 15:52, Dietmar Eggemann wrote:
> On 04/10/2023 11:04, Hongyan Xia wrote:
>> From: Hongyan Xia <hongyan.xia2@....com>
>
> [...]
>
>> @@ -6445,6 +6450,21 @@ static int sched_idle_cpu(int cpu)
>> }
>> #endif
>>
>> +void ___update_util_avg_uclamp(struct sched_avg *avg, struct sched_entity *se);
>
> IMHO, `struct sched_avg *avg` can only be the one of a cfs_rq. So
> passing a cfs_rq would eliminate the question whether this can be from
> another se.
At the moment, yes, the avg can only come from a cfs_rq. The reason I
kept sched_avg is that once we have sum aggregation for RT tasks, we
will very likely end up calling this function on rt_rq->avg as well, so
taking just a sched_avg here will also work for RT in the future.
>> +static void update_se_chain(struct task_struct *p)
>> +{
>> +#ifdef CONFIG_UCLAMP_TASK
>> + struct sched_entity *se = &p->se;
>> + struct rq *rq = task_rq(p);
>> +
>> + for_each_sched_entity(se) {
>> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +
>> + ___update_util_avg_uclamp(&cfs_rq->avg, se);
>> + }
>> +#endif
>> +}
>> /*
>> * The enqueue_task method is called before nr_running is
>> * increased. Here we update the fair scheduling stats and
>> @@ -6511,6 +6531,16 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> goto enqueue_throttle;
>> }
>>
>> + /*
>> + * Re-evaluate the se hierarchy now that on_rq is true. This is
>> + * important to enforce uclamp the moment a task with uclamp is
>> + * enqueued, rather than waiting a timer tick for uclamp to kick in.
>> + *
>> + * XXX: This duplicates some of the work already done in the above for
>> + * loops.
>> + */
>> + update_se_chain(p);
>
> I try to figure out why this is necessary here:
>
> enqueue_task_fair()
> for_each_sched_entity()
> enqueue_entity()
> update_load_avg()
> __update_load_avg_se()
> ___update_util_avg_uclamp() <-- if se->on_rq,
> update p (se) if we
> cross PELT period (1)
> boundaries
> ___decay_util_avg_uclamp_towards() <-- decay p (se) (2)
>
> enqueue_util_avg_uclamp() <-- enqueue p into cfs_rq (3)
>
> se->on_rq = 1 <-- set p (se) on_rq (4)
>
> for_each_sched_entity()
> update_load_avg() <-- update all on_rq se's
> in the hierarchy (5)
>
> update_se_chain() <-- update p (se) and its
> se hierarchy (6)
>
> (1) Skip p since it is !se->on_rq.
>
> (2) Decay p->se->avg.util_avg_uclamp to 0 since it was sleeping.
>
> (3) Attach p to its cfs_rq
>
> ...
>
> (6) Update all se's and cfs_rq's involved in p's task_group
> hierarchy (including propagation from se (level=x+1) to cfs_rq
> (level=x))
>
> Question for me is why can't you integrate the util_avg_uclamp signals
> for se's and cfs_rq's/rq's much closer into existing PELT functions?
So the problem is that when we enqueue the task (say with a UCLAMP_MIN
of 200), se->on_rq is still false at that exact moment, so we only
decay the signal and do not enforce UCLAMP_MIN. Further up the
hierarchy we do call update_load_avg(), but the task's se has already
been processed, so UCLAMP_MIN has not taken effect. To make sure
UCLAMP_MIN is effective immediately, I simply re-evaluate the whole
hierarchy from bottom to top.
I probably didn't quite catch what you said here. Could you elaborate a
bit on what 'much closer' means?
Hongyan