linux-kernel - Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8629CB9F-AFC8-43D6-BD14-B60A0B85ADB3@tencent.com>
Date:   Thu, 6 Aug 2020 15:52:02 +0000
From:   benbjiang(蒋彪) <benbjiang@...cent.com>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
CC:     Jiang Biao <benbjiang@...il.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "bsegall@...gle.com" <bsegall@...gle.com>,
        "mgorman@...e.de" <mgorman@...e.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched/fair: reduce preemption with IDLE tasks
 runable(Internet mail)

Hi, 

> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> 
> On 03/08/2020 13:26, benbjiang(蒋彪) wrote:
>> 
>> 
>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>> 
>>> On 01/08/2020 04:32, Jiang Biao wrote:
>>>> From: Jiang Biao <benbjiang@...cent.com>
>>>> 
>>>> No need to preempt when there are only one runable CFS task with
>>>> other IDLE tasks on runqueue. The only one CFS task would always
>>>> be picked in that case.
>>>> 
>>>> Signed-off-by: Jiang Biao <benbjiang@...cent.com>
>>>> ---
>>>> kernel/sched/fair.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>> 
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 04fa8dbcfa4d..8fb80636b010 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4527,7 +4527,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>> 		return;
>>>> #endif
>>>> 
>>>> -	if (cfs_rq->nr_running > 1)
>>>> +	if (cfs_rq->nr_running > cfs_rq.idle_h_nr_running + 1)
>>> 
>>> cfs_rq is a pointer.
>> It is. Sorry about that. :)
>> 
>>> 
>>>> 		check_preempt_tick(cfs_rq, curr);
>>>> }
>>> 
>>> You can't compare cfs_rq->nr_running with cfs_rq->idle_h_nr_running!
>>> 
>>> There is a difference between cfs_rq->h_nr_running and
>>> cfs_rq->nr_running. The '_h_' stands for hierarchical.
>>> 
>>> The former gives you hierarchical task accounting whereas the latter is
>>> the number of sched entities (representing tasks or taskgroups) enqueued
>>> in cfs_rq.
>>> 
>>> In entity_tick(), cfs_rq->nr_running has to be used for the condition to
>>> call check_preempt_tick(). We want to check if curr has to be preempted
>>> by __pick_first_entity(cfs_rq) on this cfs_rq.
>>> 
>>> entity_tick() is called for each sched entity (and so for each
>>> cfs_rq_of(se)) of the task group hierarchy (e.g. task p running in
>>> taskgroup /A/B : se(p) -> se(A/B) -> se(A)).
>> That’s true. I was thinking adding a new cfs_rq->idle_nr_running member to
>> track the per cfs_rq's IDLE task number, and reducing preemption here based
>> on that. 
> 
> How would you deal with se's representing taskgroups which contain
> SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that?
I’m not sure I get the point. :) How about the following patch,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fa8dbcfa4d..8715f03ed6d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
                list_add(&se->group_node, &rq->cfs_tasks);
        }
 #endif
+       if (task_has_idle_policy(task_of(se)))
+               cfs_rq->idle_nr_running++;
+
        cfs_rq->nr_running++;
 }

@@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
                list_del_init(&se->group_node);
        }
 #endif
+       if (task_has_idle_policy(task_of(se)))
+               cfs_rq->idle_nr_running--;
+
        cfs_rq->nr_running--;
 }

@@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
                return;
 #endif

-       if (cfs_rq->nr_running > 1)
+       if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
+           cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
                check_preempt_tick(cfs_rq, curr);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 877fb08eb1b0..401090393e09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -500,6 +500,7 @@ struct cfs_bandwidth { };
 struct cfs_rq {
        struct load_weight      load;
        unsigned int            nr_running;
+       unsigned int            idle_nr_running;
        unsigned int            h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
        unsigned int            idle_h_nr_running; /* SCHED_IDLE */
> 
>> I’m not sure if it’s ok to do that, because the IDLE class seems not to be so
>> pure that could tolerate starving.
> 
> Not sure I understand but idle_sched_class is not the same as SCHED_IDLE
> (policy)?
The case is that we need tasks(low priority, called offline tasks) to utilize the
spare cpu left by CFS SCHED_NORMAL tasks(called online tasks) without
interfering the online tasks. 
Offline tasks only run when there’s no runnable online tasks, and offline tasks
never preempt online tasks.
The SCHED_IDLE policy seems not to be abled to be qualified for that requirement,
because it has a weight(3), even though it’s small, but it can still preempt online
tasks considering the fairness. In that way, offline tasks of SCHED_IDLE policy
could interfere the online tasks.
On the other hand, idle_sched_class seems not to be qualified either. It’s too
simple and only used for per-cpu idle task currently.

Thx.
Regards,
Jiang 

> 
>> We need an absolutely low priority class that could tolerate starving, which
>> could be used to co-locate offline tasks. But IDLE class seems to be not
>> *low* enough, if considering the fairness of CFS, and IDLE class still has a
>> weight.
> 
> [...]
>