linux-kernel - Re: [RFC PATCH] sched: fair: reset task_group.load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtD85OSem=7RMquLWokVp7gffvDaY3mtwevkxp1mSSVVqQ@mail.gmail.com>
Date: Fri, 15 Dec 2023 10:11:54 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Imran Khan <imran.f.khan@...cle.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are
 no running tasks.

On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@...cle.com> wrote:
>
> It has been found that sometimes a task_group has some residual
> load_avg even though the load average at each of its owned queues
> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> tg_load_avg_contrib have become 0 for a long time.
> Under this scenario if another task starts running in this task_group,
> it does not get proper time share on CPU since pre-existing
> load average of task group inversely impacts the new task's CPU share
> on each CPU.
>
> This change looks for the condition when a task_group has no running
> tasks and sets the task_group's load average to 0 in such cases, so
> that tasks that run in future under this task_group get the CPU time
> in accordance with the current load.
>
> Signed-off-by: Imran Khan <imran.f.khan@...cle.com>
> ---
>

[...]

>
> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> host side.

Could it be the root cause of your problem ?

The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
then unplugged,  have not been correctly removed from tg->load_avg. If
the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
tg->load_avg should be 0 too.

Could you track that the cfs_rq->tg_load_avg_contrib is correctly
removed from tg->load_avg when you unplug the CPUs ? I can easily
imagine that the rate limit can skip some update of tg- >load_avg
while offlining the cpu

> 5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
> and one instance of  CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
> /sys/fs/cgroup/test_group_2.
> It is seen that test_group_1 (the one where systemd-udevd was moved) is getting
> much less CPU than the test_group_2, even though at this point of time both of
> these groups have only CPU hogger running.
>

[...]

> Apologies for this long description of the issue, but I have found only one way
> of reproducing this issue and using this way issue can be reproduced with 100%
> strike rate, so I thought of providing as much info as possible.
>
> I have kept the change as RFC as I was not sure if the load_avg should be
> dragged down to 0 in one go (as being done in this change) or should it be
> pulled back in steps like making it half of the previous value each time.
>
>
>  kernel/sched/fair.c  | 17 +++++++++++++++++
>  kernel/sched/sched.h |  1 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..c34d9e7abb96 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3572,6 +3572,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
>                 account_numa_enqueue(rq, task_of(se));
>                 list_add(&se->group_node, &rq->cfs_tasks);
> +               atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);

We have rate limited the access to the atomic tg->load_avg because it
was impacting significantly the performance.  We don't want to add a
new one at every enqueue/dequeue


>         }
>  #endif
>         cfs_rq->nr_running++;
> @@ -3587,6 +3588,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         if (entity_is_task(se)) {
>                 account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>                 list_del_init(&se->group_node);
> +               atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
>         }
>  #endif
>         cfs_rq->nr_running--;
> @@ -4104,6 +4106,20 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>         if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
>                 return;
>
> +       /*
> +        * If number of running tasks for a task_group is 0, pull back its
> +        * load_avg to 0 as well.
> +        * This avoids the situation where a freshly forked task in a cgroup,
> +        * with some residual load_avg but with no running tasks, gets less
> +        * CPU time because of pre-existing load_avg of task_group.
> +        */
> +       if (!atomic_long_read(&cfs_rq->tg->cfs_nr_running)
> +           && atomic_long_read(&cfs_rq->tg->load_avg)) {
> +               atomic_long_set(&cfs_rq->tg->load_avg, 0);
> +               cfs_rq->last_update_tg_load_avg = now;
> +               return;
> +       }
> +
>         delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>         if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
> @@ -12808,6 +12824,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>                 goto err;
>
>         tg->shares = NICE_0_LOAD;
> +       atomic_long_set(&tg->cfs_nr_running, 0);
>
>         init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e5a95486a42..c3390bdb8400 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -370,6 +370,7 @@ struct task_group {
>          */
>         atomic_long_t           load_avg ____cacheline_aligned;
>  #endif
> +       atomic_long_t           cfs_nr_running ____cacheline_aligned;
>  #endif
>
>  #ifdef CONFIG_RT_GROUP_SCHED
> --
> 2.34.1
>