linux-kernel - Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtAu33AN6=X82T=yOgm40S8OXi+sPcF0QyD-bYRPV=xPEg@mail.gmail.com>
Date:   Tue, 18 Jul 2023 18:01:51 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Nitin Tekchandani <nitin.tekchandani@...el.com>,
        Yu Chen <yu.c.chen@...el.com>,
        Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for
 cfs_rq's removed load

On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@...el.com> wrote:
>
> When a workload involves many wake time task migrations, tg->load_avg
> can be heavily contended among CPUs because every migration involves
> removing the task's load from its src cfs_rq and attach that load to
> its new cfs_rq. Both the remove and attach requires an update to
> tg->load_avg as well as propagating the change up the hierarchy.
>
> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> migration number is 11millions. Since the workload can trigger many
> wakeups and migrations, the access(both read and write) to tg->load_avg
> can be unbound. For the above said workload, the profile shows
> update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
> netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
> migration number is 14millions; update_cfs_group() costs ~25% and
> update_load_avg() costs ~16%.
>
> This patch is an attempt to reduce the cost of accessing tg->load_avg.

here you mention tg->load_avg which is updated with update_tg_load_avg()

but your patch below modifies the local update of cfs's util_avg,
runnable_avg  and load_avg which need to be maintained up to date

You should be better to delay or rate limit the call to
update_tg_load_avg() or calc_group_shares()/update_cfs_group() which
access tg->load_avg and are the culprit instead of modifying other
place.

Have you tried to remove update_cfs_group() from enqueue/dequeue and
only let the tick update the share periodically ?

Have you tried to make update_tg_load_avg() return early ? the change
of cfs' load_avg is written in the tg->load_avg only when the change
is bigger than 1.5%. maybe increase it to 6% ?

Or like for update_cfs_group, remove it from attach/detach entity and
let the periodic update to propagate the change

But please don't touch local update of *_avg

>
> Current logic will immediately do a update_tg_load_avg() if cfs_rq has
> removed load; this patch changes this behavior: if this cfs_rq has
> removed load as discovered by update_cfs_rq_load_avg(), it didn't call
> update_tg_load_avg() or propagate the removed load immediately, instead,
> the update to tg->load_avg and propagated load can be dealed with by a
> following event like task attached to this cfs_rq or in
> update_blocked_averages(). This way, the call to update_tg_load_avg()
> for this cfs_rq and its ancestors can be reduced by about half.
>
> ================================================
> postgres_sysbench(transaction, higher is better)
> nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
> results that have a measuable difference are:
>
> nr_thread=100% on SPR:
> base:   90569.11±1.15%
> node:  104152.26±0.34%  +15.0%
> delay: 127309.46±4.25%  +40.6%
>
> nr_thread=75% on SPR:
> base:  100803.96±0.57%
> node:  107333.58±0.44%   +6.5%
> delay: 124332.39±0.51%  +23.3%
>
> nr_thread=75% on ICL:
> base:  61961.26±0.41%
> node:  61585.45±0.50%
> delay: 72420.52±0.14%  +16.9%
>
> =======================================================================
> hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
> group=1/4/8/16 were tested on 2 sockets SPR and Cascade lake and the
> results that have a measuable difference are:
>
> group=8 on SPR:
> base:  437163±2.6%
> node:  471203±1.2%   +7.8%
> delay: 490780±0.9%  +12.3%
>
> group=16 on SPR:
> base:  468279±1.9%
> node:  580385±1.7%  +23.9%
> delay: 664422±0.2%  +41.9%
>
> ================================================
> netperf/TCP_STRAM (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have a measuable difference are:
>
> nr_thread=50% on CSL:
> base:  16258±0.7%
> node:  16172±2.9%
> delay: 17729±0.7%  +9.0%
>
> nr_thread=75% on CSL:
> base:  12923±1.2%
> node:  13011±2.2%
> delay: 15452±1.6%  +19.6%
>
> nr_thread=75% on SPR:
> base:  16232±11.9%
> node:  13962±5.1%
> delay: 21089±0.8%  +29.9%
>
> nr_thread=100% on SPR:
> base:  13220±0.6%
> node:  13113±0.0%
> delay: 18258±11.3% +38.1%
>
> =============================================
> netperf/UDP_RR (throughput, higher is better)
> nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
> Lake and results that have measuable difference are:
>
> nr_thread=1 on CSL:
> base:  128521±0.5%
> node:  127935±0.6%
> delay: 126317±0.4%  -1.7%
>
> nr_thread=75% on CSL:
> base:  36701±1.7%
> node:  39949±1.4%   +8.8%
> delay: 42516±0.3%  +15.8%
>
> nr_thread=75% on SPR:
> base:  14249±3.8%
> node:  19890±2.0%   +39.6%
> delay: 31331±0.5%  +119.9%
>
> nr_thread=100% on CSL:
> base:  52275±0.6%
> node:  53827±0.4%   +3.0%
> delay: 78386±0.7%  +49.9%
>
> nr_thread=100% on SPR:
> base:   9560±1.6%
> node:  14186±3.9%   +48.4%
> delay: 20779±2.8%  +117.4%
>
> Signed-off-by: Aaron Lu <aaron.lu@...el.com>
> ---
>  kernel/sched/fair.c  | 23 ++++++++++++++++++-----
>  kernel/sched/sched.h |  1 +
>  2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aceb8f5922cb..564ffe3e59c1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         if (child_cfs_rq_on_list(cfs_rq))
>                 return false;
>
> +       if (cfs_rq->prop_removed_sum)
> +               return false;
> +
>         return true;
>  }
>
> @@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>  {
>         cfs_rq->propagate = 1;
>         cfs_rq->prop_runnable_sum += runnable_sum;
> +
> +       if (cfs_rq->prop_removed_sum) {
> +               cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
> +               cfs_rq->prop_removed_sum = 0;
> +       }
>  }
>
>  /* Update task and its cfs_rq load average */
> @@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  * removed_runnable is the unweighted version of removed_load so we
>                  * can use it to estimate removed_load_sum.
>                  */
> -               add_tg_cfs_propagate(cfs_rq,
> -                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> -
> -               decayed = 1;
> +               cfs_rq->prop_removed_sum +=
> +                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
>         }
>
> -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
>         u64_u32_store_copy(sa->last_update_time,
>                            cfs_rq->last_update_time_copy,
>                            sa->last_update_time);
> @@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>
>                         if (cfs_rq == &rq->cfs)
>                                 decayed = true;
> +
> +                       /*
> +                        * If the aggregated removed_sum hasn't been taken care of,
> +                        * deal with it now before this cfs_rq is removed from the list.
> +                        */
> +                       if (cfs_rq->prop_removed_sum)
> +                               add_tg_cfs_propagate(cfs_rq, 0);
>                 }
>
>                 /* Propagate pending load changes to the parent, if any: */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9cece2dbc95b..ab540b21d071 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -619,6 +619,7 @@ struct cfs_rq {
>         unsigned long           tg_load_avg_contrib;
>         long                    propagate;
>         long                    prop_runnable_sum;
> +       long                    prop_removed_sum;
>
>         /*
>          *   h_load = weight * f(tg)
> --
> 2.41.0
>