Message-ID: <CAKfTPtAv7vPGYAwUSmGL5wtbY=if8G+3geWMKpHu3vLGqthPfg@mail.gmail.com>
Date:   Wed, 27 Oct 2021 10:49:35 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tim Chen <tim.c.chen@...ux.intel.com>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort

On Tue, 26 Oct 2021 at 19:25, Tim Chen <tim.c.chen@...ux.intel.com> wrote:
>
> On Tue, 2021-10-19 at 14:35 +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted to
> > the 1st sched_domain level. This time can be significant and can move
> > the cost of newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the
> > field has not been updated for a while. Recent updates will not be
> > decayed immediately but only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost
> > has been removed, as the 500us value is quite large and prevents
> > opportunities to pull a task onto the newly idle CPU for at least
> > the 1st domain levels.
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 THX2 system
> > (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> >        min    avg   max
> > SMT:   1us   33us  273us - this one includes the update of blocked
> > load
> > MC:    7us   49us  398us
> > NUMA: 10us   45us  158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group :
> > group      tip/sched/core     + this patchset
> > 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> > 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> > 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> > 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> > 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> > 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> > 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> >
> > tbench and reaim have not shown any difference
> >
>
> Vincent,
>
> Our benchmark team tested the patches for our OLTP benchmark
> on a 2 socket Cascade Lake with 28 cores/socket.  It is a smaller
> configuration than the 2 socket Ice Lake we have tested previously,
> which has 40 cores/socket, so the overhead of update_blocked_averages
> is smaller (~4%).
>
> Here's a summary of the results:
>                                         Relative Performance
>                                         (higher better)
> 5.15 rc4 vanilla (cgroup disabled)      100%
> 5.15 rc4 vanilla (cgroup enabled)       96%
> patch v2                                96%
> patch v3                                96%
>
> We didn't see much change in performance from the patch set.

Thanks for testing.
I was not expecting this patchset to fix all your problems, but at
least those that were highlighted during previous exchanges.

A few problems still remain in your case, if I'm not wrong:
There is a patch that ensures that rq->next_balance is never set in the past.
IIRC, you also mentioned in a previous thread that the idle LB
happening during the tick just after the newidle balance was a
problem in your case, which is probably linked to your figures below.

>
> Looking at the profile of update_blocked_averages a bit more,
> the majority of the calls to update_blocked_averages
> happen in run_rebalance_domains.  And we are not
> including the cost of update_blocked_averages for
> run_rebalance_domains in our current patch set.  I think
> the patch set should account for that too.

nohz_newidle_balance() keeps using sysctl_sched_migration_cost to
trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
It would probably be better to take the cost of
update_blocked_averages into account instead.

>
>
>       0.60%     0.00%             3  [kernel.vmlinux]    [k] run_rebalance_domains                                                                                                                                                  -      -
>             |
>              --0.59%--run_rebalance_domains
>                        |
>                         --0.57%--update_blocked_averages
>
> Thanks.
>
> Tim
>
