linux-kernel - Re: [PATCH] sched/fair: Rate limit calls to update_blocked

Message-ID: <CAKfTPtC157Z2vsnW3MLqKcMBYB-0D255rYr1Y-vD5xYDLBNoVQ@mail.gmail.com>
Date:   Fri, 25 Jun 2021 10:50:12 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tim Chen <tim.c.chen@...ux.intel.com>
Cc:     Qais Yousef <qais.yousef@....com>,
        Joel Fernandes <joel@...lfernandes.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Paul McKenney <paulmck@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Dietmar Eggeman <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        "Uladzislau Rezki (Sony)" <urezki@...il.com>,
        Neeraj upadhyay <neeraj.iitr10@...il.com>,
        Aubrey Li <aubrey.li@...ux.intel.com>
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages()
 for NOHZ

On Fri, 18 Jun 2021 at 18:14, Tim Chen <tim.c.chen@...ux.intel.com> wrote:
>
>
>
> On 6/18/21 3:28 AM, Vincent Guittot wrote:
>
> >>
> >> The current logic is that when a CPU becomes idle, next_balance occurs
> >> very shortly (usually in the next jiffy), as get_sd_balance_interval
> >> returns a next_balance in the next jiffy if the CPU is idle.  However,
> >> in reality, I saw that most CPUs are 95% busy on average for my workload
> >> and a task will wake up on an idle CPU shortly.  So frequent idle
> >> balancing towards briefly idle CPUs is counterproductive: it simply
> >> increases overhead and does not improve performance.
> >
> > Just to make sure that I understand your problem correctly, is it:
> > - that we have an ILB happening on the idle CPU and consuming cycles
>
> That's right.  The cycles are consumed heavily in update_blocked_averages()
> when cgroup is enabled.

But they are normally consumed on an idle CPU, and the ILB checks
need_resched() before running load balance for the next idle CPU.
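
To illustrate the point about need_resched(), here is a minimal
user-space sketch of an ILB-style walk that polls for a pending wakeup
before handling each idle CPU. It is only a model: struct sim_cpu,
sim_ilb_walk() and sim_wakeup_after() are hypothetical names, not kernel
APIs, and the callback merely stands in for need_resched().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct sim_cpu {
	bool idle;	/* CPU is idle, so the ILB may balance on its behalf */
	bool balanced;	/* set once the walk has handled this CPU */
};

/*
 * Walk all CPUs on behalf of which the ILB CPU balances. Before each
 * idle CPU, poll resched_pending() (a stand-in for need_resched()) and
 * abort the walk if work is pending on the ILB CPU itself.
 * Returns the number of CPUs handled before the walk ended.
 */
size_t sim_ilb_walk(struct sim_cpu *cpus, size_t ncpus,
		    bool (*resched_pending)(void *ctx), void *ctx)
{
	size_t done = 0;

	assert(cpus != NULL && resched_pending != NULL);

	for (size_t i = 0; i < ncpus; i++) {
		if (!cpus[i].idle)
			continue;	/* busy CPUs balance for themselves */
		if (resched_pending(ctx))
			break;		/* a task wants this CPU: stop here */
		cpus[i].balanced = true;
		done++;
	}
	return done;
}

/* Example predicate: report a pending wakeup once *ctx polls are used up. */
bool sim_wakeup_after(void *ctx)
{
	size_t *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

With four CPUs of which three are idle and a budget of two polls, the
walk handles two idle CPUs and stops before the third. Note that in this
model the check only happens between CPUs, so one long per-CPU step is
never interrupted.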

Does it mean that your problem is coming from update_blocked_averages()
spending a long time under rq_lock_irqsave() and increasing the wakeup
latency of your short-running tasks?

>
> > - or that the ILB will pull a task onto an idle CPU on which a task will
> > shortly wake up, which ends up with 2 tasks competing for the same CPU.
> >
>
> Because for the OLTP workload I'm looking at, we have tasks that sleep
> for a short while and wake again very shortly (i.e. the CPU actually
> is ~95% busy on average), pulling tasks to such a CPU is really not
> helpful to improve overall CPU utilization in the system.  So my
> intuition is that for such an almost fully busy CPU, we should defer load
> balancing to it (see prototype patch 3).

Note that this is the opposite of what you said earlier:
"
Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage
migrations to idle cpu and reduce latency.
"

But it will be quite hard to find a value that fits everybody's
requirements, and some will have use cases for which they want to pull
tasks even if the CPU is 95% busy. You can have 2ms of idle time while
having a utilization above 95%, and an ILB inside a core or at the LLC
level is somewhat cheap and would take advantage of those 2ms.
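
As a rough check of the 2ms figure, the sketch below treats utilization
as a plain busy/idle duty cycle. That is a simplifying assumption (the
scheduler's utilization signal is a decaying average, not a fixed
ratio), and busy_ms_per_cycle() and cycle_ms() are hypothetical helper
names used only for this illustration.

```c
#include <assert.h>

/*
 * Busy time per cycle, in ms, for a CPU that is idle for idle_ms per
 * cycle at a utilization of util_pct percent. Assumes a simple duty
 * cycle: idle_ms is (100 - util_pct)% of the whole cycle.
 */
unsigned int busy_ms_per_cycle(unsigned int idle_ms, unsigned int util_pct)
{
	assert(util_pct < 100);	/* at 100% there is no idle window at all */
	return idle_ms * util_pct / (100 - util_pct);
}

/* Total cycle length, in ms: the busy part plus the idle window. */
unsigned int cycle_ms(unsigned int idle_ms, unsigned int util_pct)
{
	return idle_ms + busy_ms_per_cycle(idle_ms, util_pct);
}
```

Under that assumption, busy_ms_per_cycle(2, 95) gives 38 and
cycle_ms(2, 95) gives 40: a CPU that is 95% busy can still be idle for
2ms at a time if it runs for 38ms in between.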

>
> Tim
>
>
>
>