Date:   Thu, 28 Jan 2021 10:09:17 -0500
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Paul McKenney <paulmck@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Qais Yousef <qais.yousef@....com>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        "Uladzislau Rezki (Sony)" <urezki@...il.com>,
        Neeraj Upadhyay <neeraj.iitr10@...il.com>
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages()
 for NOHZ

Hi Vincent,

On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
<vincent.guittot@...aro.org> wrote:
> > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@...lfernandes.org> wrote:
> > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > <joel@...lfernandes.org> wrote:
> > > > > >
> > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > the schedule loop to slow down, taking up to 500 microseconds at
> > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > the periodic balancer.
> > > > > >
> > > > > > Closer look shows that the problem is caused by the following
> > > > > > ingredients:
> > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > __update_blocked_fair() take a long time.
> > > > >
> > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > the duration.
> > > >
> > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > decayed and not have any tasks running on it, but its load still needs
> > > > to be decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > 'blocked'.
> > >
> > > How many blocked cgroups have you got?
> >
> > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > this loop runs per new idle balance. When the problem happens I see this loop
> > run 35-40 times (for one single instance of newidle balance). So in total
> > there are at least that many cfs_rq load updates.
>
> Do you mean that you have 35-40 cgroups? Or does the 35-40 include all CPUs?

All CPUs.
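
(For reference, the instrumentation was along these lines -- a heavily
trimmed sketch, not the exact diff; the counter and the trace_printk()
are debug-only additions of mine around the existing leaf cfs_rq walk:

        struct cfs_rq *cfs_rq, *pos;
        int nr_walked = 0;      /* hypothetical debug counter */

        for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
                nr_walked++;    /* one blocked-load update per leaf cfs_rq */

                /* ... the usual PELT update of cfs_rq->avg goes here ... */

                /*
                 * There can be a lot of idle CPU cgroups.  Don't let fully
                 * decayed cfs_rqs linger on the list.
                 */
                if (cfs_rq_is_decayed(cfs_rq))
                        list_del_leaf_cfs_rq(cfs_rq);
        }
        trace_printk("leaf cfs_rqs walked: %d\n", nr_walked);

The 35-40 figure is the total of nr_walked over one newidle balance.)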

> > I also see that new idle balance can be called 200-500 times per second.
>
> This is not surprising because newidle_balance() is called every time
> the CPU is about to become idle.

Sure.

> > > >
> > > >                   * There can be a lot of idle CPU cgroups.  Don't let fully
> > > >                   * decayed cfs_rqs linger on the list.
> > > >                   */
> > > >                  if (cfs_rq_is_decayed(cfs_rq))
> > > >                          list_del_leaf_cfs_rq(cfs_rq);
> > > >
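
(Side note for the archive: cfs_rq_is_decayed() is just a check that all
of the cfs_rq's PELT sums have decayed to zero -- roughly this, in the
v5.4-era code:

        static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
        {
                if (cfs_rq->load.weight)
                        return false;

                if (cfs_rq->avg.load_sum)
                        return false;

                if (cfs_rq->avg.util_sum)
                        return false;

                if (cfs_rq->avg.runnable_load_sum)
                        return false;

                return true;
        }

so a cfs_rq with no tasks stays on the leaf list until every sum has
fully decayed, and each of those visits costs a load update.)
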
> > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > >
> > > > > What do you mean exactly by "it causes schedutil to be slower than usual"?
> > > >
> > > > sugov_next_freq_shared() is O(number of CPUs in a cluster). This
> > > > system is a 6+2 system with 6 CPUs in one cluster. A schedutil shared-policy
> > > > frequency update needs to go through the utilization of the other CPUs in
> > > > the cluster. I believe this could be adding to the problem, but it does not
> > > > really need to be optimized if we can rate limit the calls to
> > > > update_blocked_averages() to begin with.
> > >
> > > Qais mentioned half of the time being used by
> > > sugov_next_freq_shared(). Are there any frequency changes resulting
> > > from this call?
> >
> > I do not see a frequency update happening at the time of the problem. However,
> > note that sugov_iowait_boost() does run even if the frequency is not being
> > updated. IIRC, this function is also not that lightweight, and I am not sure
> > if it is a good idea to call it that often.
>
> The scheduler can't make any assumptions about how often schedutil/cpufreq
> wants to be called. Some governors are fast and straightforward and can be
> called very often to adjust frequency; others can't handle frequent
> updates. The rate-limit mechanism in schedutil and io-boost is
> there for that purpose.

Sure, I know that's the intention.
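
(To make the O(nr_cpus) point concrete: the shared-policy path walks
every CPU in the frequency domain on each update -- condensed from
kernel/sched/cpufreq_schedutil.c, approximately:

        static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
        {
                struct sugov_policy *sg_policy = sg_cpu->sg_policy;
                struct cpufreq_policy *policy = sg_policy->policy;
                unsigned long util = 0, max = 1;
                unsigned int j;

                /*
                 * One pass over every CPU sharing the frequency domain:
                 * six iterations per update on the 6-CPU cluster here.
                 */
                for_each_cpu(j, policy->cpus) {
                        struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
                        unsigned long j_util = sugov_get_util(j_sg_cpu);
                        unsigned long j_max = j_sg_cpu->max;

                        sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);

                        if (j_util * max > j_max * util) {
                                util = j_util;
                                max = j_max;
                        }
                }

                return get_next_freq(sg_policy, util, max);
        }

So even when no frequency change results, every blocked-load update that
reaches schedutil pays for this walk.)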

> > > > > > average updates also try to update the frequency in schedutil).
> > > > > >
> > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > >
> > > > > Low frequency usually means low utilization, so it should not happen that much.
> > > >
> > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > >
> > > Happening a lot is not in itself a problem if there is nothing else to
> > > do, so it's not an argument by itself.
> >
> > It is a problem: it shows up in the preempt-off critical section latency
>
> But this point is not related to the point above, which is about how
> often it happens.

It is related. A bad thing happening quite often is worse than a bad
thing happening infrequently. I agree it is not a root cause fix, but
it makes things "better".

> > tracer. Are you saying it's OK for preemption to be disabled on the system
> > for 500 microseconds? That hurts real-time applications (audio etc.).
>
> So: is your problem related to real-time applications (audio etc.)?

Yes, it is. I don't have a reproducer, and it could be that audio
buffering hides the problem. But that doesn't mean there is
no problem or that we cannot improve things. There will also likely be
a power consumption improvement.

> > > So why is it a problem for you? You are mentioning newly idle load
> > > balance, so I assume that your root problem is the scheduling delay
> > > generated by the newly idle load balance, which then calls
> > > update_blocked_averages().
> >
> > Yes, the new idle balance is when I see it happen quite often. I do see it
> > happen with other load balances as well, but not as often, since those
> > don't run as frequently as the new idle balance.
>
> The update of the average blocked load was added to newidle_balance()
> to take advantage of the CPU becoming idle, but it seems to create a
> long preempt-off sequence. I'm going to prepare a patch to move it
> out of the schedule path.

Ok thanks, that would really help!

> > > Rate limiting the calls to update_blocked_averages() will only reduce
> > > the number of times it happens, but it will not prevent it from happening.
> >
> > Sure, but a soft real-time workload can tolerate the issue if it does not
> > happen very often. In this case, though, it is frequent.
>
> Could you provide details of the problem that you are facing? It's
> not clear to me what happens in your case in the end. Do you have an
> audio glitch, for example?
>
> "Not often" doesn't really give any clue.

I believe I have provided details from my side: I shared output from the
function graph tracer and the schbench microbenchmark clearly showing the
issue and the improvement. Sorry, I don't have a real-world reproducer
for this.

> Also, update_blocked_averages() was supposed to be called in
> newidle_balance() only when the coming idle duration is expected to be
> long enough.

No, we do not want the schedule loop to take half a millisecond.

> > > IIUC, your real problem is that newidle_balance is running while a
> > > task is about to wake up on the CPU, and we don't abort quickly during
> > > this load_balance.
> >
> > Yes.
> >
> > > so we could also try to abort earlier in case of newly idle load balance
> >
> > I think interrupts are disabled when the load balance runs, so there's no way
> > for, say, an audio interrupt to even run in order to wake up a task. How would
> > you know to abort the new idle load balance?
> >
> > Could you elaborate more also on the drawback of the rate limiting patch we
> > posted? Do you see a side effect?
>
> Your patch just tries to hide your problem rather than solve the root cause.

Agreed, the solution presented is a band-aid and not a proper fix. It
was just intended to illustrate the problem and start a discussion. I
should have marked it RFC for sure.
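
(For the archives, the band-aid amounted to a time-based guard in front
of the update -- conceptual sketch only, not the posted diff; the
->last_blocked_update field and the 20 ms threshold are made up here for
illustration:

        static void update_blocked_averages_ratelimited(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);

                /* Skip the walk if blocked load was updated recently. */
                if (time_before(jiffies, rq->last_blocked_update +
                                         msecs_to_jiffies(20)))
                        return;

                rq->last_blocked_update = jiffies;
                update_blocked_averages(cpu);
        }

The guard bounds how often the leaf cfs_rq walk can run, which is why it
reduces the frequency of the long preempt-off sections without removing
them.)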

thanks!

- Joel
