Message-ID: <CAKfTPtAQuX5ZbzOH_LnFbBRWErP9pcnAVMvVE9qQw1LXouwzog@mail.gmail.com>
Date: Mon, 11 Jan 2021 15:36:57 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Mel Gorman <mgorman@...hsingularity.net>
Cc: Peter Zijlstra <peterz@...radead.org>,
"Li, Aubrey" <aubrey.li@...ux.intel.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Valentin Schneider <valentin.schneider@....com>,
Qais Yousef <qais.yousef@....com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Jiang Biao <benbjiang@...il.com>
Subject: Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting
On Fri, 8 Jan 2021 at 17:14, Mel Gorman <mgorman@...hsingularity.net> wrote:
>
> On Fri, Jan 08, 2021 at 04:10:51PM +0100, Vincent Guittot wrote:
> > > > Trying to bias the avg_scan_cost with: loops <<= 2;
> > > > will just make avg_scan_cost lost any kind of meaning because it
> > > > doesn't reflect the avg cost of scanning a rq anymore
> > > >
> > >
> > > Before the series, the avg_scan_cost also did not represent the cost of
> > > scanning a RQ before either. Treating scan failures and successes equally
> >
> > I agree that the previous avg_scan_cost was not representing a RQ
> > because it was the avg cost of scanning the full domain.
>
> It was not even that. As the full domain was not necessarily scanned at
> all, it simply reflected how much time was spent scanning in general. It
> neither represented an rq scan cost (which is variable due to cache
> traffic) nor did it represent a full domain scan.
My point was mainly that the goal was to monitor the avg scan cost for
the domain (whatever the number of rqs effectively checked), so the
duration was impacted by a successful scan that returns earlier.
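(For reference, the bias under discussion looks roughly like the
following; this is a reconstruction from the description above, not the
exact hunk from the patch, so names and the shift factor are only
illustrative:)

	time = cpu_clock(this) - time;

	/*
	 * A successful scan returns early after visiting only a few rqs;
	 * inflating the loop count before dividing makes such a scan
	 * contribute a smaller per-rq cost -- the "magic" being objected to.
	 */
	if ((unsigned int)cpu < nr_cpumask_bits)
		loops <<= 2;

	update_avg(&this_sd->avg_scan_cost, div_u64(time, loops));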
>
> > And we were
> > comparing it with the average idle time (weighted by few factors).
> > And this cost was impacted by the fact that the scan can return early
> > because it found a cpu. This has advantage and drawback but at least
> > stays coherent in what we are comparing
> >
>
> Not really because it represented the cost of a scan of some number of
> rqs from 1 to sd->span_weight.
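(To make the pre-series behaviour concrete, simplified from
select_idle_cpu() in kernel/sched/fair.c around v5.10; details vary by
version:)

	avg_idle = this_rq()->avg_idle / 512;	/* local rq idle time, scaled */
	avg_cost = this_sd->avg_scan_cost + 1;	/* domain-wide scan cost */

	if (sched_feat(SIS_PROP)) {
		/* Scan depth proportional to avg_idle / avg_cost, at least 4. */
		u64 span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4 * avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}

	time = cpu_clock(this);
	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}
	/*
	 * The elapsed time is folded in whether or not an idle CPU was
	 * found, so avg_scan_cost mixes early (successful) and deep
	 * (failed) scans -- anywhere from 1 to sd->span_weight rqs.
	 */
	time = cpu_clock(this) - time;
	update_avg(&this_sd->avg_scan_cost, time);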
>
> > Peter's patch wants to move on per rq avg scan cost. And what you're
> > proposing is to add a magic heuristic to bias the per rq which at the
> > end makes this value just an opaque metric.
> >
>
> The metric is a heuristic no matter what. The scan cost of a RQ is not
> fixed as it depends on whether cache data needs to be updated or not. Also
> bear in mind that the first round of results and the results that I posted
> showed that Peter's patch has significant corner cases that the patch
> mitigates. You also note that avg_idle is an unusual metric to compare
> against because it has numerous timing artifacts. At least one of them
> is that we are extrapolating the domain idle time from a single rq which
> is a poor proxy measure when a domain is only partially used. There just
> is not a better one available without heavy writes to sd_llc which causes
> its own set of problems.
>
> > If we really want to keep the impact of early return than IMO we
> > should stay on a full domain scan level instead of a per rq.
> >
>
> That also has the same class of side-effects. Once the scan cost of
> a successful scan is strictly accounted for, there are problems. Even
> tracking the success scans is costly as the CPU clock has to be read and
> sd_llc has to be updated.
>
> > Also, there is another problem (that I'm investigating) which is that
> > this_rq()->avg_idle is stalled when your cpu is busy. Which means that
> > this avg_idle can just be a very old and meaningless value.
>
> Yes, avg_idle in itself is just the average inter-arrival time between
> a CPU going idle and receiving a wakeup partially bound roughly
> by 2*sysctl_sched_migration_cost. If avg_idle is traced for each
> select_idle_cpu(), it's obvious that it takes time to adjust when a
> load starts.
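(For reference, avg_idle is maintained roughly like this on the wakeup
path, simplified from kernel/sched/core.c:)

	if (rq->idle_stamp) {
		u64 delta = rq_clock(rq) - rq->idle_stamp;	/* time the CPU just spent idle */
		u64 max = 2 * rq->max_idle_balance_cost;	/* initialised to sysctl_sched_migration_cost */

		update_avg(&rq->avg_idle, delta);		/* avg += (delta - avg) / 8 */
		if (rq->avg_idle > max)
			rq->avg_idle = max;

		rq->idle_stamp = 0;
	}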
>
> > I think
> > that we should decay it periodically to reflect there is less and less
> > idle time (in fact no more) on this busy CPU that never goes to idle.
> > If a cpu was idle for a long period but then a long running task
> > starts, the avg_idle will stay stalled to the large value which is
> > becoming less and less relevant.
>
> While I get what you're saying, it does not help extrapolate what the
> idleness of a domain is.
No, but it gives a more up-to-date view of the idleness of the local
cpu, which is better than a stalled value.
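(A minimal sketch of the decay idea -- purely hypothetical, the hook
point and decay rate are made up for illustration:)

	/*
	 * Hypothetical: run periodically (e.g. from the tick) on a CPU that
	 * has not gone idle since its last wakeup, so avg_idle decays
	 * towards zero instead of staying stuck at an old, large value.
	 */
	static void decay_stale_avg_idle(struct rq *rq)
	{
		/* halve the remembered idle time each period the CPU stays busy */
		rq->avg_idle -= rq->avg_idle >> 1;
	}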
>
> > At the opposite, a cpu with a short running/idle period task will have
> > a lower avg_idle whereas it is more often idle.
> >
> > Another thing that worries me, is that we use the avg_idle of the
> > local cpu, which is obviously not idle otherwise it would have been
> > selected, to decide how much time we should spend on looking for
> > another idle CPU. I'm not sure that's the right metrics to use
> > especially with a possibly stalled value.
> >
>
> A better estimate requires heavy writes to sd_llc. The cost of that will
> likely offset any benefit gained by a superior selection of a scan
> depth.
>
> Treating a successful scan cost and a failed scan cost as being equal has
> too many corner cases. If we do not want to weight the successful scan
> cost, then the compromise is to keep the old behaviour that accounts for
I think that keeping the current way to compute scan_cost is the best option for now.
> scan failures only (avoids an sd_llc write at least) but base it on the
> estimated scan cost for a single rq. The fact that we do not account for
> scan failures should be explicitly commented so we do not forget it
> again in the future.
>
> --
> Mel Gorman
> SUSE Labs
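(A rough sketch of the compromise described above -- failures only,
normalised per rq -- not an actual patch, with variable names assumed
from the existing select_idle_cpu() code:)

	int loops = 0;
	u64 time = cpu_clock(this);

	for_each_cpu_wrap(cpu, cpus, target) {
		loops++;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			goto out;	/* success: deliberately not accounted */
		if (!--nr)
			break;		/* depth limit reached: treated as a failure */
	}
	cpu = -1;

	/*
	 * Only failed scans update avg_scan_cost, normalised by the number
	 * of rqs actually visited so it approximates a per-rq cost.
	 * Successful scans are not accounted, which avoids an sd_llc write
	 * on the fast path and should be commented explicitly.
	 */
	time = cpu_clock(this) - time;
	if (loops)
		update_avg(&this_sd->avg_scan_cost, div_u64(time, loops));
out:
	return cpu;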