linux-kernel - Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT


Message-ID: <20190124090755.GC13536@hirez.programming.kicks-ass.net>
Date:   Thu, 24 Jan 2019 10:07:55 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Patrick Bellasi <patrick.bellasi@....com>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Thara Gopinath <thara.gopinath@...aro.org>,
        pkondeti@...eaurora.org, Quentin Perret <quentin.perret@....com>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
Subject: Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT


Sorry; trying to get back to this and re-reading the old conversations.

On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote:
> On 29-Nov 13:53, Peter Zijlstra wrote:
> > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:
> > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ac855b2f4774..93e0cf5d8a76 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > >  	if (!task_sleep)
> > >  		return;
> > > 
> > > +	/* Skip samples which do not represent an actual utilization */
> > > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> > > +		return;
> > > +
> > >  	/*
> > >  	 * If the PELT values haven't changed since enqueue time,
> > >  	 * skip the util_est update.
> > 
> > Would you not want something like:
> > 
> > 	min(task_util(p), capacity_of(task_cpu(p)))
> > 
> > And is this the only place where we need this?
> 
> Mmm... even this could be an over-estimation:
> 
> I've just posted an example in my last reply to Vincent, end of:
> 
>    Message-ID: <20181129150020.GF23094@...0439-lin>
>    https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

In particular this bit:

 | Seems we agree that, when there is no idle time:
 | - the two 15% tasks will be overestimated
 | - their utilization will reach 50% after a while

Right?
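
To make the "reach 50% after a while" point concrete, here is a rough
sketch, not kernel code. It assumes a plain geometric average with the
PELT half-life of 32 periods, tasks that take strict turns of one period
each, and no idle time at all. The names PELT_Y and util_after() are
made up for illustration.

```c
/* Approximately 0.5^(1/32): a contribution halves after 32 periods. */
#define PELT_Y 0.97857206

/*
 * Hypothetical model: nr_tasks tasks share one CPU round-robin, one
 * period each, with no idle time. Returns the utilization of task 0
 * on the 0..1024 scale after 'periods' periods.
 */
static double util_after(int nr_tasks, int periods)
{
	double util = 0.0;

	for (int t = 0; t < periods; t++) {
		/* Task 0 runs only on its own turn. */
		int running = (t % nr_tasks) == 0;

		/* Decay the history, then add this period if running. */
		util = util * PELT_Y +
		       (running ? (1.0 - PELT_Y) * 1024.0 : 0.0);
	}
	return util;
}
```

With two tasks the value settles near 512 (50%) no matter how small the
tasks really are, which is the 100% / RUNNABLE_TASKS_COUNT behaviour
described in the quoted message.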

> > OTOH, if the task is always running, it will be always running
> > irrespective of where it runs.
> 
> That's not what I'm concerned about. I'm concerned about small tasks
> which are running on limited capacity (e.g. due to thermal capping)
> without idle time. In this case, the new "utilization" signal could
> overestimate the real task needs.
> 
> > Not storing these samples seems weird though; this is the exact
> > condition you want to record -- the task is very active, if we skip
> > these, we'll come back at a low frequency on the next wakeup.
> 
> When there is no idle time, we don't know if the reported
> utilization, above the cpu capacity, is due to the task being bigger...
> or just the new utilization signal converging towards:
> 
>     100% / RUNNABLE_TASKS_COUNT

So if I'm not mistaken we then have 3 cases:

 1) runnable == util <= capacity

    no contention, idle

 2) runnable == util > capacity

    no contention, no idle

 3) runnable > util

    contention, no idle

For 1) we can use: 'util'
For 2) we can use: 'capacity'
For 3) we can use: 'util * capacity >> 10'

(note that 2 is a special case of 3 when u=1)
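
The three cases above could be written down roughly like this. It is
only a sketch of the case analysis: util_sample() and its parameters are
hypothetical, all values are taken on the 0..1024 capacity scale, and
none of this is what fair.c actually does.

```c
#define SCHED_CAPACITY_SHIFT	10

/*
 * Hypothetical helper: pick the utilization sample to record for a
 * task, given its runnable and util signals and the capacity of the
 * CPU it ran on.
 */
static unsigned long util_sample(unsigned long runnable,
				 unsigned long util,
				 unsigned long capacity)
{
	/* 3) contention, no idle: scale util by the CPU capacity */
	if (runnable > util)
		return (util * capacity) >> SCHED_CAPACITY_SHIFT;

	/* 2) no contention, no idle: the CPU is the limit */
	if (util > capacity)
		return capacity;

	/* 1) no contention, idle time: util is trustworthy as is */
	return util;
}
```

Reading u=1 as util == 1024, the formula for 3) gives
1024 * capacity >> 10 == capacity, which is the value used for 2).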

This should work, right?

Now, rather than doing complicated things like that, you figure that
when there's no idle time there's also no dequeue happening, so we can
simply short-cut by skipping the entire update and forget about cases
2) and 3).

Did I get that right?