Message-ID: <CAKfTPtDjSTEaXQ8t8gz-DS_UbMUmN3F2PnbF5m-vJcL-KxaJfQ@mail.gmail.com>
Date: Mon, 27 Feb 2023 15:55:15 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com,
linux-kernel@...r.kernel.org, parth@...ux.ibm.com,
cgroups@...r.kernel.org, qyousef@...alina.io,
chris.hyser@...cle.com, patrick.bellasi@...bug.net,
David.Laight@...lab.com, pjt@...gle.com, pavel@....cz,
tj@...nel.org, qperret@...gle.com, tim.c.chen@...ux.intel.com,
joshdon@...gle.com, timj@....org, kprateek.nayak@....com,
yu.c.chen@...el.com, youssefesmat@...omium.org,
joel@...lfernandes.org
Subject: Re: [PATCH v10 8/9] sched/fair: Add latency list
On Mon, 27 Feb 2023 at 14:29, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Wed, Feb 22, 2023 at 12:16:29PM +0100, Vincent Guittot wrote:
> > On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@...radead.org> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> > >
> > > > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > > +{
> > > > +
> > > > +	/* Only latency sensitive entities can be added to the list */
> > > > +	if (se->latency_offset >= 0)
> > > > +		return;
> > > > +
> > > > +	if (!RB_EMPTY_NODE(&se->latency_node))
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * An execution time less than sysctl_sched_min_granularity means that
> > > > +	 * the entity has been preempted by a higher sched class or by an entity
> > > > +	 * with a higher latency constraint.
> > > > +	 * Put it back in the list so it gets a chance to run 1st during the
> > > > +	 * next slice.
> > > > +	 */
> > > > +	if (!(flags & ENQUEUE_WAKEUP)) {
> > > > +		u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > > +
> > > > +		if (delta_exec >= sysctl_sched_min_granularity)
> > > > +			return;
> > > > +	}
> > >
> > > I'm not a big fan of this dynamic enqueueing condition; it makes it
> > > rather hard to interpret the below addition to pick_next_entity().
> > >
> > > Let me think about this more... at the very least the comment at the
> > > __pick_first_latency() use below needs to be expanded upon if we keep
> > > it like so.
> >
> > Only waking tasks should be added to the latency rb tree, so they
>
> But that's what I'm saying, you can game this by doing super short
> sleeps every min_gran.
Yes, it is, for a time. But I'm not sure this will always be beneficial
in the end, because most of the time you will get your full slice
without playing this game.
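
To illustrate, the pattern you describe would look something like the
below from userspace (a hypothetical sketch; names and values are
illustrative only, with ~2ms standing in for min_gran):

#include <time.h>

/* Spin for roughly ns nanoseconds of this thread's CPU time. */
static void burn_cpu_ns(long ns)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
	do {
		clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
	} while ((t1.tv_sec - t0.tv_sec) * 1000000000L +
		 (t1.tv_nsec - t0.tv_nsec) < ns);
}

int main(void)
{
	const struct timespec tiny = { 0, 1000 };	/* 1us */

	for (;;) {
		burn_cpu_ns(2000000L);	/* run just under min_gran */
		nanosleep(&tiny, NULL);	/* super short sleep: the next
					 * enqueue is ENQUEUE_WAKEUP, so the
					 * task re-enters the latency rb tree
					 */
	}
}

Every enqueue then takes the wakeup path, so the task keeps its place
in the latency rb tree while consuming almost a full min_gran of CPU
per cycle.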
>
> > can be selected to run 1st (as long as they don't use too much
> > runtime). But task A can wake up, preempt the current task B thanks to
> > its latency nice, start to run for a few usecs, and then be immediately
> > preempted by an RT task C, as an example. In this case, we consider that
> > task A didn't get a chance to run after its wakeup, and we put it
> > back in the latency rb tree just as if task A had just woken up but
> > hadn't preempted the new current task C.
>
> So ideally, and this is where I'm very slow with thinking, that
> wakeup_preempt_entity() condition here:
>
> > > > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > >  		se = cfs_rq->last;
> > > >  	}
> > > > 
> > > > +	/* Check for a latency sensitive entity waiting to run */
> > > > +	latency = __pick_first_latency(cfs_rq);
> > > > +	if (latency && (latency != se) &&
> > > > +	    wakeup_preempt_entity(latency, se) < 1)
> > > > +		se = latency;
>
> should be sufficient to provide fair bandwidth usage. The EEVDF paper
> achieves this by selecting the leftmost eligible task, where eligibility
> is dependent on negative lag. Only those tasks that are behind the pack
> are allowed runtime.
>
> Now clearly our min_vruntime is unsuited for that exact scheme, but IIRC
> wakeup_preempt_entity() does not allow for starvation, so we should be
> good, even without that weird condition in __enqueue_latency(), hmm?
If we unconditionally __enqueue_latency() the task, then it can end up
providing more bandwidth to those tasks, because it's like having a
larger sleep credit than the others.
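
For reference, my reading of the eligibility test in the EEVDF paper
would be something like the below (a rough sketch, not what this patch
does; avg_vruntime() is an assumed helper returning the load-weighted
average virtual time V of the cfs_rq, which we don't have today):

static inline bool entity_eligible(struct cfs_rq *cfs_rq,
				   struct sched_entity *se)
{
	/*
	 * Eligible == the entity is at or behind the pack: it has
	 * received no more service than it was entitled to, so its
	 * vruntime does not exceed the average virtual time V.
	 */
	return (s64)(se->vruntime - avg_vruntime(cfs_rq)) <= 0;
}

But as you say, our min_vruntime can't give us that V today.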
The original condition in __enqueue_latency() was:

	if (!(flags & ENQUEUE_WAKEUP)) {
		return;
	}
So that task gets a chance to preempt others only at wakeup.
But then I have seen such tasks being preempted immediately by RT
tasks and, as a result, losing their latency advantage. Maybe I should
keep the condition above and add the weird condition in a separate
patch with dedicated figures.
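
As a sketch, the split would look like this (the 1st patch restores the
wakeup-only condition, the 2nd one relaxes it for entities preempted
before running a full min granularity):

	/* patch 1: only wakeups enter the latency rb tree */
	if (!(flags & ENQUEUE_WAKEUP))
		return;

	/* patch 2, with dedicated figures: also keep entities that were
	 * preempted before running sysctl_sched_min_granularity
	 */
	if (!(flags & ENQUEUE_WAKEUP)) {
		u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;

		if (delta_exec >= sysctl_sched_min_granularity)
			return;
	}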