Message-ID: <924b853c01a5466c0bebe3bd324c46e9980dba1c.camel@gmx.de>
Date: Tue, 26 Nov 2024 07:30:52 +0100
From: Mike Galbraith <efault@....de>
To: K Prateek Nayak <kprateek.nayak@....com>, Phil Auld <pauld@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
On Tue, 2024-11-26 at 11:02 +0530, K Prateek Nayak wrote:
> Hello Mike,
>
> On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> > [..snip..]
> >
> > Question: did wiping off the evil leave any meaningful goodness behind?
> >
> > ---
> >
> > sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
> >
> > Phil Auld (Red Hat) reported an fio benchmark regression that was
> > found to be caused by the addition of the DELAY_DEQUEUE feature,
> > suggested it may be related to wakees losing the ability to migrate,
> > and confirmed that restoring that ability did restore the previous
> > performance.
> >
> > V2: do not rip buddies apart, convenient on/off switch
> >
> > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> > Signed-off-by: Mike Galbraith <efault@....de>
> > ---
> >  kernel/sched/core.c     | 51 ++++++++++++++++++++++++++++++------------------
> >  kernel/sched/features.h |  5 ++++
> >  kernel/sched/sched.h    |  5 ++++
> >  3 files changed, 42 insertions(+), 19 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
> >   */
> >  static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >  {
> > -	struct rq_flags rf;
> > -	struct rq *rq;
> > -	int ret = 0;
> > -
> > -	rq = __task_rq_lock(p, &rf);
> > -	if (task_on_rq_queued(p)) {
> > -		update_rq_clock(rq);
> > -		if (p->se.sched_delayed)
> > -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > -		if (!task_on_cpu(rq, p)) {
> > -			/*
> > -			 * When on_rq && !on_cpu the task is preempted, see if
> > -			 * it should preempt the task that is current now.
> > -			 */
> > -			wakeup_preempt(rq, p, wake_flags);
> > +	CLASS(__task_rq_lock, rq_guard)(p);
> > +	struct rq *rq = rq_guard.rq;
> > +
> > +	if (!task_on_rq_queued(p))
> > +		return 0;
> > +
> > +	update_rq_clock(rq);
> > +	if (p->se.sched_delayed) {
> > +		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +		int dequeue = sched_feat(DEQUEUE_DELAYED);
> > +
> > +		/*
> > +		 * Since sched_delayed means we cannot be current anywhere,
> > +		 * dequeue it here and have it fall through to the
> > +		 * select_task_rq() case further along in ttwu() path.
> > +		 * Note: Do not rip buddies apart else chaos follows.
> > +		 */
> > +		if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
>
> Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
> encounter a delayed task?
The switch is for test convenience.
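(The features.h hunk isn't quoted in this thread; presumably it is no
more than a default-enabled SCHED_FEAT entry along these lines:)

	/* Assumed shape of the kernel/sched/features.h hunk: */
	SCHED_FEAT(DEQUEUE_DELAYED, true)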
> > +		    !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
>
> Technically, we are still looking at the last wakeup here since
> record_wakee() is only called later. If we care about 1:1 buddies,
> should we just check "current == p->last_wakee"? Otherwise, there is a
> good chance "p" has an m:n waker-wakee relationship, in which case
> perhaps a "want_affine"-like heuristic can help?
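(For reference, the existing m:n detector in kernel/sched/fair.c is
wake_wide(), driven by the wakee_flips counters that record_wakee()
maintains; roughly:)

	static int wake_wide(struct task_struct *p)
	{
		unsigned int master = current->wakee_flips;
		unsigned int slave = p->wakee_flips;
		int factor = __this_cpu_read(sd_llc_size);

		/* Many distinct wakees on either side means m:n, not 1:1. */
		if (master < slave)
			swap(master, slave);
		if (slave < factor || master < slave * factor)
			return 0;
		return 1;
	}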
The intent is to blunt the instrument a bit. Phil should have a highly
active interrupt source, which will give wakeup credit to whatever is
sitting on that CPU, breaking 1:1 connections... a little bit. That's
why it still migrates tbench buddies, but NOT at a rate that turns a
tbench progression into a new-low regression. The hope is that the
load shift caused by that active interrupt source is enough to give
Phil's regression some of the help it demonstrated wanting, without
the collateral damage. It might now be so weak as to not meet the
"meaningful" bar in my question, in which case it lands on the
ginormous pile of meh: sorta works, but why would anyone care?
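(Spelled out, the buddy filter in the hunk above amounts to this
hypothetical helper; the patch open-codes the condition:)

	/*
	 * Hypothetical helper, not in the patch: treat waker and wakee
	 * as 1:1 buddies if either recorded the other as its last wakee.
	 * An active interrupt source waking whatever sits on the CPU
	 * overwrites ->last_wakee, loosening the filter a little.
	 */
	static inline bool ttwu_buddies(struct rq *rq, struct task_struct *p)
	{
		return rq->curr->last_wakee == p || p->last_wakee == rq->curr;
	}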
> For science, I was wondering if the decision to dequeue + migrate or
> requeue the delayed task can be put off until after the whole
> select_task_rq() target selection (note: without the h_nr_delayed
> stuff, some of that wake_affine_idle() logic falls apart). Hackbench
> (which saw some regression with EEVDF Complete) seems to like it
> somewhat, but it still falls behind NO_DELAY_DEQUEUE.
You can, with a few more fast path cycles and some duplication, none of
which looks very desirable.
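(A rough sketch of that shape, names and flags assumed rather than
taken from Prateek's actual diff: the requeue-vs-migrate decision
moves behind a select_task_rq() call that the normal ttwu() path
would otherwise do for us, hence the duplication:)

	if (p->se.sched_delayed) {
		int cpu = select_task_rq(p, p->wake_cpu, &wake_flags);

		if (cpu == cpu_of(rq)) {
			/* Target is this CPU: requeue the delayed task in place. */
			enqueue_task(rq, p, ENQUEUE_DELAYED | ENQUEUE_NOCLOCK);
		} else {
			/* Target differs: dequeue and let ttwu() migrate it. */
			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
		}
	}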
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
>   NO_DELAY_DEQUEUE    Mike's v2        Full ttwu + requeue/migrate
>   5.76                5.72 (  1% )     5.82 ( -1% )
>   6.53                6.56 (  0% )     6.65 ( -2% )
>   6.79                7.04 ( -4% )     7.02 ( -3% )
>   6.91                7.04 ( -2% )     7.03 ( -2% )
>   7.63                8.05 ( -6% )     7.88 ( -3% )
>
> Only subtle changes in IBS profiles; there aren't any obvious shifts
> in hotspots with hackbench at least. Not sure if it is just the act of
> needing to do a dequeue + enqueue from the wakeup context that adds to
> the overall regression.
Those numbers say to me that hackbench doesn't care deeply. That
works for me, because I don't care deeply about nutty fork bombs ;-)
-Mike