Message-ID: <20241112124117.GA336451@pauld.westford.csb>
Date: Tue, 12 Nov 2024 07:41:17 -0500
From: Phil Auld <pauld@...hat.com>
To: Mike Galbraith <efault@....de>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, wuyun.abel@...edance.com,
youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a
busy CPU

On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> > On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > > >
> > > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > > >
> > > > > The below improved uptime, and trace_printk() says it's doing the
> > > > > intended, so I suppose I'll add a feature and see what falls out.
> > > >
> > > > From netperf, I got.. number tabulation practice. Three runs of each
> > > > test with and without produced nothing but variance/noise.
> > >
> > > Make it go away then.
> > >
> > > If you could write a Changelog for your inspired bit and stick my cleaned
> > > up version under it, I'd be much obliged.
> >
> > Salut, much obliged for eyeball relief.
>
> Unfortunate changelog placeholder below aside, I think this patch may need
> to be yanked, as it trades one not readily repeatable regression for at
> least one that definitely is, and likely multiple others.
>
> (adds knob)
>
Yes, I was just coming here to reply. I have the results from the first
version of the patch (I don't think the later one fundamentally changed
enough that it will matter, but those results are still pending).

Not entirely surprisingly, we've traded a ~10% rand write regression for
a 5-10% rand read regression. This makes sense to me since the reads are
more likely to be synchronous and thus more buddy-like, benefiting from
flipping back and forth on the same CPU.

I'd probably have to take the reads over the writes in such a trade-off :)
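
To spell out where I think the read-side hit comes from: below is the core
branch of the patch (quoted further down) as I read it, with my own comments
added -- an illustrative paraphrase, not the actual diff:

	if (p->se.sched_delayed) {
		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;

		/*
		 * Busy rq and the wakee may run elsewhere: dequeue it and
		 * let it fall through to select_task_rq().  For a sync
		 * rand read that used to be re-enqueued on its cache-warm
		 * CPU, this is where the buddy pairing gets broken.
		 */
		if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
			return 0;
		}

		enqueue_task(rq, p, queue_flags);
	}
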
> tbench 8
>
> NO_MIGRATE_DELAYED 3613.49 MB/sec
> MIGRATE_DELAYED 3145.59 MB/sec
> NO_DELAY_DEQUEUE 3355.42 MB/sec
>
> First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> I've mentioned it doing, but $subject promptly did away with that and
> then some.
>
Yep, that's not pretty.
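
(For anyone following along: I'm assuming the knob here is a sched_feat,
wired up roughly as below, with the new dequeue path in ttwu_runnable()
gated on sched_feat(MIGRATE_DELAYED) -- my guess at the shape, not the
actual patch.)

	/* kernel/sched/features.h */
	SCHED_FEAT(MIGRATE_DELAYED, true)
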
> I thought I might be able to do away with the reservation-like side
> effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
>
> sched/eevdf: More PELT vs DELAYED_DEQUEUE
>
> ...for the cgroups-free test config, but a Q/D poke at idle_cpu() helped
> not at all.
>
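FWIW, I read that Q/D poke as something along these lines -- a guess on my
part, not your diff, assuming the per-cfs_rq h_nr_delayed counter from that
series and with the helper name made up:

	/* like idle_cpu(), but ignore tasks that are merely sched_delayed */
	static int idle_cpu_ignoring_delayed(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		if (rq->curr != rq->idle)
			return 0;

		/* root cfs_rq only, i.e. the cgroup-free case */
		if (rq->nr_running != rq->cfs.h_nr_delayed)
			return 0;

		if (rq->ttwu_pending)
			return 0;

		return 1;
	}
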
I wonder if the last_wakee stuff could be leveraged here (an idle thought,
so to speak). Haven't looked closely enough.
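E.g. only punt the delayed wakee off its CPU when waker and wakee don't look
like a stable 1:1 pair, reusing the last_wakee/wakee_flips bookkeeping that
wake_wide() is built on. Pure sketch, helper name and threshold invented:

	static bool looks_like_buddy_pair(struct task_struct *p)
	{
		/* waker keeps waking p, and neither side flips much */
		return current->last_wakee == p &&
		       current->wakee_flips < 2 && p->wakee_flips < 2;
	}

	/* ttwu_runnable(): ... && !looks_like_buddy_pair(p) before punting */
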
Cheers,
Phil
> > ---snip---
> >
> > Phil Auld (Redhat) reported an fio benchmark regression found to have been
> > caused by the addition of the DELAY_DEQUEUE feature, suggested that it may
> > be related to wakees losing the ability to migrate, and confirmed that
> > restoring that ability did indeed restore the previous performance.
> >
> > (de-uglified-a-lot-by)
> >
> > Reported-by: Phil Auld <pauld@...hat.com>
> > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> > Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> > Signed-off-by: Mike Galbraith <efault@....de>
> > ---
> > kernel/sched/core.c | 48 +++++++++++++++++++++++++++++-------------------
> > kernel/sched/sched.h | 5 +++++
> > 2 files changed, 34 insertions(+), 19 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
> >   */
> >  static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >  {
> > -	struct rq_flags rf;
> > -	struct rq *rq;
> > -	int ret = 0;
> > -
> > -	rq = __task_rq_lock(p, &rf);
> > -	if (task_on_rq_queued(p)) {
> > -		update_rq_clock(rq);
> > -		if (p->se.sched_delayed)
> > -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > -		if (!task_on_cpu(rq, p)) {
> > -			/*
> > -			 * When on_rq && !on_cpu the task is preempted, see if
> > -			 * it should preempt the task that is current now.
> > -			 */
> > -			wakeup_preempt(rq, p, wake_flags);
> > +	CLASS(__task_rq_lock, rq_guard)(p);
> > +	struct rq *rq = rq_guard.rq;
> > +
> > +	if (!task_on_rq_queued(p))
> > +		return 0;
> > +
> > +	update_rq_clock(rq);
> > +	if (p->se.sched_delayed) {
> > +		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > +		/*
> > +		 * Since sched_delayed means we cannot be current anywhere,
> > +		 * dequeue it here and have it fall through to the
> > +		 * select_task_rq() case further along the ttwu() path.
> > +		 */
> > +		if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> > +			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> > +			return 0;
> >  		}
> > -		ttwu_do_wakeup(p);
> > -		ret = 1;
> > +
> > +		enqueue_task(rq, p, queue_flags);
> > +	}
> > +	if (!task_on_cpu(rq, p)) {
> > +		/*
> > +		 * When on_rq && !on_cpu the task is preempted, see if
> > +		 * it should preempt the task that is current now.
> > +		 */
> > +		wakeup_preempt(rq, p, wake_flags);
> >  	}
> > -	__task_rq_unlock(rq, &rf);
> > +	ttwu_do_wakeup(p);
> > 
> > -	return ret;
> > +	return 1;
> >  }
> > 
> >  #ifdef CONFIG_SMP
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
> >  	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> >  }
> > 
> > +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> > +		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> > +		    __task_rq_unlock(_T->rq, &_T->rf),
> > +		    struct rq *rq; struct rq_flags rf)
> > +
> >  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> >  		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
> >  		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),
> >
> >
>