linux-kernel - Re: [PATCH 1/2] sched/fair: Record the average duration of a task

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <db81ba7fba622e2a1b7186e66471cfb9ad8490fd.camel@gmx.de>
Date: Mon, 01 Jul 2024 08:57:25 +0200
From: Mike Galbraith <efault@....de>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
 Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
 <vincent.guittot@...aro.org>, Tim Chen <tim.c.chen@...el.com>, Yujie Liu
 <yujie.liu@...el.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R
 . Shenoy" <gautham.shenoy@....com>,  Chen Yu <yu.chen.surf@...il.com>,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched/fair: Record the average duration of a task

On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> Hi Mike,
>
> Thanks for your time and giving the insights.
>
> On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0935f9d4bb7b..7399c4143528 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > >         p->migration_pending = NULL;
> > >  #endif
> > >         init_sched_mm_cid(p);
> > > +       p->prev_sleep_sum_runtime = 0;
> > > +       p->duration_avg = 0;
> > >  }
> >
> > Beginning life biased toward stacking?
> >
>
> OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> task stacking in the beginning.

Or something, definitely.
   
> > >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 41b58387023d..445877069fbf 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > >
> > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  
> > >  dequeue_throttle:
> > >         util_est_update(&rq->cfs, p, task_sleep);
> > > +       if (task_sleep)
> > > +               dur_avg_update(p);
> > > +
> > >         hrtick_update(rq);
> > >  }
> > >
> >
> > That qualifier looks a bit dangerous.  Microbench components tend to
> > have only one behavior, but the real world goes through all kinds of
> > nutty gyrations, intentional and otherwise.
> >
>
> Understand. Unfortunately I don't have access to production environment
> so I have to rely on microbenchmarks and a OLTP to check the result. I
> get feedback from Abel that the former version of this patch brought
> benefit to some short tasks like Redis in the production environment[1].
> https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

Here's hoping you get a lot more.

> I can launch a combination of microbenchmarks in parallel to check the impact.
>
> > The heuristics in the next patch seem to exhibit a healthy level of
> > paranoia, but these bits could perhaps use a tad more.  Bad experiences
> > springs to mind when I stare at that - sleepers going hog, hogs meet
> > sleeping lock contention, preemption, sync hint not meaning much...
> >
>
> I see. If I understand correctly, the scenario mentioned above
> could bring a false positive of 'short task', which causes
> task stacking.

Yeah.

> If the sleeper task:
> 1. is preempted frequently. This should not be a problem, because
>    the task duration is unlikely to be shorten by preemption, and
>    the short_task() is unlikely to return true.

Ah, but it not being a problem would beg the question why bother?
There are consequences either way, the dark side is just scarier.

> 2. meets sleeping lock contention. This would be a false positive 'short task',
>    which bring unexpected task stacking.

Yeah.

> So consider 1 and 2, I'm thinking of moving the calculating of task duration from
> dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
> wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
> scenario that the task is waiting for the real resource, rather than blocking
> on a random sleeping lock. And the wait_woken() is widely used by the driver to
> indicate that task is waiting for the resource. For example, the netperf calltrace:
>
>     schedule_timeout+222
>     wait_woken+84
>     sk_wait_data+378
>     tcp_recvmsg_locked+507
>     tcp_recvmsg+115
>     inet_recvmsg+90
>     sock_recvmsg+150
>
> In the future, if there is requirement other scenario could also invoke the newly
> introduced update_cur_duration() when needed. For example, the pipe_read() could
> use it if the task is going to sleep due to the empty pipe buffer. I change the
> code as below, may I have your suggestion on this?

I don't have any suggestions that will help plug the holes, heck, I
squabbled in this arena quite a bit some years ago, and did not win.
Frankly I don't think the scheduler has the information necessary to do
so, it'll always be a case of this will likely do less harm than good,
but will certainly leave at least an occasional mark.

Just take a look at the high speed ping-pong thing (not a benchmark,
that's a box full of tape measures, rather silly, but..).  TCP_RR IS
1:1, has as short a duration as network stack plus scheduler can
possibly make it, and is nearly synchronous to boot, two halves of a
whole, the ONLY thing you can certainly safely stack.. but a shared L2
box still takes a wee hit when you do so.  1:N or M:N tasks can
approach its wakeup frequency range, and there's nothing you can do
about the very same cache to cache latency you're trying to duck, it
just is what it is, and is considered perfectly fine as it is.  That's
a bit of a red flag, but worse is the lack of knowledge wrt what tasks
are actually up to at any given time.  We rashly presume that tasks
waking one another implies a 1:1 relationship, we routinely call them
buddies and generally get away with it.. but during any overlap they
can be doing anything including N way data share, and regardless of
what that is and section size, needless stacking flushes concurrency,
injecting service latency in its place, cost unknown.

Intentional stacking can be jokingly equated to injecting just a wee
bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-)

	-Mike