Open Source and information security mailing list archives
 
Date: Mon, 1 Jul 2024 22:57:09 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Mike Galbraith <efault@....de>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Tim Chen <tim.c.chen@...el.com>, Yujie Liu
	<yujie.liu@...el.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R .
 Shenoy" <gautham.shenoy@....com>, Chen Yu <yu.chen.surf@...il.com>,
	<linux-kernel@...r.kernel.org>, Raghavendra K T <raghavendra.kt@....com>
Subject: Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Hi Mike,

On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > Hi Mike,
> >
> > Thanks for your time and giving the insights.
> >
> > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> > > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> > > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 0935f9d4bb7b..7399c4143528 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > > >         p->migration_pending = NULL;
> > > >  #endif
> > > >         init_sched_mm_cid(p);
> > > > +       p->prev_sleep_sum_runtime = 0;
> > > > +       p->duration_avg = 0;
> > > >  }
> > >
> > > Beginning life biased toward stacking?
> > >
> >
> > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> > task stacking in the beginning.
> 
> Or something, definitely.
>    
> > > >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 41b58387023d..445877069fbf 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > >
> > > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  
> > > >  dequeue_throttle:
> > > >         util_est_update(&rq->cfs, p, task_sleep);
> > > > +       if (task_sleep)
> > > > +               dur_avg_update(p);
> > > > +
> > > >         hrtick_update(rq);
> > > >  }
> > > >
> > >
> > > That qualifier looks a bit dangerous.  Microbench components tend to
> > > have only one behavior, but the real world goes through all kinds of
> > > nutty gyrations, intentional and otherwise.
> > >
> >
> > Understand. Unfortunately I don't have access to production environment
> > so I have to rely on microbenchmarks and a OLTP to check the result. I
> > get feedback from Abel that the former version of this patch brought
> > benefit to some short tasks like Redis in the production environment[1].
> > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/
> 
> Here's hoping you get a lot more.
>

We recently received a benchmark that simulates a Meta production workload.
I'll run some tests and see the results.
 
> > So consider 1 and 2, I'm thinking of moving the calculating of task duration from
> > dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
> > wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
> > scenario that the task is waiting for the real resource, rather than blocking
> > on a random sleeping lock. And the wait_woken() is widely used by the driver to
> > indicate that task is waiting for the resource. For example, the netperf calltrace:
> >
> >     schedule_timeout+222
> >     wait_woken+84
> >     sk_wait_data+378
> >     tcp_recvmsg_locked+507
> >     tcp_recvmsg+115
> >     inet_recvmsg+90
> >     sock_recvmsg+150
> >
> > In the future, if there is requirement other scenario could also invoke the newly
> > introduced update_cur_duration() when needed. For example, the pipe_read() could
> > use it if the task is going to sleep due to the empty pipe buffer. I change the
> > code as below, may I have your suggestion on this?
> 
> I don't have any suggestions that will help plug the holes, heck, I
> squabbled in this arena quite a bit some years ago, and did not win.
> Frankly I don't think the scheduler has the information necessary to do
> so, it'll always be a case of this will likely do less harm than good,
> but will certainly leave at least an occasional mark.
>

I agree. Unlike a bug fix, this kind of change inevitably involves trade-offs.
The goal is to do no harm in most cases and to bring benefit in some.

Regarding the necessary information, non-scheduler components might have
better knowledge than the scheduler:

1. If the hint comes from user space, it could be something like
   sched_attr::sync_wakeup, indicating that the task prefers sync wakeup
   and can tolerate task stacking. However, I'm unsure how a change to the
   user space interface would be received by the community. What are the
   criteria for accepting such a change? Would requirements from different
   production environments count as endorsement?

2. If the hint comes from another component in the kernel, it could be a
   driver or similar. seccomp in the current kernel enforces waking the
   wakee on the current CPU via the WF_CURRENT_CPU flag, but WF_CURRENT_CPU
   seems a bit aggressive for ordinary tasks. Instead, wait_woken() could
   be used when needed (by the network stack, for example) to indicate a
   possible cache/resource sensitivity of the wakee and, together with the
   task duration, to decide whether the wakee can be placed on the current
   CPU.
 
> Just take a look at the high speed ping-pong thing (not a benchmark,
> that's a box full of tape measures, rather silly, but..).  TCP_RR IS
> 1:1, has as short a duration as network stack plus scheduler can
> possibly make it, and is nearly synchronous to boot, two halves of a
> whole, the ONLY thing you can certainly safely stack..

I agree, this is a limited scenario. 

> but a shared L2 box still takes a wee hit when you do so.

According to a test conducted last month on a system with 500+ CPUs where 4
CPUs share the same L2 cache, around a 20% improvement was observed (though
not as large as on the platform without a shared L2). I haven't delved into
the details yet, but my understanding is that L1 cache-to-cache latency
within the L2 domain might also matter on large servers (which I need to
investigate further).

> 1:N or M:N
> tasks can approach its wakeup frequency range, and there's nothing you can do
> about the very same cache to cache latency you're trying to duck, it
> just is what it is, and is considered perfectly fine as it is.  That's
> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
> are actually up to at any given time.  We rashly presume that tasks
> waking one another implies a 1:1 relationship, we routinely call them
> buddies and generally get away with it.. but during any overlap they
> can be doing anything including N way data share, and regardless of
> what that is and section size, needless stacking flushes concurrency,
> injecting service latency in its place, cost unknown.
>

I believe this is a generic issue the current scheduler faces: it attempts
to predict a task's behavior from its runtime. For instance, task_hot()
checks how recently the task ran to predict whether it is cache-hot,
regardless of what the task actually did during its time slice. The same
applies to WF_SYNC, which hints to the scheduler that waking on the current
CPU may benefit from cache locality.

One possible method to determine whether the waker and wakee share data
could be to leverage NUMA balancing's numa_group data structure. Since NUMA
balancing periodically scans a task's VMA space and groups tasks accessing
the same physical pages into one numa_group, we can infer that a waker and
wakee in the same numa_group are likely to share data, and it might be
appropriate to place the wakee on top of the waker.

CC Raghavendra here in case he has any insights.

> Intentional stacking can be jokingly equated to injecting just a wee
> bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-)
>

I fully understand your concern, and the analogy is apt. We will conduct
additional tests and share the data/analysis later.

thanks,
Chenyu
