[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZveiDh2/ztZTP/fH@chenyu5-mobl2>
Date: Sat, 28 Sep 2024 14:28:30 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Honglei Wang <jameshongleiwang@....com>, Oliver Sang
<oliver.sang@...el.com>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>, Chunxin Zang
<zangchunxin@...iang.com>, <linux-kernel@...r.kernel.org>, Chen Yu
<yu.chen.surf@...il.com>, kernel test robot <oliver.sang@...el.com>, "K
Prateek Nayak" <kprateek.nayak@....com>
Subject: Re: [PATCH v2] sched/eevdf: Fix wakeup-preempt by checking
cfs_rq->nr_running
Hi Honglei,
On 2024-09-27 at 21:38:53 +0800, Honglei Wang wrote:
>
>
> On 2024/9/27 14:56, Chen Yu wrote:
> > Hello Honglei,
> >
> > On 2024-09-27 at 09:54:28 +0800, Honglei Wang wrote:
> > >
> > >
> > > On 2024/9/25 16:54, Chen Yu wrote:
> > > > Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> > > > introduced a mechanism that a wakee with shorter slice could preempt
> > > > the current running task. It also lower the bar for the current task
> > > > to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
> > > > when the current task has ran out of time slice. But there is a scenario
> > > > that is problematic. Say, if there is 1 cfs task and 1 rt task, before
> > > > 85e511df3cec, update_deadline() will not trigger a reschedule, and after
> > > > 85e511df3cec, since rq->nr_running is 2 and resched is true, a resched_curr()
> > > > would happen.
> > > >
> > > > Some workloads (like the hackbench reported by lkp) do not like
> > > > over-scheduling. We can see that the preemption rate has been
> > > > increased by 2.2%:
> > > >
> > > > 1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches
> > > >
> > > > Restore its previous check criterion.
> > > >
> > > > Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> > > > Reported-by: kernel test robot <oliver.sang@...el.com>
> > > > Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
> > > > Suggested-by: K Prateek Nayak <kprateek.nayak@....com>
> > > > Tested-by: K Prateek Nayak <kprateek.nayak@....com>
> > > > Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> > > > ---
> > > > v1->v2:
> > > > Check cfs_rq->nr_running instead of rq->nr_running(K Prateek Nayak)
> > > > ---
> > > > kernel/sched/fair.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 225b31aaee55..53a351b18740 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -1247,7 +1247,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> > > > account_cfs_rq_runtime(cfs_rq, delta_exec);
> > > > - if (rq->nr_running == 1)
> > > > + if (cfs_rq->nr_running == 1)
> > > > return;
> > > Hi Yu,
> > >
> > > I'm wondering if commit 85e511df3cec wants to give more chances to do
> > > resched just in case there are 'short slices' tasks ready in the other cfs
> > > hierarchy.
> > > Does something like rq->cfs->nr_running == 1 make more sense? But
> > > maybe it helps less than 'cfs_rq->nr_running == 1' in this hackbench case.
> > >
> >
> > Thanks for taking a look.
> >
> > It could be possible that Peter wanted the short tasks to preempt other quickly.
> > If I understand correctly, when we say preemption, we usually consider two
> > tasks which are in the same cfs_rq(level). For example, check_preempt_wakeup_fair()
> > iterates the hierarchy from down-up until the current task and the wakee are in the
> > same level via find_matching_se(&se, &pse), then check if the wakee can preempt the
> > current. This should be consistent with the tick preemption in update_curr(). And
> > whether the short task should preempt the current is checked by
> > update_curr() -> did_preempt_short(), rather than checking the cfs_rq->nr_running/nr_h_running
> > I suppose.
> >
> Hi Yu,
>
> Yep, I understand the preemption should happen in the same cfs level. Just
> not sure the purpose of the 'nr_running check' stuff. Perhaps its role is
> just to judge whether it’s necessary to do the preemption check. If there is
> at least one more ready (cfs) task in the rq and current is not eligible, we
> take care of the waiting tasks. Thoughts?
I got your point and it makes sense. Whether the preemption check should be triggered
seems to be a heuristic trade-off to me. I'm ok with using more aggressive preemption
strategy as it depends on whether that workload prefers latency or throughput, and as
long as it did not introduce regression :-)
Oliver, may I know if you happen to have time for a test if the following change
suggested by Honglei would make the regression go away? Thanks.
thanks,
Chenyu
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd2f3831c74e..290e5fdfc267 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1247,7 +1247,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
account_cfs_rq_runtime(cfs_rq, delta_exec);
- if (rq->nr_running == 1)
+ if (rq->cfs.nr_running == 1)
return;
if (resched || did_preempt_short(cfs_rq, curr)) {
--
2.25.1
Powered by blists - more mailing lists