linux-kernel - Re: [PATCH] sched/fair: Reschedule the cfs

Message-ID: <Zmrb9YlcUAo10TsP@chenyu5-mobl2>
Date: Thu, 13 Jun 2024 19:45:57 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Chunxin Zang <spring.cxz@...il.com>
CC: K Prateek Nayak <kprateek.nayak@....com>, <mingo@...hat.com>, "Peter
 Zijlstra" <peterz@...radead.org>, <juri.lelli@...hat.com>,
	<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
	<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
	<bristot@...hat.com>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<yangchen11@...iang.com>, Jerry Zhou <zhouchunhua@...iang.com>, Chunxin Zang
	<zangchunxin@...iang.com>, Balakumaran Kannan <kumaran.4353@...il.com>, "Mike
 Galbraith" <efault@....de>
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
 ineligible

On 2024-06-11 at 21:10:50 +0800, Chunxin Zang wrote:
> 
> 
> > On Jun 7, 2024, at 10:38, Chen Yu <yu.c.chen@...el.com> wrote:
> > 
> > On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
> >> 
> >> 
> >>> On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@...el.com> wrote:
> >>> 
> >>> 
> >>> Sorry for the late reply and thanks for helping to clarify this. Yes, this is
> >>> what my previous concern was:
> >>> 1. It does not consider the cgroup and does not check preemption at the same
> >>>  level, which is covered by find_matching_se().
> >>> 2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
> >>>  later pick_eevdf() will check the eligibility of current anyway. But
> >>>  as pointed out by Chunxin, his concern is the double traversal of the rb-tree.
> >>>  I just wonder if we could leverage cfs_rq->next to store the next
> >>>  candidate, so it can be picked directly in the 2nd pick as a fast path?
> >>>  Something like below, untested:
> >>> 
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 8a5b1ae0aa55..f716646d595e 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
> >>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> >>> {
> >>>       struct task_struct *curr = rq->curr;
> >>> -       struct sched_entity *se = &curr->se, *pse = &p->se;
> >>> +       struct sched_entity *se = &curr->se, *pse = &p->se, *next;
> >>>       struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> >>>       int cse_is_idle, pse_is_idle;
> >>> 
> >>> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >>>       /*
> >>>        * XXX pick_eevdf(cfs_rq) != se ?
> >>>        */
> >>> -       if (pick_eevdf(cfs_rq) == pse)
> >>> +       next = pick_eevdf(cfs_rq);
> >>> +       if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> >>> +               set_next_buddy(next);
> >>> +
> >>> +       if (next == pse)
> >>>               goto preempt;
> >>> 
> >>>       return;
> >>> 
> >>> 
> >>> thanks,
> >>> Chenyu
> >> 
> >> Hi Chen
> >> 
> >> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
> >> the RB-tree twice, I initially had two methods in mind. 
> >> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
> >>  This idea is similar to the one you proposed this time. 
> >> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair.' 
> >>  Because I believe that 'checking whether preemption is necessary' and 'finding the optimal
> >>  process to schedule' are two different things.
> > 
> > I agree, and it seems that in the current EEVDF implementation the former relies on the latter.
> > 
> >> 'check_preempt_wakeup_fair' is not just to
> >>  check if the newly awakened process should preempt the current process; it can also serve
> >>  as an opportunity to check whether any other processes should preempt the current one,
> >>  thereby improving the real-time performance of the scheduler. Although now in pick_eevdf,
> >>  the legitimacy of 'curr' is also evaluated, if the result returned is not the awakened process,
> >>  then the current process will still not be preempted.
> > 
> > I thought Mike had proposed a patch to deal with the scenario you mentioned above:
> > https://lore.kernel.org/lkml/e17d3d90440997b970067fe9eaf088903c65f41d.camel@gmx.de/
> > 
> > And I suppose you are referring to increasing the preemption chance on current rather than reducing
> > the number of pick_eevdf() invocations in check_preempt_wakeup_fair().
> 
> Hi chen
> 
> Happy holidays. I believe the modifications here will indeed provide more opportunities for preemption,
> thereby leading to lower scheduling latencies, while also truly reducing calls to pick_eevdf.  It's a win-win situation. :)
> 
> I conducted a test. It involved applying my modifications on top of Mike's patch, along with
> adding some statistical counters following your previous method, in order to assess the potential
> benefits of my changes.
>

[snip]
 
> Looking at the results, adding an ineligible check for the se within check_preempt_wakeup_fair
> can prevent 3% of pick_eevdf calls under the RUN_TO_PARITY feature, and in the case of
> NO_RUN_TO_PARITY, it can prevent 30% of pick_eevdf calls. It was also discovered that the
> patch_preempt_only_count is at 0, indicating that all invalid checks for the se are correct.
> 
> It's worth mentioning that under the RUN_TO_PARITY feature, the number of preemptions
> triggered by 'pick_eevdf != se' would be 2.25 times that of the original version, which could
> lead to a series of other performance issues. However, logically speaking, this is indeed reasonable. :(
> 
>

I wonder if we could do this only for NO_RUN_TO_PARITY? That is to say, if RUN_TO_PARITY is enabled,
we do not preempt the current task based on its eligibility in check_preempt_wakeup_fair()
or entity_tick(). Personally I have no objection to increasing preemption a little bit. However,
it seems that we have encountered over-scheduling, which is why RUN_TO_PARITY was introduced,
and RUN_TO_PARITY means "respect the slice" per my understanding.
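
To make the suggestion concrete, here is a toy user-space sketch, not kernel code
and untested against any tree. All toy_* names are hypothetical. It assumes every
entity has equal weight, so "eligible" reduces to "vruntime is not above the plain
average", whereas the real entity_eligible() uses a load-weighted average. The
run_to_parity flag stands in for sched_feat(RUN_TO_PARITY).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one entity per task, all with equal weight (assumption). */
struct toy_entity {
	long long vruntime;
};

/*
 * Eligible iff vruntime <= average vruntime of the queue (lag >= 0).
 * Compare n * vruntime against the sum to avoid a division.
 */
static bool toy_entity_eligible(const struct toy_entity *rq, size_t n,
				const struct toy_entity *se)
{
	long long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += rq[i].vruntime;

	return se->vruntime * (long long)n <= sum;
}

/*
 * Should an ineligible current entity trigger a reschedule?
 * With run_to_parity set, current keeps running ("respect the slice"),
 * so eligibility alone never forces a preemption.
 */
static bool toy_resched_curr_if_ineligible(const struct toy_entity *rq, size_t n,
					   const struct toy_entity *curr,
					   bool run_to_parity)
{
	if (run_to_parity)
		return false;

	return !toy_entity_eligible(rq, n, curr);
}
```

With vruntimes 100, 200 and 300 the average is 200, so the entity at 300 is
ineligible. The helper asks for a reschedule only when run_to_parity is false.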

> > So I think NEXT_BUDDY has more or less reduced the rb-tree scan.
> > 
> > thanks,
> > Chenyu
> 
> I'm not completely sure if my understanding is correct, but NEXT_BUDDY can only cache the process
> that has been woken up; it doesn't necessarily correspond to the result returned by pick_eevdf.  Furthermore,
> even if it does cache the result returned by pick_eevdf, by the time the next scheduling occurs, due to
> other processes enqueuing or dequeuing, it might not be the result picked by pick_eevdf at that moment.
> Hence, it's a 'best effort' approach, and therefore, its impact on scheduling latency may vary depending
> on the use case.
>

That is true. Currently NEXT_BUDDY is set to the wakee if it is eligible, which does not mean it is the best
candidate in the tree. I think it is a 'best effort' to reduce wakeup latency rather than to improve fairness.
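
A toy user-space sketch of that 'best effort' behaviour, not kernel code. All
toy_* names are hypothetical, weights are assumed equal, a linear scan stands in
for the rb-tree walk, and the hint is consumed after one use for simplicity. The
pick takes the buddy hint if it is still eligible, even when another eligible
entity has an earlier deadline, and falls back to a full scan otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_se {
	long long vruntime;
	long long deadline;
};

struct toy_rq {
	struct toy_se *ents;	/* all queued entities */
	size_t n;
	struct toy_se *next;	/* buddy hint; may be NULL or stale */
	unsigned long scans;	/* how many full scans were needed */
};

/* Equal-weight eligibility (assumption): vruntime <= average vruntime. */
static bool toy_eligible(const struct toy_rq *rq, const struct toy_se *se)
{
	long long sum = 0;

	for (size_t i = 0; i < rq->n; i++)
		sum += rq->ents[i].vruntime;

	return se->vruntime * (long long)rq->n <= sum;
}

/* Full scan: the eligible entity with the earliest deadline. */
static struct toy_se *toy_pick_eevdf(struct toy_rq *rq)
{
	struct toy_se *best = NULL;

	rq->scans++;
	for (size_t i = 0; i < rq->n; i++) {
		struct toy_se *se = &rq->ents[i];

		if (!toy_eligible(rq, se))
			continue;
		if (!best || se->deadline < best->deadline)
			best = se;
	}
	return best;
}

/* Fast path: use the hint if it is still eligible, else scan. */
static struct toy_se *toy_pick_next(struct toy_rq *rq)
{
	struct toy_se *hint = rq->next;

	if (hint && toy_eligible(rq, hint)) {
		rq->next = NULL;	/* toy choice: hint is used once */
		return hint;
	}
	return toy_pick_eevdf(rq);
}
```

In this model a hinted wakee with a later deadline is still picked without a
scan, which helps wakeup latency but is not what the full scan would choose. A
hint that has become ineligible is ignored and the scan runs as usual.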

thanks,
Chenyu