Date: Fri, 7 Jun 2024 10:38:14 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Chunxin Zang <spring.cxz@...il.com>
CC: K Prateek Nayak <kprateek.nayak@....com>, <mingo@...hat.com>, "Peter
 Zijlstra" <peterz@...radead.org>, <juri.lelli@...hat.com>,
	<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
	<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
	<bristot@...hat.com>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<yangchen11@...iang.com>, Jerry Zhou <zhouchunhua@...iang.com>, Chunxin Zang
	<zangchunxin@...iang.com>, Balakumaran Kannan <kumaran.4353@...il.com>, "Mike
 Galbraith" <efault@....de>
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
 ineligible

On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
> 
> 
> > On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@...el.com> wrote:
> > 
> > 
> > Sorry for the late reply, and thanks for helping clarify this. Yes, this is
> > what my previous concern was:
> > 1. It does not consider the cgroup hierarchy and does not check preemption at
> >   the same level, which is covered by find_matching_se().
> > 2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant, because
> >   pick_eevdf() will check the eligibility of current later anyway. But as
> >   pointed out by Chunxin, his concern is the double traversal of the rb-tree.
> >   I just wonder if we could leverage cfs_rq->next to store the next
> >   candidate, so it can be picked directly in the 2nd pick as a fast path?
> >   Something like the below, untested:
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..f716646d595e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
> > static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> > {
> >        struct task_struct *curr = rq->curr;
> > -       struct sched_entity *se = &curr->se, *pse = &p->se;
> > +       struct sched_entity *se = &curr->se, *pse = &p->se, *next;
> >        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> >        int cse_is_idle, pse_is_idle;
> > 
> > @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >        /*
> >         * XXX pick_eevdf(cfs_rq) != se ?
> >         */
> > -       if (pick_eevdf(cfs_rq) == pse)
> > +       next = pick_eevdf(cfs_rq);
> > +       if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
> > +               set_next_buddy(next);
> > +
> > +       if (next == pse)
> >                goto preempt;
> > 
> >        return;
> > 
> > 
> > thanks,
> > Chenyu
> 
> Hi Chen
> 
> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
> the RB-tree twice, I initially had two methods in mind. 
> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
>   This idea is similar to the one you proposed this time. 
> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair',
>   because I believe that 'checking whether preemption is necessary' and 'finding the optimal
>   process to schedule' are two different things.

I agree, and it seems that in the current EEVDF implementation the former relies on the latter.
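To make that coupling concrete, here is a rough user-space sketch (my own illustration, not kernel code; the types and the unweighted eligibility rule are simplifications): the wakeup-preemption question is answered by doing the full pick and comparing, which is exactly how the former relies on the latter.

```c
#include <stddef.h>

/* Toy model of one cfs_rq level: each entity has a vruntime and a
 * virtual deadline; toy_pick() returns the eligible entity with the
 * earliest deadline, loosely mimicking pick_eevdf(). */
struct toy_entity { long vruntime; long deadline; };

/* Simplified, unweighted eligibility: not ahead of the average. */
static int toy_eligible(const struct toy_entity *e, long avg_vruntime)
{
	return e->vruntime <= avg_vruntime;
}

static const struct toy_entity *toy_pick(const struct toy_entity *q,
					 size_t n, long avg_vruntime)
{
	const struct toy_entity *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!toy_eligible(&q[i], avg_vruntime))
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}
	return best;
}

/* "Should the waking entity p preempt?" is answered by running the
 * full pick and comparing the result against p. */
static int toy_should_preempt(const struct toy_entity *q, size_t n,
			      long avg_vruntime, const struct toy_entity *p)
{
	return toy_pick(q, n, avg_vruntime) == p;
}
```

So even when the caller only wants a yes/no answer about one entity, the whole queue is scanned to find the single best candidate first.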

> 'check_preempt_wakeup_fair' is not just to
>   check if the newly awakened process should preempt the current process; it can also serve
>   as an opportunity to check whether any other processes should preempt the current one,
>   thereby improving the real-time responsiveness of the scheduler. Although pick_eevdf now
>   also evaluates the eligibility of 'curr', if the entity it returns is not the awakened process,
>   the current process will still not be preempted.

I think Mike has proposed a patch to deal with the scenario you mentioned above:
https://lore.kernel.org/lkml/e17d3d90440997b970067fe9eaf088903c65f41d.camel@gmx.de/

And I suppose you are referring to increasing the chances of preempting current, rather than
reducing the number of pick_eevdf() invocations in check_preempt_wakeup_fair().

> Therefore, I posted the v2 PATCH. 
>   The implementation of v2 PATCH might express this point more clearly. 
> https://lore.kernel.org/lkml/20240529141806.16029-1-spring.cxz@gmail.com/T/
>

Let me take a look at it and do some tests.
 
> I previously implemented and tested both of these methods, and the test results showed that
> method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
> think about it, perhaps method 1 could also be viable at the same time. :)
>

Actually I found that, even without any changes, enabling the sched feature NEXT_BUDDY reduces
both the wakeup latency and the request latency. The following is the schbench result on a
240-CPU system:

NO_NEXT_BUDDY
Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
        50.0th: 6          (429125 samples)
        90.0th: 14         (682355 samples)
      * 99.0th: 29         (126695 samples)
        99.9th: 529        (14603 samples)
        min=1, max=4741
Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
        50.0th: 14992      (550939 samples)
        90.0th: 15376      (668687 samples)
      * 99.0th: 15600      (128111 samples)
        99.9th: 15888      (11238 samples)
        min=3528, max=31677
RPS percentiles (requests) runtime 100 (s) (101 total samples)
        20.0th: 16864      (31 samples)
      * 50.0th: 16928      (26 samples)
        90.0th: 17248      (36 samples)
        min=16615, max=20041
average rps: 17025.23

NEXT_BUDDY
Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
        50.0th: 5          (376845 samples)
        90.0th: 12         (632075 samples)
      * 99.0th: 24         (114398 samples)
        99.9th: 105        (13737 samples)
        min=1, max=7428
Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
        50.0th: 14480      (524763 samples)
        90.0th: 15216      (647982 samples)
      * 99.0th: 15472      (130730 samples)
        99.9th: 15728      (13980 samples)
        min=3542, max=34805
RPS percentiles (requests) runtime 100 (s) (101 total samples)
        20.0th: 16544      (62 samples)
      * 50.0th: 16544      (0 samples)
        90.0th: 16608      (37 samples)
        min=16470, max=16648
average rps: 16572.68

So I think NEXT_BUDDY more or less reduces the rb-tree scans.
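That matches the intuition behind the earlier cfs_rq->next idea. As a rough user-space sketch (made-up structures and a flat array standing in for the rb-tree, not the kernel's buddy code), caching the candidate found at wakeup lets the subsequent pick skip the scan entirely:

```c
#include <stddef.h>

struct buddy_ent { long deadline; };

struct buddy_rq {
	struct buddy_ent *ents[8];
	int n;
	struct buddy_ent *next;	/* cached candidate, like cfs_rq->next */
	int scans;		/* full scans, standing in for rb-tree walks */
};

/* Slow path: scan everything for the earliest deadline. */
static struct buddy_ent *buddy_pick_slow(struct buddy_rq *rq)
{
	struct buddy_ent *best = NULL;
	int i;

	rq->scans++;
	for (i = 0; i < rq->n; i++)
		if (!best || rq->ents[i]->deadline < best->deadline)
			best = rq->ents[i];
	return best;
}

/* Fast path: if the wakeup-preemption check already found and cached
 * a candidate, consume it instead of scanning again. */
static struct buddy_ent *buddy_pick(struct buddy_rq *rq)
{
	if (rq->next) {
		struct buddy_ent *e = rq->next;

		rq->next = NULL;
		return e;
	}
	return buddy_pick_slow(rq);
}
```

In this toy model the wakeup check performs the one full scan and stashes the result, so the pick that follows in schedule() is O(1).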

thanks,
Chenyu
