Date: Tue, 11 Jun 2024 21:10:50 +0800
From: Chunxin Zang <spring.cxz@...il.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
 mingo@...hat.com,
 Peter Zijlstra <peterz@...radead.org>,
 juri.lelli@...hat.com,
 vincent.guittot@...aro.org,
 dietmar.eggemann@....com,
 rostedt@...dmis.org,
 bsegall@...gle.com,
 mgorman@...e.de,
 bristot@...hat.com,
 vschneid@...hat.com,
 linux-kernel@...r.kernel.org,
 yangchen11@...iang.com,
 Jerry Zhou <zhouchunhua@...iang.com>,
 Chunxin Zang <zangchunxin@...iang.com>,
 Balakumaran Kannan <kumaran.4353@...il.com>,
 Mike Galbraith <efault@....de>
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
 ineligible



> On Jun 7, 2024, at 10:38, Chen Yu <yu.c.chen@...el.com> wrote:
> 
> On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
>> 
>> 
>>> On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@...el.com> wrote:
>>> 
>>> 
>>> Sorry for the late reply, and thanks for helping clarify this. Yes, this is
>>> what my previous concern was:
>>> 1. It does not consider the cgroup hierarchy and does not check preemption at
>>>  the same level, which is what find_matching_se() covers.
>>> 2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant because
>>>  pick_eevdf() will check the eligibility of current later anyway. But,
>>>  as pointed out by Chunxin, his concern is the double traversal of the rb-tree,
>>>  so I wonder if we could leverage cfs_rq->next to store the next
>>>  candidate, so it can be picked directly in the 2nd pick as a fast path?
>>>  Something like below, untested:
>>> 
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8a5b1ae0aa55..f716646d595e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
>>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>> {
>>>       struct task_struct *curr = rq->curr;
>>> -       struct sched_entity *se = &curr->se, *pse = &p->se;
>>> +       struct sched_entity *se = &curr->se, *pse = &p->se, *next;
>>>       struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>>>       int cse_is_idle, pse_is_idle;
>>> 
>>> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>       /*
>>>        * XXX pick_eevdf(cfs_rq) != se ?
>>>        */
>>> -       if (pick_eevdf(cfs_rq) == pse)
>>> +       next = pick_eevdf(cfs_rq);
>>> +       if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
>>> +               set_next_buddy(next);
>>> +
>>> +       if (next == pse)
>>>               goto preempt;
>>> 
>>>       return;
>>> 
>>> 
>>> thanks,
>>> Chenyu
>> 
>> Hi Chen
>> 
>> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
>> the RB-tree twice, I initially had two methods in mind. 
>> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
>>  This idea is similar to the one you proposed this time. 
>> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair',
>>  because I believe that 'checking whether preemption is necessary' and 'finding the optimal
>>  process to schedule' are two different things.
> 
> I agree, and it seems that in the current eevdf implementation the former relies on the latter.
> 
>> 'check_preempt_wakeup_fair' is not just there to
>>  check whether the newly awakened process should preempt the current process; it can also serve
>>  as an opportunity to check whether any other process should preempt the current one,
>>  thereby improving the real-time performance of the scheduler. Although pick_eevdf now
>>  also evaluates the eligibility of 'curr', if the entity it returns is not the awakened process,
>>  the current process will still not be preempted.
> 
> I think Mike has already proposed a patch to deal with the scenario you mentioned above:
> https://lore.kernel.org/lkml/e17d3d90440997b970067fe9eaf088903c65f41d.camel@gmx.de/
> 
> And I suppose you are referring to increasing the chance of preempting current rather than reducing
> the number of pick_eevdf() invocations in check_preempt_wakeup_fair().

Hi Chen

Happy holidays. I believe the modifications here will indeed provide more opportunities for preemption,
thereby leading to lower scheduling latencies, while also genuinely reducing the number of calls to pick_eevdf. It's a win-win situation. :)

I ran a test: I applied my modifications on top of Mike's patch and added some statistics counters,
following your earlier approach, in order to assess the potential benefits of my changes.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..c5453866899f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8283,6 +8286,10 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
        struct sched_entity *se = &curr->se, *pse = &p->se;
        struct cfs_rq *cfs_rq = task_cfs_rq(curr);
        int cse_is_idle, pse_is_idle;
+       bool patch_preempt = false;
+       bool pick_preempt = false;
+
+       schedstat_inc(rq->check_preempt_count);

        if (unlikely(se == pse))
                return;
@@ -8343,15 +8350,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
        cfs_rq = cfs_rq_of(se);
        update_curr(cfs_rq);

+       if ((sched_feat(RUN_TO_PARITY) && se->vlag != se->deadline && !entity_eligible(cfs_rq, se))
+                       || (!sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, se))) {
+               schedstat_inc(rq->patch_preempt_count);
+               patch_preempt = true;
+       }
+
        /*
         * XXX pick_eevdf(cfs_rq) != se ?
         */
-       if (pick_eevdf(cfs_rq) == pse)
+       if (pick_eevdf(cfs_rq) != se) {
+               schedstat_inc(rq->pick_preempt_count);
+               pick_preempt = true;
                goto preempt;
+       }

        return;

 preempt:
+       if (patch_preempt && !pick_preempt)
+               schedstat_inc(rq->patch_preempt_only_count);
+       if (!patch_preempt && pick_preempt)
+               schedstat_inc(rq->pick_preempt_only_count);
+
+       schedstat_inc(rq->need_preempt_count);
+
        resched_curr(rq);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..002c6b0f966a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1141,6 +1141,12 @@ struct rq {
        /* try_to_wake_up() stats */
        unsigned int            ttwu_count;
        unsigned int            ttwu_local;
+       unsigned int            check_preempt_count;
+       unsigned int            need_preempt_count;
+       unsigned int            patch_preempt_count;
+       unsigned int            patch_preempt_only_count;
+       unsigned int            pick_preempt_count;
+       unsigned int            pick_preempt_only_count;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..fe5487572409 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,21 @@ static int show_schedstat(struct seq_file *seq, void *v)

                /* runqueue-specific stats */
                seq_printf(seq,
-                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu *** %u %u * %u %u * %u %u",
                    cpu, rq->yld_count,
                    rq->sched_count, rq->sched_goidle,
                    rq->ttwu_count, rq->ttwu_local,
                    rq->rq_cpu_time,
-                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+                   rq->check_preempt_count,
+                   rq->need_preempt_count,
+                   rq->patch_preempt_count,
+                   rq->patch_preempt_only_count,
+                   rq->pick_preempt_count,
+                   rq->pick_preempt_only_count);
+

                seq_printf(seq, "\n");

The test results are as follows:

  RUN_TO_PARITY:
                                   EEVDF        PATCH
 .stat.check_preempt_count         5053054      5029546
 .stat.need_preempt_count          0570520      1282780
 .stat.patch_preempt_count         -------      0038602
 .stat.patch_preempt_only_count    -------      0000000
 .stat.pick_preempt_count          -------      1282780
 .stat.pick_preempt_only_count     -------      1244178

  NO_RUN_TO_PARITY:
                                   EEVDF        PATCH
 .stat.check_preempt_count         5018589      5005812
 .stat.need_preempt_count          3380513      2994773
 .stat.patch_preempt_count         -------      0907927
 .stat.patch_preempt_only_count    -------      0000000
 .stat.pick_preempt_count          -------      2994773
 .stat.pick_preempt_only_count     -------      2086846

Looking at the results, adding an ineligibility check on se within check_preempt_wakeup_fair
can avoid about 3% of the pick_eevdf calls under the RUN_TO_PARITY feature
(patch_preempt_count / need_preempt_count = 38602 / 1282780), and about 30% under
NO_RUN_TO_PARITY (907927 / 2994773). The patch_preempt_only_count is also 0, indicating that
the ineligibility check on se never triggers a preemption that pick_eevdf would not also have triggered.

It's worth mentioning that under the RUN_TO_PARITY feature, the number of preemptions
triggered by 'pick_eevdf(cfs_rq) != se' is about 2.25 times that of the original version
(need_preempt_count 1282780 vs. 570520), which could lead to a series of other performance issues.
Logically speaking, though, this behaviour is indeed reasonable. :(
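
To make the idea concrete, here is a minimal, untested sketch of the fast path being measured
above (a simplification for illustration, not the exact v2 patch); entity_eligible(), pick_eevdf(),
update_curr() and resched_curr() are the existing helpers in kernel/sched/fair.c:

/*
 * Sketch only: if current is already ineligible, any eligible entity may
 * run, so request a resched directly and skip the full pick_eevdf() walk.
 */
static void check_preempt_wakeup_sketch(struct rq *rq, struct cfs_rq *cfs_rq,
					struct sched_entity *se,
					struct sched_entity *pse)
{
	update_curr(cfs_rq);

	/* Fast path: current is no longer eligible under EEVDF. */
	if (!entity_eligible(cfs_rq, se)) {
		resched_curr(rq);
		return;
	}

	/* Slow path: do the full pick to decide whether pse should preempt. */
	if (pick_eevdf(cfs_rq) == pse)
		resched_curr(rq);
}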


> 
>> Therefore, I posted the v2 patch.
>>  The implementation of the v2 patch might express this point more clearly.
>> https://lore.kernel.org/lkml/20240529141806.16029-1-spring.cxz@gmail.com/T/
>> 
> 
> Let me take a look at it and do some tests.

Thank you for doing this :)

> 
>> I previously implemented and tested both of these methods, and the test results showed that
>> method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
>> think about it, perhaps method 1 could also be viable at the same time. :)
>> 
> 
> Actually I found that, even without any changes, if we enable the sched feature NEXT_BUDDY, the
> wakeup latency and request latency are both reduced. The following is the schbench result on a
> system with 240 CPUs:
> 
> NO_NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
>         50.0th: 6          (429125 samples)
>         90.0th: 14         (682355 samples)
>       * 99.0th: 29         (126695 samples)
>         99.9th: 529        (14603 samples)
>         min=1, max=4741
> Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
>         50.0th: 14992      (550939 samples)
>         90.0th: 15376      (668687 samples)
>       * 99.0th: 15600      (128111 samples)
>         99.9th: 15888      (11238 samples)
>         min=3528, max=31677
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
>         20.0th: 16864      (31 samples)
>       * 50.0th: 16928      (26 samples)
>         90.0th: 17248      (36 samples)
>         min=16615, max=20041
> average rps: 17025.23
> 
> NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
>         50.0th: 5          (376845 samples)
>         90.0th: 12         (632075 samples)
>       * 99.0th: 24         (114398 samples)
>         99.9th: 105        (13737 samples)
>         min=1, max=7428
> Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
>         50.0th: 14480      (524763 samples)
>         90.0th: 15216      (647982 samples)
>       * 99.0th: 15472      (130730 samples)
>         99.9th: 15728      (13980 samples)
>         min=3542, max=34805
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
>         20.0th: 16544      (62 samples)
>       * 50.0th: 16544      (0 samples)
>         90.0th: 16608      (37 samples)
>         min=16470, max=16648
> average rps: 16572.68
> 
> So I think NEXT_BUDDY has more or less reduced the rb-tree scan.
> 
> thanks,
> Chenyu

I'm not completely sure whether my understanding is correct, but NEXT_BUDDY can only cache the process
that has just been woken up; it doesn't necessarily correspond to the result returned by pick_eevdf.  Furthermore,
even if it did cache the result returned by pick_eevdf, by the time the next scheduling decision happens,
other processes may have been enqueued or dequeued, so the cached entity might no longer be what
pick_eevdf would pick at that moment. Hence, it's a 'best effort' approach, and its impact on scheduling
latency may vary depending on the use case.
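
In other words, the cached buddy is only a hint that has to be re-validated at pick time, roughly
like the following simplified sketch (an approximation for illustration, not the exact mainline
code; cfs_rq->next, sched_feat(), entity_eligible() and pick_eevdf() are the existing kernel names):

static struct sched_entity *pick_next_entity_sketch(struct cfs_rq *cfs_rq)
{
	/*
	 * The buddy may have been dequeued or become ineligible since it
	 * was set, so only use it when it is still an eligible candidate;
	 * otherwise fall back to the full rb-tree pick.
	 */
	if (sched_feat(NEXT_BUDDY) && cfs_rq->next &&
	    entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}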

thanks
Chunxin

