Message-Id: <9F5C4D69-4238-44DA-9247-5CE8E7B3ECB6@gmail.com>
Date: Tue, 11 Jun 2024 21:10:50 +0800
From: Chunxin Zang <spring.cxz@...il.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>,
mingo@...hat.com,
Peter Zijlstra <peterz@...radead.org>,
juri.lelli@...hat.com,
vincent.guittot@...aro.org,
dietmar.eggemann@....com,
rostedt@...dmis.org,
bsegall@...gle.com,
mgorman@...e.de,
bristot@...hat.com,
vschneid@...hat.com,
linux-kernel@...r.kernel.org,
yangchen11@...iang.com,
Jerry Zhou <zhouchunhua@...iang.com>,
Chunxin Zang <zangchunxin@...iang.com>,
Balakumaran Kannan <kumaran.4353@...il.com>,
Mike Galbraith <efault@....de>
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
ineligible
> On Jun 7, 2024, at 10:38, Chen Yu <yu.c.chen@...el.com> wrote:
>
> On 2024-06-06 at 09:46:53 +0800, Chunxin Zang wrote:
>>
>>
>>> On Jun 6, 2024, at 01:19, Chen Yu <yu.c.chen@...el.com> wrote:
>>>
>>>
>>> Sorry for the late reply and thanks for help clarify this. Yes, this is
>>> what my previous concern was:
>>> 1. It does not consider the cgroup and does not check preemption in the same
>>> level which is covered by find_matching_se().
>>> 2. The if (!entity_eligible(cfs_rq, se)) for current is redundant because
>>> later pick_eevdf() will check the eligible of current anyway. But
>>> as pointed out by Chunxi, his concern is the double-traverse of the rb-tree,
>>> I just wonder if we could leverage the cfs_rq->next to store the next
>>> candidate, so it can be picked directly in the 2nd pick as a fast path?
>>> Something like below untested:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8a5b1ae0aa55..f716646d595e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
>>>  static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>>>  {
>>>  	struct task_struct *curr = rq->curr;
>>> -	struct sched_entity *se = &curr->se, *pse = &p->se;
>>> +	struct sched_entity *se = &curr->se, *pse = &p->se, *next;
>>>  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>>>  	int cse_is_idle, pse_is_idle;
>>>
>>> @@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>  	/*
>>>  	 * XXX pick_eevdf(cfs_rq) != se ?
>>>  	 */
>>> -	if (pick_eevdf(cfs_rq) == pse)
>>> +	next = pick_eevdf(cfs_rq);
>>> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
>>> +		set_next_buddy(next);
>>> +
>>> +	if (next == pse)
>>>  		goto preempt;
>>>
>>>  	return;
>>>
>>>
>>> thanks,
>>> Chenyu
>>
>> Hi Chen
>>
>> First of all, thank you for your patient response. Regarding the issue of avoiding traversing
>> the RB-tree twice, I initially had two methods in mind.
>> 1. Cache the optimal result so that it can be used directly during the second pick_eevdf operation.
>> This idea is similar to the one you proposed this time.
>> 2. Avoid the pick_eevdf operation as much as possible within 'check_preempt_wakeup_fair.'
>> Because I believe that 'checking whether preemption is necessary' and 'finding the optimal
>> process to schedule' are two different things.
>
> I agree, and it seems that in current eevdf implementation the former relies on the latter.
>
>> 'check_preempt_wakeup_fair' is not just to
>> check if the newly awakened process should preempt the current process; it can also serve
>> as an opportunity to check whether any other processes should preempt the current one,
>> thereby improving the real-time performance of the scheduler. Although now in pick_eevdf,
>> the legitimacy of 'curr' is also evaluated, if the result returned is not the awakened process,
>> then the current process will still not be preempted.
>
> I thought Mike has proposed a patch to deal with this scenario you mentioned above:
> https://lore.kernel.org/lkml/e17d3d90440997b970067fe9eaf088903c65f41d.camel@gmx.de/
>
> And I suppose you are refering to increase the preemption chance on current rather than reducing
> the invoke of pick_eevdf() in check_preempt_wakeup_fair().
Hi Chen,

Happy holidays. I believe the modifications here will indeed provide more opportunities for preemption, and thereby lower scheduling latency, while also genuinely reducing the number of pick_eevdf() calls. It's a win-win. :)

I ran a test: I applied my modifications on top of Mike's patch and added some statistical counters, following your earlier method, to assess the potential benefit of my changes.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..c5453866899f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8283,6 +8286,10 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int cse_is_idle, pse_is_idle;
+	bool patch_preempt = false;
+	bool pick_preempt = false;
+
+	schedstat_inc(rq->check_preempt_count);

 	if (unlikely(se == pse))
 		return;
@@ -8343,15 +8350,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);

+	if ((sched_feat(RUN_TO_PARITY) && se->vlag != se->deadline && !entity_eligible(cfs_rq, se))
+	    || (!sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, se))) {
+		schedstat_inc(rq->patch_preempt_count);
+		patch_preempt = true;
+	}
+
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
-	if (pick_eevdf(cfs_rq) == pse)
+	if (pick_eevdf(cfs_rq) != se) {
+		schedstat_inc(rq->pick_preempt_count);
+		pick_preempt = true;
 		goto preempt;
+	}

 	return;

 preempt:
+	if (patch_preempt && !pick_preempt)
+		schedstat_inc(rq->patch_preempt_only_count);
+	if (!patch_preempt && pick_preempt)
+		schedstat_inc(rq->pick_preempt_only_count);
+
+	schedstat_inc(rq->need_preempt_count);
+
 	resched_curr(rq);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..002c6b0f966a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1141,6 +1141,12 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+	unsigned int		check_preempt_count;
+	unsigned int		need_preempt_count;
+	unsigned int		patch_preempt_count;
+	unsigned int		patch_preempt_only_count;
+	unsigned int		pick_preempt_count;
+	unsigned int		pick_preempt_only_count;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..fe5487572409 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,21 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu *** %u %u * %u %u * %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->check_preempt_count,
+		    rq->need_preempt_count,
+		    rq->patch_preempt_count,
+		    rq->patch_preempt_only_count,
+		    rq->pick_preempt_count,
+		    rq->pick_preempt_only_count);
+
 		seq_printf(seq, "\n");
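
With this debug patch the new counters are simply appended to each per-CPU line of /proc/schedstat; the '***' and '*' are only visual separators. Reading the format string back, each line ends with (field names shown here, not values):

  cpuN ... *** check_preempt_count need_preempt_count * patch_preempt_count patch_preempt_only_count * pick_preempt_count pick_preempt_only_count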
The test results are as follows:

RUN_TO_PARITY:
                                      EEVDF      PATCH
  .stat.check_preempt_count         5053054    5029546
  .stat.need_preempt_count          0570520    1282780
  .stat.patch_preempt_count         -------    0038602
  .stat.patch_preempt_only_count    -------    0000000
  .stat.pick_preempt_count          -------    1282780
  .stat.pick_preempt_only_count     -------    1244178

NO_RUN_TO_PARITY:
                                      EEVDF      PATCH
  .stat.check_preempt_count         5018589    5005812
  .stat.need_preempt_count          3380513    2994773
  .stat.patch_preempt_count         -------    0907927
  .stat.patch_preempt_only_count    -------    0000000
  .stat.pick_preempt_count          -------    2994773
  .stat.pick_preempt_only_count     -------    2086846
Looking at the results, adding the ineligibility check on se within check_preempt_wakeup_fair avoids about 3% of the pick_eevdf calls with RUN_TO_PARITY, and about 30% of them with NO_RUN_TO_PARITY. Also, patch_preempt_only_count is 0, meaning the ineligibility check on se never triggered a preemption that pick_eevdf would not have requested anyway, so the check introduces no false positives.

It's worth mentioning that under RUN_TO_PARITY, the number of preemptions triggered by 'pick_eevdf != se' is about 2.25 times that of the original version, which could lead to a series of other performance issues. Logically speaking, though, this is indeed reasonable. :(
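
To spell out the arithmetic behind those figures (my reading is that the ratios are taken against the number of preemptions, i.e. need_preempt_count, rather than against check_preempt_count):

  RUN_TO_PARITY:        38602 / 1282780 ~= 3.0%
  NO_RUN_TO_PARITY:    907927 / 2994773 ~= 30.3%
  preemption growth:  1282780 /  570520 ~= 2.25x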
>
>> Therefore, I posted the v2 PATCH.
>> The implementation of v2 PATCH might express this point more clearly.
>> https://lore.kernel.org/lkml/20240529141806.16029-1-spring.cxz@gmail.com/T/
>>
>
> Let me take a look at it and do some tests.
Thank you for doing this :)
>
>> I previously implemented and tested both of these methods, and the test results showed that
>> method 2 had somewhat more obvious benefits. Therefore, I submitted method 2. Now that I
>> think about it, perhaps method 1 could also be viable at the same time. :)
>>
>
> Actually I found that, even without any changes, if we enabled sched feature NEXT_BUDDY, the
> wakeup latency/request latency are both reduced. The following is the schbench result on a
> 240 CPUs system:
>
> NO_NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1698990 total samples)
> 50.0th: 6 (429125 samples)
> 90.0th: 14 (682355 samples)
> * 99.0th: 29 (126695 samples)
> 99.9th: 529 (14603 samples)
> min=1, max=4741
> Request Latencies percentiles (usec) runtime 100 (s) (1702523 total samples)
> 50.0th: 14992 (550939 samples)
> 90.0th: 15376 (668687 samples)
> * 99.0th: 15600 (128111 samples)
> 99.9th: 15888 (11238 samples)
> min=3528, max=31677
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
> 20.0th: 16864 (31 samples)
> * 50.0th: 16928 (26 samples)
> 90.0th: 17248 (36 samples)
> min=16615, max=20041
> average rps: 17025.23
>
> NEXT_BUDDY
> Wakeup Latencies percentiles (usec) runtime 100 (s) (1653564 total samples)
> 50.0th: 5 (376845 samples)
> 90.0th: 12 (632075 samples)
> * 99.0th: 24 (114398 samples)
> 99.9th: 105 (13737 samples)
> min=1, max=7428
> Request Latencies percentiles (usec) runtime 100 (s) (1657268 total samples)
> 50.0th: 14480 (524763 samples)
> 90.0th: 15216 (647982 samples)
> * 99.0th: 15472 (130730 samples)
> 99.9th: 15728 (13980 samples)
> min=3542, max=34805
> RPS percentiles (requests) runtime 100 (s) (101 total samples)
> 20.0th: 16544 (62 samples)
> * 50.0th: 16544 (0 samples)
> 90.0th: 16608 (37 samples)
> min=16470, max=16648
> average rps: 16572.68
>
> So I think NEXT_BUDDY has more or less reduced the rb-tree scan.
>
> thanks,
> Chenyu
I'm not completely sure my understanding is correct, but NEXT_BUDDY can only cache the process that was just woken up, which does not necessarily correspond to the result pick_eevdf would return. Furthermore, even if it did cache the result returned by pick_eevdf, by the time the next scheduling decision is made, other processes may have been enqueued or dequeued, so the cached entity might no longer be what pick_eevdf would pick at that moment. Hence it is a 'best effort' approach, and its impact on scheduling latency may vary depending on the use case.
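
For context, the NEXT_BUDDY fast path in pick_next_entity() currently looks roughly like this (quoted from memory, so please treat it as a sketch rather than the exact upstream code):

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
	/*
	 * Enabling NEXT_BUDDY will affect latency but not fairness.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}

So even when the buddy is set on the wakeup path, it is only honoured if it is still eligible at pick time; otherwise we fall back to the full pick_eevdf() walk, which is why I call it 'best effort'.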
thanks
Chunxin