linux-kernel - Re: [PATCH] sched/fair: Reschedule the cfs

Message-ID: <cf8fdb86-194b-34c4-f5e8-dd7ddc56d8d9@amd.com>
Date: Tue, 28 May 2024 10:32:23 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chunxin Zang <spring.cxz@...il.com>, Chen Yu <yu.c.chen@...el.com>
Cc: mingo@...hat.com, Peter Zijlstra <peterz@...radead.org>,
 juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 bristot@...hat.com, vschneid@...hat.com, linux-kernel@...r.kernel.org,
 yangchen11@...iang.com, zhouchunhua@...iang.com, zangchunxin@...iang.com,
 Balakumaran Kannan <kumaran.4353@...il.com>, Mike Galbraith <efault@....de>
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
 ineligible

Hello Chunxin,

On 5/28/2024 8:12 AM, Chunxin Zang wrote:
> 
>> On May 24, 2024, at 23:30, Chen Yu <yu.c.chen@...el.com> wrote:
>>
>> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
>>> I found that some tasks have been running for a long enough time and
>>> have become ineligible, but they are still not releasing the CPU. This
>>> will increase the scheduling delay of other processes. Therefore, I
>>> tried checking the current process in wakeup_preempt and entity_tick,
>>> and if it is ineligible, reschedule that cfs queue.
>>>
>>> The modification can reduce the scheduling delay by about 30% when
>>> RUN_TO_PARITY is enabled.
>>> So far, it has been running well in my test environment, and I have
>>> pasted some test results below.
>>>
>>
>> Interesting, besides hackbench, I assume that you have workload in
>> real production environment that is sensitive to wakeup latency?
> 
> Hi Chen
> 
> Yes, my workloads are quite sensitive to wakeup latency.
>>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>> 			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>> 		return;
>>> #endif
>>> +
>>> +	if (!entity_eligible(cfs_rq, curr))
>>> +		resched_curr(rq_of(cfs_rq));
>>> }
>>>
>>
>> entity_tick() -> update_curr() -> update_deadline():
>> se->vruntime >= se->deadline ? resched_curr()
>> Only when current has exhausted its slice will it be scheduled out.
>>
>> So here you want to schedule current out if its lag becomes 0.
>>
>> In the latest sched/eevdf branch, it is controlled by two sched features:
>> RESPECT_SLICE: Inhibit preemption until the current task has exhausted its slice.
>> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
>>
>> Maybe something like this can achieve your goal
>> 	if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
>> 		resched_curr(rq_of(cfs_rq));
>>
>>>
>>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>> 	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>> 		return;
>>>
>>> +	if (!entity_eligible(cfs_rq, se))
>>> +		goto preempt;
>>> +
>>
>> Not sure if this is applicable, later in this function, pick_eevdf() checks
>> if the current is eligible, !entity_eligible(cfs_rq, curr), if not, curr will
>> be evicted. And this change does not consider the cgroup hierarchy.

The above line will be referred to as [1] below.

>>
>> Besides, the check of current's eligibility can give a false negative
>> result if the enqueued entity has a positive lag. Prateek proposed to
>> remove the check of current's eligibility in pick_eevdf():
>> https://lore.kernel.org/lkml/20240325060226.1540-2-kprateek.nayak@amd.com/
> 
> Thank you for letting me know about Peter's latest updates and thoughts.
> Actually, the original intention of my modification was to minimize the
> traversal of the rb-tree as much as possible. For example, in the following
> scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
> 'pick_eevdf' to return an optimal 'se', and then trigger  'resched_curr'. After
> resched, the scheduler will call 'pick_eevdf' again, traversing the
> rb-tree once more. This ultimately results in the rb-tree being traversed
> twice. If it's possible to determine that 'curr' is ineligible within 'wakeup_preempt'
> and directly trigger a 'resched', it would reduce the traversal of the rb-tree
> by one time.
> 
> 
> wakeup_preempt -> pick_eevdf                  -> resched_curr
>                   |-> 'traverse the rb-tree'      |
> schedule -> pick_eevdf
>             |-> 'traverse the rb-tree'

I see what you mean but a couple of things:

(I'm adding the check_preempt_wakeup_fair() hunk from the original patch
below for ease of interpretation)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03be0d1330a6..a0005d240db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>  		return;
>  
> +	if (!entity_eligible(cfs_rq, se))
> +		goto preempt;
> +

This check uses the root cfs_rq since "task_cfs_rq()" returns the
"rq->cfs" of the runqueue the task is on. In the presence of cgroups or
CONFIG_SCHED_AUTOGROUP, there is a good chance that the task is queued
on a higher order cfs_rq, and this entity_eligible() calculation might
not be valid since the vruntime calculation for the "se" is relative to
the "cfs_rq" on which it is queued. Please correct me if I'm wrong but
I believe that is what Chenyu was referring to in [1].
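
To make the dependence on the cfs_rq concrete, here is a rough userspace
sketch (not kernel code; struct toy_se, struct toy_cfs_rq and
toy_entity_eligible() are names made up for this illustration). It
assumes the usual EEVDF rule that an entity is eligible when its
vruntime is at or below the load-weighted average vruntime of the queue
it is compared against:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for illustration; NOT the kernel's sched_entity / cfs_rq. */
struct toy_se {
	int64_t vruntime;	/* virtual runtime, local to one queue */
	int64_t weight;		/* load weight */
};

struct toy_cfs_rq {
	const struct toy_se *entities;	/* everything on this queue, incl. current */
	int nr;
};

/*
 * An entity is eligible when its vruntime v does not exceed the
 * load-weighted average vruntime of the queue, i.e.
 *	sum_i w_i * (v_i - v) >= 0
 * The answer only means something if "se" actually lives on "cfs_rq".
 */
bool toy_entity_eligible(const struct toy_cfs_rq *cfs_rq, const struct toy_se *se)
{
	int64_t sum = 0;

	for (int i = 0; i < cfs_rq->nr; i++)
		sum += cfs_rq->entities[i].weight *
		       (cfs_rq->entities[i].vruntime - se->vruntime);

	return sum >= 0;
}

/* Same entity, two queues, two different answers. */
void toy_eligibility_demo(void)
{
	/* Queue the task is really on (a group cfs_rq); the task is group_q[0]. */
	const struct toy_se group_q[] = { { 100, 1024 }, { 140, 1024 }, { 120, 1024 } };
	/* Unrelated queue with its own virtual timeline. */
	const struct toy_se other_q[] = { { 40, 1024 }, { 60, 1024 } };
	const struct toy_cfs_rq group = { group_q, 3 };
	const struct toy_cfs_rq other = { other_q, 2 };

	assert(toy_entity_eligible(&group, &group_q[0]));	/* valid check */
	assert(!toy_entity_eligible(&other, &group_q[0]));	/* meaningless check */
}
```

The second assert is the problematic case: the numbers produce an
answer, but the two vruntimes come from different virtual timelines, so
the answer says nothing about whether the task should be preempted.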

>  	find_matching_se(&se, &pse);
>  	WARN_ON_ONCE(!pse);
>  
> -- 

In addition to that, there is an update_curr() call further down for the
first cfs_rq on which the hierarchies of both entities are queued, as
found by find_matching_se(). I believe that is required too, to update
the vruntime and deadline of the entity where preemption can happen.
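
For reference, the walk to the common queue can be sketched in userspace
as below. This is only an approximation of what find_matching_se() does;
the toy_* types and field layout are invented for this example, and it
assumes both hierarchies end on the same root queue:

```c
#include <assert.h>
#include <stddef.h>

/* Toy types for illustration only; NOT the kernel's definitions. */
struct toy_cfs_rq { int id; };

struct toy_se {
	struct toy_se *parent;		/* group entity one level up, or NULL */
	struct toy_cfs_rq *cfs_rq;	/* queue this entity is enqueued on */
	int depth;			/* 0 for entities on the root queue */
};

/*
 * Walk both entities up until they sit on the same queue.
 * First equalize the depths, then climb in lockstep.
 */
void toy_find_matching_se(struct toy_se **se, struct toy_se **pse)
{
	while ((*se)->depth > (*pse)->depth)
		*se = (*se)->parent;
	while ((*pse)->depth > (*se)->depth)
		*pse = (*pse)->parent;
	while ((*se)->cfs_rq != (*pse)->cfs_rq) {
		*se = (*se)->parent;
		*pse = (*pse)->parent;
	}
}

void toy_matching_demo(void)
{
	struct toy_cfs_rq root = { 0 }, grp = { 1 };
	struct toy_se grp_se = { NULL, &root, 0 };	/* group entity on root */
	struct toy_se curr = { &grp_se, &grp, 1 };	/* current task inside the group */
	struct toy_se woken = { NULL, &root, 0 };	/* woken task directly on root */
	struct toy_se *se = &curr, *pse = &woken;

	toy_find_matching_se(&se, &pse);

	/* The comparison must happen between grp_se and woken, on root. */
	assert(se == &grp_se && pse == &woken);
	assert(se->cfs_rq == &root);
}
```

Any eligibility or deadline comparison is only meaningful between the
two entities this walk ends on, which is why the check has to come after
it (and after update_curr() on that queue).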

If you want to circumvent a second call to pick_eevdf(), could you
perhaps do:

(Only build tested)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9eb63573110c..653b1bee1e62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	update_curr(cfs_rq);
 
 	/*
-	 * XXX pick_eevdf(cfs_rq) != se ?
+	 * If the hierarchy of current task is ineligible at the common
+	 * point on the newly woken entity, there is a good chance of
+	 * wakeup preemption by the newly woken entity. Mark for resched
+	 * and allow pick_eevdf() in schedule() to judge which task to
+	 * run next.
 	 */
-	if (pick_eevdf(cfs_rq) == pse)
+	if (!entity_eligible(cfs_rq, se))
 		goto preempt;
 
 	return;

--

There are other implications here, which are specifically highlighted by
the "XXX pick_eevdf(cfs_rq) != se ?" comment: even if the waking entity
is not the entity with the earliest eligible virtual deadline (EEVD),
the current task is still preempted if any other entity has the EEVD.

Mike's box gave switching to above two thumbs up; I have to check what
my box says :)
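
A small userspace model of the three possible checks may help here. It
is a sketch only: toy_pick_eevdf() is a linear scan invented for this
example, it ignores RUN_TO_PARITY and everything else the real
pick_eevdf() does, and it assumes the plain rule "earliest virtual
deadline among eligible entities":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy entity for illustration; NOT the kernel's struct sched_entity. */
struct toy_se {
	int64_t vruntime;
	int64_t deadline;	/* virtual deadline */
	int64_t weight;
};

/* Eligible: vruntime at or below the load-weighted average of the queue. */
bool toy_eligible(const struct toy_se *q, int nr, const struct toy_se *se)
{
	int64_t sum = 0;

	for (int i = 0; i < nr; i++)
		sum += q[i].weight * (q[i].vruntime - se->vruntime);
	return sum >= 0;
}

/* Linear-scan stand-in for pick_eevdf(): earliest deadline among eligible. */
const struct toy_se *toy_pick_eevdf(const struct toy_se *q, int nr)
{
	const struct toy_se *best = NULL;

	for (int i = 0; i < nr; i++) {
		if (!toy_eligible(q, nr, &q[i]))
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}
	return best;
}

void toy_preempt_demo(void)
{
	/* q[0] = current (se), q[1] = another queued entity, q[2] = woken (pse) */
	const struct toy_se q[] = {
		{ 130, 140, 1 },
		{  90, 100, 1 },
		{ 100, 150, 1 },
	};
	const struct toy_se *se = &q[0], *pse = &q[2];
	const struct toy_se *pick = toy_pick_eevdf(q, 3);

	assert(pick == &q[1]);			/* neither se nor pse has the EEVD */
	assert(!(pick == pse));			/* "== pse" check: no preemption */
	assert(pick != se);			/* "!= se" check: preempt */
	assert(!toy_eligible(q, 3, se));	/* "!entity_eligible(se)": preempt */
}
```

In this made-up queue the existing "== pse" check leaves current
running, while both the "!= se" variant and the eligibility check mark
it for resched even though the woken entity is not the one that will be
picked.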

Following are DeathStarBench results with your original patch compared
to v6.9-rc5 based tip:sched/core:

==================================================================
Test          : DeathStarBench
Why?          : Some tasks here do not like aggressive preemption
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning      scaling     tip            eager_preempt (pct imp)
 1CCD           1       1.00            0.99 (%diff: -1.13%)
 2CCD           2       1.00            0.97 (%diff: -3.21%)
 4CCD           3       1.00            0.97 (%diff: -3.41%)
 8CCD           6       1.00            0.97 (%diff: -3.20%)
--

I'll give the variants mentioned in the thread a try too to see if
some of my assumptions around heavy preemption hold good. I was also
able to dig up an old patch by Balakumaran Kannan which skips
pick_eevdf() altogether if "pse" is ineligible. That also seems like
a good optimization given the current check in
check_preempt_wakeup_fair(), but it perhaps doesn't help the
wakeup-latency sensitivity you are optimizing for; it only avoids the
rb-tree traversal when there is no chance of pick_eevdf() returning "pse":
https://lore.kernel.org/lkml/20240301130100.267727-1-kumaran.4353@gmail.com/

--
Thanks and Regards,
Prateek

> 
> 
> Of course, this would break the semantics of RESPECT_SLICE as well as
> RUN_TO_PARITY. So, this might be considered a performance enhancement
> for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
> 
> thanks 
> Chunxin
> 
> 
>> If I understand your requirement correctly, you want to reduce the wakeup
>> latency. There is some code under development by Peter that could
>> customize a task's wakeup latency by setting its slice:
>> https://lore.kernel.org/lkml/20240405110010.934104715@infradead.org/
>>
>> thanks,
>> Chenyu