Message-ID: <ZmCeBwNO2t2ikrcm@chenyu5-mobl2>
Date: Thu, 6 Jun 2024 01:19:03 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Chunxin Zang
<spring.cxz@...il.com>
CC: Chunxin Zang <spring.cxz@...il.com>, <mingo@...hat.com>, Peter Zijlstra
<peterz@...radead.org>, <juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
<bristot@...hat.com>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<yangchen11@...iang.com>, <zhouchunhua@...iang.com>,
<zangchunxin@...iang.com>, Balakumaran Kannan <kumaran.4353@...il.com>, "Mike
Galbraith" <efault@....de>
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
ineligible
Hi Prateek, Chunxin,
On 2024-05-28 at 10:32:23 +0530, K Prateek Nayak wrote:
> Hello Chunxin,
>
> On 5/28/2024 8:12 AM, Chunxin Zang wrote:
> >
> >> On May 24, 2024, at 23:30, Chen Yu <yu.c.chen@...el.com> wrote:
> >>
> >> On 2024-05-24 at 21:40:11 +0800, Chunxin Zang wrote:
> >>> I found that some tasks have been running long enough to become
> >>> ineligible, but they still do not release the CPU. This increases
> >>> the scheduling delay of other processes. Therefore, I tried checking
> >>> the current task in wakeup_preempt and entity_tick, and if it is
> >>> ineligible, rescheduling that cfs_rq.
> >>>
> >>> The modification can reduce the scheduling delay by about 30% when
> >>> RUN_TO_PARITY is enabled.
> >>> So far, it has been running well in my test environment, and I have
> >>> pasted some test results below.
> >>>
> >>
> >> Interesting. Besides hackbench, I assume that you have workloads in a
> >> real production environment that are sensitive to wakeup latency?
> >
> > Hi Chen
> >
> > Yes, my workloads are quite sensitive to wakeup latency.
> >>
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 03be0d1330a6..a0005d240db5 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> >>>  	    hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
> >>>  		return;
> >>>  #endif
> >>> +
> >>> +	if (!entity_eligible(cfs_rq, curr))
> >>> +		resched_curr(rq_of(cfs_rq));
> >>>  }
> >>>
> >>
> >> entity_tick() -> update_curr() -> update_deadline():
> >> se->vruntime >= se->deadline ? resched_curr()
> >> Only when current has expired its slice will it be scheduled out.
> >>
> >> So here you want to schedule current out if its lag becomes 0.
> >>
> >> In the latest sched/eevdf branch, this is controlled by two sched features:
> >> RESPECT_SLICE: Inhibit preemption until the current task has exhausted its slice.
> >> RUN_TO_PARITY: Relax RESPECT_SLICE and only protect current until 0-lag.
> >> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e04f5454d68590a239092a700e9bbaf84270397c
> >>
> >> Maybe something like this can achieve your goal:
> >> if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
> >> 	resched_curr(rq_of(cfs_rq));
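
Spelled out against the entity_tick() hunk from the original patch, that
untested sketch would look roughly like the below; the feature naming follows
the sched/eevdf branch linked above and may differ in other trees:

@@ -5523,6 +5523,10 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	    hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
 		return;
 #endif
+
+	/* current is past 0-lag, stop protecting it */
+	if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
+		resched_curr(rq_of(cfs_rq));
 }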
> >>
> >>>
> >>> @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >>>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> >>>  		return;
> >>>
> >>> +	if (!entity_eligible(cfs_rq, se))
> >>> +		goto preempt;
> >>> +
> >>
> >> Not sure if this is applicable: later in this function, pick_eevdf()
> >> already checks whether current is eligible via !entity_eligible(cfs_rq,
> >> curr), and if not, curr will be evicted. Also, this change does not
> >> consider the cgroup hierarchy.
>
> The above line will be referred to as [1] below.
>
> >>
> >> Besides, the check of current's eligibility can give a false negative
> >> result if the enqueued entity has a positive lag. Prateek proposed to
> >> remove the check of current's eligibility in pick_eevdf():
> >> https://lore.kernel.org/lkml/20240325060226.1540-2-kprateek.nayak@amd.com/
> >
> > Thank you for letting me know about Peter's latest updates and thoughts.
> > Actually, the original intention of my modification was to minimize the
> > traversal of the rb-tree as much as possible. For example, in the following
> > scenario, if 'curr' is ineligible, the system would still traverse the rb-tree in
> > 'pick_eevdf' to return an optimal 'se', and then trigger 'resched_curr'. After
> > the resched, the scheduler calls 'pick_eevdf' again, traversing the
> > rb-tree once more. This ultimately results in the rb-tree being traversed
> > twice. If it were possible to determine that 'curr' is ineligible within
> > 'wakeup_preempt' and directly trigger a resched, it would save one traversal
> > of the rb-tree.
> >
> >
> > wakeup_preempt -> pick_eevdf -> resched_curr
> >                       |-> 'traverse the rb-tree'    |
> >                                                 schedule -> pick_eevdf
> >                                                                 |-> 'traverse the rb-tree'
>
> I see what you mean but a couple of things:
>
> (I'm adding the check_preempt_wakeup_fair() hunk from the original patch
> below for ease of interpretation)
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 03be0d1330a6..a0005d240db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> >  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
> >  		return;
> >
> > +	if (!entity_eligible(cfs_rq, se))
> > +		goto preempt;
> > +
>
> This check uses the root cfs_rq since "task_cfs_rq()" returns the
> "rq->cfs" of the runqueue the task is on. In presence of cgroups or
> CONFIG_SCHED_AUTOGROUP, there is a good chance this the task is queued
> on a higher order cfs_rq and this entity_eligible() calculation might
> not be valid since the vruntime calculation for the "se" is relative to
> the "cfs_rq" where it is queued on. Please correct me if I'm wrong but
> I believe that is what Chenyu was referring to in [1].
>
Sorry for the late reply, and thanks for helping clarify this. Yes, this is
what my previous concern was:
1. It does not consider the cgroup hierarchy and does not check preemption at
the same level, which is what find_matching_se() covers.
2. The if (!entity_eligible(cfs_rq, se)) check for current is redundant because
pick_eevdf() will check the eligibility of current later anyway. But
as pointed out by Chunxin, his concern is the double traversal of the rb-tree.
I just wonder if we could leverage cfs_rq->next to store the next
candidate, so it can be picked directly in the 2nd pick as a fast path?
Something like the below (untested):
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..f716646d595e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8349,7 +8349,7 @@ static void set_next_buddy(struct sched_entity *se)
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	struct task_struct *curr = rq->curr;
-	struct sched_entity *se = &curr->se, *pse = &p->se;
+	struct sched_entity *se = &curr->se, *pse = &p->se, *next;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int cse_is_idle, pse_is_idle;
 
@@ -8415,7 +8415,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
-	if (pick_eevdf(cfs_rq) == pse)
+	next = pick_eevdf(cfs_rq);
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && next)
+		set_next_buddy(next);
+
+	if (next == pse)
 		goto preempt;
 
 	return;
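
For the consumer side, pick_next_entity() already has a NEXT_BUDDY fast
path that would honor this hint on the second pick. Quoting from memory
(so possibly not verbatim for any particular tree), it is roughly:

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
	/*
	 * Enabling NEXT_BUDDY will affect latency but not fairness.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}

That way the pick in schedule() can return cfs_rq->next without walking
the rb-tree again, as long as the stored candidate is still eligible.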
thanks,
Chenyu
> >  	find_matching_se(&se, &pse);
> >  	WARN_ON_ONCE(!pse);
> >
> > --
>
> In addition to that, there is an update_curr() call below for the first
> cfs_rq where both entities' hierarchies are queued, which is found by
> find_matching_se(). I believe that is required too, to update the
> vruntime and deadline of the entity where preemption can happen.
>
> If you want to circumvent a second call to pick_eevdf(), could you
> perhaps do:
>
> (Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..653b1bee1e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8407,9 +8407,13 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>  	update_curr(cfs_rq);
> 
>  	/*
> -	 * XXX pick_eevdf(cfs_rq) != se ?
> +	 * If the hierarchy of current task is ineligible at the common
> +	 * point on the newly woken entity, there is a good chance of
> +	 * wakeup preemption by the newly woken entity. Mark for resched
> +	 * and allow pick_eevdf() in schedule() to judge which task to
> +	 * run next.
>  	 */
> -	if (pick_eevdf(cfs_rq) == pse)
> +	if (!entity_eligible(cfs_rq, se))
>  		goto preempt;
> 
>  	return;
>
> --
>
> There are other implications here, which are specifically highlighted by
> the "XXX pick_eevdf(cfs_rq) != se ?" comment. If the newly woken
> entity is not the entity with the earliest eligible virtual deadline,
> the current task is still preempted if any other entity has the EEVD.
>
> Mike's box gave switching to above two thumbs up; I have to check what
> my box says :)
>
> Following are DeathStarBench results with your original patch compared
> to v6.9-rc5 based tip:sched/core:
>
> ==================================================================
> Test : DeathStarBench
> Why? : Some tasks here do not like aggressive preemption
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : Mean
> ==================================================================
> Pinning        scaling     tip      eager_preempt (pct imp)
> 1CCD              1        1.00     0.99 (%diff: -1.13%)
> 2CCD              2        1.00     0.97 (%diff: -3.21%)
> 4CCD              3        1.00     0.97 (%diff: -3.41%)
> 8CCD              6        1.00     0.97 (%diff: -3.20%)
> --
>
> I'll give the variants mentioned in the thread a try too, to see if
> some of my assumptions around heavy preemption hold. I was also
> able to dig up an old patch by Balakumaran Kannan which skipped
> pick_eevdf() altogether if "pse" is ineligible. That also seems like
> a good optimization based on the current check in
> check_preempt_wakeup_fair(), but it perhaps doesn't help the
> wakeup-latency sensitivity you are optimizing for; it only avoids the
> rb-tree traversal when there is no chance of pick_eevdf() returning "pse".
> https://lore.kernel.org/lkml/20240301130100.267727-1-kumaran.4353@gmail.com/
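
For reference, that skip would amount to something like the following
untested fragment ahead of the existing check in check_preempt_wakeup_fair()
(a paraphrase of the idea, not the referenced patch):

	/*
	 * If the newly woken entity is itself ineligible on this cfs_rq,
	 * pick_eevdf() cannot return it, so skip the rb-tree walk.
	 */
	if (!entity_eligible(cfs_rq, pse))
		return;

	if (pick_eevdf(cfs_rq) == pse)
		goto preempt;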
>
> --
> Thanks and Regards,
> Prateek
>
> >
> >
> > Of course, this would break the semantics of RESPECT_SLICE as well as
> > RUN_TO_PARITY. So, this might be considered a performance enhancement
> > for scenarios without NO_RESPECT_SLICE/NO_RUN_TO_PARITY.
> >
> > thanks
> > Chunxin
> >
> >
> >> If I understand your requirement correctly, you want to reduce the wakeup
> >> latency. There is some code under development by Peter, which could
> >> customize a task's wakeup latency via setting its slice:
> >> https://lore.kernel.org/lkml/20240405110010.934104715@infradead.org/
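
If that series ends up exposing the slice through sched_attr.sched_runtime
for SCHED_OTHER tasks, as the patches suggest, a latency-sensitive task could
request a shorter slice from userspace roughly as below. The field usage here
is an assumption about the in-development interface, not a settled ABI:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal sched_attr layout; the kernel only reads 'size' bytes. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* assumed: requested slice, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static int request_short_slice(uint64_t slice_ns)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;		/* SCHED_OTHER */
	attr.sched_runtime = slice_ns;	/* e.g. 100000 for a 0.1ms slice */

	/* pid 0 means the calling thread */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}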
> >>
> >> thanks,
> >> Chenyu
>