linux-kernel - Re: [PATCH] sched/fair: Reschedule the cfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <bb43844e-0ef2-44d6-9d98-496865d942b9@126.com>
Date: Mon, 3 Jun 2024 10:55:40 +0800
From: Honglei Wang <jameshongleiwang@....com>
To: Chunxin Zang <spring.cxz@...il.com>
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
 linux-kernel@...r.kernel.org, Chen Yu <yu.c.chen@...el.com>,
 yangchen11@...iang.com, Jerry Zhou <zhouchunhua@...iang.com>,
 Chunxin Zang <zangchunxin@...iang.com>, mingo@...hat.com,
 Peter Zijlstra <peterz@...radead.org>, juri.lelli@...hat.com,
 vincent.guittot@...aro.org
Subject: Re: [PATCH] sched/fair: Reschedule the cfs_rq when current is
 ineligible



On 2024/5/29 22:31, Chunxin Zang wrote:
> 
> 
>> On May 25, 2024, at 19:48, Honglei Wang <jameshongleiwang@....com> wrote:
>>
>>
>>
>> On 2024/5/24 21:40, Chunxin Zang wrote:
>>> I found that some tasks have been running for a long enough time and
>>> have become illegal, but they are still not releasing the CPU. This
>>> will increase the scheduling delay of other processes. Therefore, I
>>> tried checking the current process in wakeup_preempt and entity_tick,
>>> and if it is illegal, reschedule that cfs queue.
>>> The modification can reduce the scheduling delay by about 30% when
>>> RUN_TO_PARITY is enabled.
>>> So far, it has been running well in my test environment, and I have
>>> pasted some test results below.
>>> I isolated four cores for testing. I ran Hackbench in the background
>>> and observed the test results of cyclictest.
>>> hackbench -g 4 -l 100000000 &
>>> cyclictest --mlockall -D 5m -q
>>>                                   EEVDF      PATCH  EEVDF-NO_PARITY  PATCH-NO_PARITY
>>>                  # Min Latencies: 00006      00006      00006      00006
>>>    LNICE(-19)    # Avg Latencies: 00191      00122      00089      00066
>>>                  # Max Latencies: 15442      07648      14133      07713
>>>                  # Min Latencies: 00006      00010      00006      00006
>>>    LNICE(0)      # Avg Latencies: 00466      00277      00289      00257
>>>                  # Max Latencies: 38917      32391      32665      17710
>>>                  # Min Latencies: 00019      00053      00010      00013
>>>    LNICE(19)     # Avg Latencies: 37151      31045      18293      23035
>>>                  # Max Latencies: 2688299    7031295    426196     425708
>>> I'm actually a bit hesitant about placing this modification under the
>>> NO_PARITY feature. This is because the modification conflicts with the
>>> semantics of RUN_TO_PARITY. So, I captured and compared the number of
>>> resched occurrences in wakeup_preempt to see if it introduced any
>>> additional overhead.
>>> Similarly, hackbench is used to stress the utilization of four cores to
>>> 100%, and the method for capturing the number of PREEMPT occurrences is
>>> referenced from [1].
>>> schedstats                          EEVDF       PATCH   EEVDF-NO_PARITY  PATCH-NO_PARITY  CFS(6.5)
>>> stats.check_preempt_count          5053054     5057286    5003806    5018589    5031908
>>> stats.patch_cause_preempt_count    -------     858044     -------    765726     -------
>>> stats.need_preempt_count           570520      858684     3380513    3426977    1140821
>>>  From the above test results, there is a slight increase in the number of
>>> resched occurrences in wakeup_preempt. However, the results vary with each
>>> test, and sometimes the difference is not that significant. But overall,
>>> the count of reschedules remains lower than that of CFS and is much less
>>> than that of NO_PARITY.
>>> [1]: https://lore.kernel.org/all/20230816134059.GC982867@hirez.programming.kicks-ass.net/T/#m52057282ceb6203318be1ce9f835363de3bef5cb
>>> Signed-off-by: Chunxin Zang <zangchunxin@...iang.com>
>>> Reviewed-by: Chen Yang <yangchen11@...iang.com>
>>> ---
>>>   kernel/sched/fair.c | 6 ++++++
>>>   1 file changed, 6 insertions(+)
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 03be0d1330a6..a0005d240db5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5523,6 +5523,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>>   			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>>>   		return;
>>>   #endif
>>> +
>>> +	if (!entity_eligible(cfs_rq, curr))
>>> +		resched_curr(rq_of(cfs_rq));
>>>   }
>>>     @@ -8325,6 +8328,9 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>>>   	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>>>   		return;
>>>   +	if (!entity_eligible(cfs_rq, se))
>>> +		goto preempt;
>>> +
>>>   	find_matching_se(&se, &pse);
>>>   	WARN_ON_ONCE(!pse);
>>>   
>> Hi Chunxin,
>>
>> Did you run a comparative test to see which modification is more helpful on improve the latency? Modification at tick point makes more sense to me. But, seems just resched arbitrarily in wakeup might introduce too much preemption (and maybe more context switch?) in complex environment such as cgroup hierarchy.
>>
>> Thanks,
>> Honglei
> 
> Hi Honglei
> 
> I attempted to build a slightly more complex scenario. It consists of 4 isolated cores,
> 4 groups of hackbench (160 processes in total) to stress the CPU, and 1 cyclictest
> process to test scheduling latency. Using cgroup v2, to created 64 cgroup leaf nodes
> in a binary tree structure (with a depth of 7). I then evenly distributed the aforementioned
> 161 processes across the 64 cgroups respectively, and observed the scheduling delay
> performance of cyclictest.
> 
> Unfortunately, the test results were very fluctuating, and the two sets of data were very
> close to each other. I suspect that it might be due to too few processes being distributed
> in each cgroup, which led to the logic for determining ineligible always succeeding and
> following the original logic. Later, I will attempt more tests to verify the impact of these
> modifications in scenarios involving multiple cgroups.
> 

Sorry to lately replay, I was a bit busy last week. How's the test going 
on? What about run some workload processes who spend more time in 
kernel? Maybe it's worth do give a try, but it depends on your test plan.

Thanks,
Honglei

> thanks
> Chunxin
> 
>