lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 26 Nov 2020 17:35:45 +0800
From:   "Li, Aubrey" <>
To:     Vincent Guittot <>
Cc:     Ingo Molnar <>,
        Peter Zijlstra <>,
        Juri Lelli <>,
        Mel Gorman <>,
        Valentin Schneider <>,
        Qais Yousef <>,
        Dietmar Eggemann <>,
        Steven Rostedt <>,
        Ben Segall <>,
        Tim Chen <>,
        linux-kernel <>,
        Mel Gorman <>, Jiang Biao <>
Subject: Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for
 task wakeup

On 2020/11/26 16:14, Vincent Guittot wrote:
> On Wed, 25 Nov 2020 at 14:37, Li, Aubrey <> wrote:
>> On 2020/11/25 16:31, Vincent Guittot wrote:
>>> On Wed, 25 Nov 2020 at 03:03, Li, Aubrey <> wrote:
>>>> On 2020/11/25 1:01, Vincent Guittot wrote:
>>>>> Hi Aubrey,
>>>>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit :
>>>>>> Hi Vincent,
>>>>>> On 2020/11/23 17:27, Vincent Guittot wrote:
>>>>>>> Hi Aubrey,
>>>>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li <> wrote:
>>>>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU
>>>>>>>> enters idle, if the idle driver indicates to stop tick, this CPU
>>>>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU
>>>>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler
>>>>>>>> tick to ratelimit idle cpumask update.
>>>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
>>>>>>>> has low cost than scanning all the cpus in last level cache domain,
>>>>>>>> especially when the system is heavily loaded.
>>>>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per
>>>>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and
>>>>>>>> schbench have no notable change, uperf has:
>>>>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90
>>>>>>>>   threads       baseline-avg    %std    patch-avg       %std
>>>>>>>>   96            1               0.83    1.23            3.27
>>>>>>>>   144           1               1.03    1.67            2.67
>>>>>>>>   192           1               0.69    1.81            3.59
>>>>>>>>   240           1               2.84    1.51            2.67
>>>>>>>> v4->v5:
>>>>>>>> - add update_idle_cpumask for s2idle case
>>>>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_
>>>>>>>>   idle_cpumask() everywhere
>>>>>>>> v3->v4:
>>>>>>>> - change setting idle cpumask from every idle entry to tickless idle
>>>>>>>>   if cpu driver is available.
>>>>>>> Could you remind me why you did this change ? Clearing the cpumask is
>>>>>>> done during the tick to rate limit the number of updates of the
>>>>>>> cpumask but It's not clear for me why you have associated the set with
>>>>>>> the tick stop condition too.
>>>>>> I found the current implementation has better performance at a more
>>>>>> suitable load range.
>>>>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler
>>>>>> tick) to shrink idle cpumask when the system is busy, but
>>>>> I'm ok with the part above
>>>>>> - Setting the idle mask everytime the cpu enters idle requires a much
>>>>>> heavier load level to preserve the idle cpumask(not call into idle),
>>>>>> otherwise the bits cleared in scheduler tick will be restored when the
>>>>>> cpu enters idle. That is, idle cpumask is almost equal to the domain
>>>>>> cpumask during task wakeup if the system load is not heavy enough.
>>>>> But setting the idle cpumask is useful because it helps to select an idle
>>>>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO,
>>>>> the idle cpu mask is useful in heavy cases because a system, which is
>>>>> already fully busy with work, doesn't want to waste time looking for an
>>>>> idle cpu that doesn't exist.
>>>> Yes, this is what v3 does.
>>>>> But if there is an idle cpu, we should still looks for it.
>>>> IMHO, this is a potential opportunity can be improved. The idle cpu could be
>>>> in different idle state, the idle duration could be long or could be very short.
>>>> For example, if there are two idle cpus:
>>>> - CPU1 is very busy, the pattern is 50us idle and 950us work.
>>>> - CPU2 is in idle for a tick length and wake up to do the regular work
>>>> If both added to the idle cpumask, we want the latter one, or we can just add
>>>> the later one into the idle cpumask. That's why I want to associate tick stop
>>>> signal with it.
>>>>>> - Associating with tick stop tolerates idle to preserve the idle cpumask
>>>>>> but only short idle, which causes tick retains. This is more fitable for
>>>>>> the real workload.
>>>>> I don't agree with this and real use cases with interaction will probably
>>>>> not agree as well as they want to run on an idle cpu if any but not wait
>>>>> on an already busy one.
>>>> The problem is scan overhead, scanning idle cpu need time. If an idle cpu
>>>> is in the short idle mode, it's very likely that when it's picked up for a
>>>> wakeup task, it goes back to work again, and the wakeup task has to wait too,
>>>> maybe longer because the running task just starts.
>>>> One benefit of waiting on the previous one is warm cache.
>>>>> Also keep in mind that a tick can be up to 10ms long
>>>> Right, but the point here is, if this 10ms tick retains, the CPU should be
>>>> in the short idle mode.
>>> But 10, 4 or even 1ms is quite long for a system and that's even more
>>> true compared to scanning the idle cpu mask
>>>>>>> This change means that a cpu will not be part of the idle mask if the
>>>>>>> tick is not stopped. On some arm/arm64 platforms, the tick stops only
>>>>>>> if the idle duration is expected to be higher than 1-2ms which starts
>>>>>>> to be significantly long. Also, the cpuidle governor can easily
>>>>>>> mis-predict a short idle duration whereas it will be finally a long
>>>>>>> idle duration; In this case, the next tick will correct the situation
>>>>>>> and select a deeper state, but this can happen up to 4ms later on
>>>>>>> arm/arm64.
>>>>>> Yes this is intented. If the tick is not stopped, that indicates the
>>>>>> CPU is very busy, cpu idle governor selected the polling idle state, and/or
>>>>>> the expected idle duration is shorter than the tick period length. For
>>>>> As mentioned above a tick can be up to 10ms long which is not a short idle
>>>>> duration.
>>>> Usually when the tick retains, the CPU is in the short idle mode or even polling
>>>> instead of idle.
>>> Also keep in mind that cpuidle can select a shallow state and retains
>>> tick because of the wake up latency constraint and not the idle
>>> duration. So you can't really make the assumption that retaining tick
>>> means short idle duration
>> idle governor has short idle information, probably can let idle governor
>> expose a short idle indicator?
>>>>> Then the governor also mispredicts the idle duration and this is one
>>>>> reason that the tick is not stopped because it will give the opportunity
>>>>> to reevaluate the idle state in case of misprediction.
>>>> We always predict the next state based on the past states, so misprediction
>>>> does happen. This is not what this patch is trying to solve. I'm certainly
>>> My point here was to say that one original goal of cpuidle for
>>> retaining the tick was to handle case where the governor mispredicts a
>>> short idle time. Retaining the tick prevents the cpu to stay too long
>>> in this shallow idle state and to waste power which seems to happen
>>> often enough to be raised by people
>> I see, thanks!
>>>> open if there is a better signal instead of stop_tick from idle governor.
>>>>>> example, uperf enters and exits 80 times between two ticks when utilizes
>>>>>> 100% CPU, and the average idle residency < 50us.
>>>>> But scheduler looks for idle state of prev cpu before looping the idle cpu
>>>>> mask so i'm not sure that uperf is impacted in this case because scheduler
>>>>> will select prev cpu before loop idle cpu mask.
>>>>>> If this CPU is added to idle cpumask, the wakeup task likely needs to
>>>>>> wait in the runqueue as this CPU will run its current task very soon.
>>>>>>> So I would prefer to keep trying to set the idle mask everytime the
>>>>>>> cpu enters idle. If a tick has not happened between 2 idle phases, the
>>>>>>> cpumask will not be updated and the overhead will be mostly testing if
>>>>>>> (rq->last_idle_state == idle_state).
>>>>>> Not sure if I addressed your concern, did you see any workloads any cases
>>>>>> v4 performs better than v5?
>>>>> Yes, I see some perf regression on my octo arm64 system for hackbench with
>>>>> only 1 group (and for few ther ones but it's less obvious). There is no
>>>>> perf impact with more groups most probably because the cpus are no more idle.
>>>>> The regression happens even if the shallowest idle state is the only one to
>>>>> be enabled.
>>>> Thanks for the data.
>>>>> - 2 x 4 cores arm64 system
>>>>> 12 iterations of hackbench -l (256000/#grp) -g #grp
>>>>> Only the shallowest state enabled
>>>>> (as a sidenote, we don't have polling mode on arm64)
>>>> Okay, this might be the cause of the difference between yours and mine. So do you
>>>> think if it makes sense to let idle governor to return a polling flag and associate
>>>> it with idle cpumask update instead of stop_tick? A CPU is idle but actually polling
>>>> may not be suitable for the wake up target.
>>> I don't know much about polling but can't this mode be used up to a tick too ?
>>> I think so. So short idle need a definition. I'm not sure if it's a good idea to define
>> the short idle as a tunable and default set it to tick >> 2?
> I have never been fond of heuristic like tick << 2 or sys tunable
> TBH, I'm not sure that using the tick is a good idea. And such kind of
> parameter need more thought
>> Updating idle cpumask everytime cpu enters idle works for me, as we have state change
>> check, so we won't actually update idle cpumask everytime the cpu enters idle.
> Yes, In this case, the overhead stays reasonable and is limited to the
> test of a rq->last_idle_state
> This will benefit heavy use a case by reducing the scanning time  and
> will not regress other use case.
>> But I'm still willing to exclude short idle case, what do you think?
> something similar to patch v3 or patch v5 + my changes seems to be a
> good 1st step that will benefit heavy use cases with regressing other
> ones.
> Trying to exclude short idle case will need more thoughts and changes
> especially about to how to get this information and if it is reliable.

okay, I'll post a v6 with v5 + your change below after data measurement.
May I add you a signed-off-by to the patch?


 kernel/sched/idle.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index a38d8822ce0d..ca32197778b0 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -156,6 +156,7 @@ static void cpuidle_idle_call(void)
+	update_idle_cpumask(this_rq(), true);
 	 * The RCU framework needs to be told that we are entering an idle
 	 * section, so no more rcu read side critical sections and one more
@@ -163,7 +164,6 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_not_available(drv, dev)) {
-		update_idle_cpumask(this_rq(), true);
@@ -194,7 +194,6 @@ static void cpuidle_idle_call(void)
 			max_latency_ns = dev->forced_idle_latency_limit_ns;
-		update_idle_cpumask(this_rq(), true);
 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
@@ -208,7 +207,6 @@ static void cpuidle_idle_call(void)
 		next_state = cpuidle_select(drv, dev, &stop_tick);
 		if (stop_tick || tick_nohz_tick_stopped()) {
-			update_idle_cpumask(this_rq(), true);
 		} else {

Powered by blists - more mailing lists