Date: Wed, 24 Jan 2024 13:56:31 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>, linux-kernel@...r.kernel.org,
 mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
 gautham.shenoy@....com, David Vernet <void@...ifault.com>
Subject: Re: [PATCH] sched/fair: Skip newidle_balance() when an idle CPU is
 woken up to process an IPI

Hello Vincent,

On 1/23/2024 7:09 PM, Vincent Guittot wrote:
> On Tue, 23 Jan 2024 at 11:01, K Prateek Nayak <kprateek.nayak@....com> wrote:
>>
>> Hello Vincent,
>>
>> On 1/23/2024 1:36 PM, Vincent Guittot wrote:
>>> On Tue, 23 Jan 2024 at 05:58, K Prateek Nayak <kprateek.nayak@....com> wrote:
>>>>
>>>> Hello Tim,
>>>>
>>>> On 1/23/2024 3:29 AM, Tim Chen wrote:
>>>>> On Fri, 2024-01-19 at 14:15 +0530, K Prateek Nayak wrote:
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index b803030c3a03..1fedc7e29c98 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -8499,6 +8499,16 @@ done: __maybe_unused;
>>>>>>      if (!rf)
>>>>>>              return NULL;
>>>>>>
>>>>>> +    /*
>>>>>> +     * An idle CPU in TIF_POLLING mode might end up here after processing
>>>>>> +     * an IPI when the sender sets the TIF_NEED_RESCHED bit and avoids
>>>>>> +     * sending an actual IPI. In such cases, where an idle CPU was woken
>>>>>> +     * up only to process an interrupt, without necessarily queuing a task
>>>>>> +     * on it, skip newidle_balance() to facilitate faster idle re-entry.
>>>>>> +     */
>>>>>> +    if (prev == rq->idle)
>>>>>> +            return NULL;
>>>>>> +
>>>>>
>>>>> Should we check the call function queue directly to detect that there is
>>>>> an IPI waiting to be processed? something like
>>>>>
>>>>>       if (!llist_empty(&per_cpu(call_single_queue, rq->cpu)))
>>>>>               return NULL;
>>>>
>>>> That could be a valid check too. However, if an IPI is queued right
>>>> after this check, the processing is still delayed since
>>>> newidle_balance() only bails out for scenarios when a wakeup is trying
>>>> to queue a new task on the CPU running the newidle_balance().
>>>>
>>>>>
>>>>> Could there be cases where we want to do idle balance in this code path?
>>>>> Say a cpu is idle and a scheduling tick came in, we may try
>>>>> to look for something to run on the idle cpu.  Seems like after
>>>>> your change above, that would be skipped.
>>>>
>>>> Wouldn't scheduler_tick() do load balancing when the time comes? In my
>>>> testing, I did not see a case where the workloads I tested were
>>>> sensitive to the aspect of newidle_balance() being invoked at scheduler
>>>> tick. Have you come across a workload which might be sensitive to this
>>>> aspect that I can quickly test and verify? Meanwhile, I'll run the
>>>> workloads mentioned in the commit log on an Intel system to see if I
>>>> can spot any sensitivity to this change.
>>>
>>> Instead of trying to fix spurious need_resched in the scheduler,
>>> can't we find a way to prevent it from happening ?
>>
>> The need_resched is not spurious. It is an effect of the optimization
>> introduced by commit b2a02fc43a1f ("smp: Optimize
>> send_call_function_single_ipi()") where, to pull a CPU in TIF_POLLING
>> out of idle (and this happens for C0 (POLL) and C1 (MWAIT) on the test
>> machine), instead of sending an IPI for smp_call_function_single(), the
>> sender sets the TIF_NEED_RESCHED flag in the idle task's thread info,
>> and on the path to schedule_idle(), the call to
>> flush_smp_call_function_queue() processes the function call.
> 
> I mean it's spurious in the sense that we don't need to resched but we
> need to pull the CPU out of the polling loop. At that time we don't
> know if there is a need to resched

Agreed.

> 
>>
>> But since "TIF_NEED_RESCHED" was set to pull the CPU out of idle, the
>> scheduler now believes a new task exists which leads to the following
>> call stack:
> 
> Exactly, TIF_NEED_RESCHED has been set so scheduler now believes it
> needs to look for a task. The solution is to not set TIF_NEED_RESCHED
> if you don't want the scheduler to look for a task including pulling
> it from another cpu
> 
>>
>>   do_idle()
>>     schedule_idle()
>>       __schedule(SM_NONE)
>>         /* local_irq_disable() */
>>         pick_next_task()
>>           __pick_next_task()
>>             pick_next_task_fair()
>>               newidle_balance()
>>               ... /* Still running with IRQs disabled */
>>
>> Since IRQs are disabled, the processing of IPIs is delayed, leading to
>> an issue similar to the one outlined in commit 792b9f65a568 ("sched:
>> Allow newidle balancing to bail out of load_balance") when benchmarking
>> ipistorm.
> 
> IMO it's not the same because commit 792b9f65a568 wants to abort early
> if something new happened

In case of commit 792b9f65a568, newidle_balance() is cut short since a
pending IPI wants to queue a task, which would otherwise have led to a
spike in tail latency. For ipistorm, there is a pending IPI which is not
queuing a task, but since smp_call_function_single() was invoked with
wait=1, the sender will wait until the target enables interrupts again
and processes the call function, which manifests as a spike in tail
latency.

> 
>>
>>>
>>> Because of TIF_NEED_RESCHED being set when TIF_POLLING is set, idle
>>> load balances are already skipped for a less aggressive newly idle
>>> load balanced:
>>> https://lore.kernel.org/all/CAKfTPtC9Px_W84YRJqnFNkL8oofO15D-P=VTCMUUu7NJr+xwBA@mail.gmail.com/
>>
>> Are you referring to the "need_resched()" condition check in
>> "nohz_csd_func()"? Please correct me if I'm wrong.
> 
> yes
> 
>>
>> When I ran sched-scoreboard
>> (https://github.com/AMDESE/sched-scoreboard/) with the patch on an idle
>> system for 60s, I saw the idle "load_balance count" go up in schedstat
> 
> If TIF_POLLING is not set, you will use a normal IPI but otherwise, the
> wakeup for an idle load balance is skipped because need_resched is set,
> and we have a newly idle load balance which you now want to skip too

I see! You are right.

> 
>>
>> Following are the data for idle balance on SMT domain for each kernel:
>>
>> o tip:sched/core
>>
>>   < ----------------------------------------  Category:  idle ----------- >
>>   load_balance count on cpu idle                             :       2678
>>   load_balance found balanced on cpu idle                    :       2678
>>     ->load_balance failed to find busier queue on cpu idle   :          0
>>     ->load_balance failed to find busier group on cpu idle   :       2678
>>   load_balance move task failed on cpu idle                  :          0
>>   *load_balance success count on cpu idle                    :          0
>>   imbalance sum on cpu idle                                  :          0
>>   pull_task count on cpu idle                                :          0
>>   *avg task pulled per successfull lb attempt (cpu idle)     :          0
>>     ->pull_task when target task was cache-hot on cpu idle   :          0
>>   -------------------------------------------------------------------------
>>
>> o tip:sched/core + patch
>>
>>   < ----------------------------------------  Category:  idle ----------- >
>>   load_balance count on cpu idle                             :       1895
>>   load_balance found balanced on cpu idle                    :       1895
>>     ->load_balance failed to find busier queue on cpu idle   :          0
>>     ->load_balance failed to find busier group on cpu idle   :       1895
>>   load_balance move task failed on cpu idle                  :          0
>>   *load_balance success count on cpu idle                    :          0
>>   imbalance sum on cpu idle                                  :          0
>>   pull_task count on cpu idle                                :          0
>>   *avg task pulled per successfull lb attempt (cpu idle)     :          0
>>     ->pull_task when target task was cache-hot on cpu idle   :          0
>>   -------------------------------------------------------------------------
>>
>> Am I missing something? Since "load_balance count" is only updated when
>> "load_balance()" is called.
>>
>>>
>>> The root of the problem is that we keep TIF_NEED_RESCHED set
>>
>> We had prototyped a TIF_NEED_IPI flag to skip calls to schedule_idle()
>> on CPUs in TIF_POLLING when the idle CPU has to only process an IPI.
>> Although it solves the problem described in the commit log, it also
>> required enabling and testing it on multiple architectures.
> 
> Yes, but that's the right solution IMO and it will prevent us from then
> trying to catch the needless TIF_NEED_RESCHED

I'll discuss with Gautham about cleaning up the prototype and testing it
some more (we had only looked at the ipistorm case) before sending out
an RFC.

> [..snip..]

--
Thanks and Regards,
Prateek
