Message-ID: <CABcWv9_JgjjHqH-dKBOtwLt3njrDtsh5a172svh_b35f2BreBg@mail.gmail.com>
Date: Tue, 2 Dec 2025 00:26:01 -0600
From: Tingjia Cao <tjcao980311@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE
Hello Prateek,

Thanks for the explanation. We didn't observe a clear performance
impact in our workloads.

We originally noticed this issue because CPU selection decisions for
sync wakeups occasionally looked unexpected. To better understand the
behavior, we constructed a deterministic test that issues a sync
wakeup in the window where (1) the waker is the only runnable task,
and (2) there is a delayed task on its runqueue. In this case,
wake_affine_idle() returns the parent's CPU, which is about to go
idle, but select_idle_sibling() chooses an idle sibling instead. This
loses the chance to use the warm core, and because the idle sibling
may be running at a low frequency, the child's performance can be
affected even more.

We think it would be good to keep the sync-wakeup logic consistent
between wake_affine_idle() and select_idle_sibling().
wake_affine_idle() uses the predicate
"nr_running - cfs_h_nr_delayed(this_rq()) <= 1" for sync wakeups, but
the later select_idle_sibling() uses the predicate "nr_running <= 1".
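
To make the mismatch concrete, here is a minimal user-space sketch of
the two checks. It is not kernel code: struct rq_model, its fields, and
the function names are hypothetical stand-ins chosen for illustration,
and only the two predicates themselves are taken from the discussion
above.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a runqueue (not the kernel's struct rq). */
struct rq_model {
	int nr_running;   /* tasks still queued, including delayed-dequeue tasks */
	int h_nr_delayed; /* queued tasks that already blocked, dequeue pending  */
};

/* Predicate quoted for wake_affine_idle(): delayed tasks are discounted. */
bool sync_ok_wake_affine(const struct rq_model *rq)
{
	return rq->nr_running - rq->h_nr_delayed <= 1;
}

/* Predicate quoted for select_idle_sibling(): delayed tasks still count. */
bool sync_ok_select_idle_sibling(const struct rq_model *rq)
{
	return rq->nr_running <= 1;
}

/* True when the two checks give different answers for the same runqueue. */
bool predicates_disagree(const struct rq_model *rq)
{
	return sync_ok_wake_affine(rq) != sync_ok_select_idle_sibling(rq);
}
```

With nr_running = 2 and h_nr_delayed = 1 (the waker plus one delayed
task), the first check passes and the second fails, which is the window
our test exercises. With no delayed tasks the two checks always agree.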

Sorry for the duplicate email; I added linux-kernel to CC this time.

Best,
Tingjia

On Tue, Dec 2, 2025 at 12:19 AM Tingjia Cao <tjcao980311@...il.com> wrote:
>
> On Mon, Nov 24, 2025 at 12:07 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>>
>> Hello Tingjia,
>>
>> On 11/23/2025 9:34 AM, Tingjia Cao wrote:
>> > Recently, we encountered an issue where a sync wakeup by a kthread didn't choose the current CPU even though the waker was the only runnable task. It is caused by a conflict between the delayed dequeue feature and the select_idle_sibling function.
>> >
>> > With the DELAYED_DEQUEUE mechanism enabled, a task that goes to sleep may not be removed from the runqueue immediately. As a result, nr_running may overcount the number of runnable tasks. Inside select_idle_sibling, there is a special case for sync wakeup:
>> >
>> >     if (is_per_cpu_kthread(current) &&
>> >         in_task() &&
>> >         prev == smp_processor_id() &&
>> >         this_rq()->nr_running <= 1 &&
>> >         asym_fits_cpu(...)) {
>> >             return prev;
>> >     }
>> >
>> > For "this_rq()->nr_running <= 1": we should use the number of tasks that are actually runnable on this rq to decide whether to place the woken task on the current CPU.
>> >
>> > To fix this (patch attached), we can use the true number of runnable tasks by subtracting the delayed-dequeue count:
>> >
>> > this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1
>>
>> This is a very transient state - tasks cannot be delayed without other
>> runnable tasks at the time of dequeue and soon after the dequeue of
>> last runnable task, all the pending delayed tasks would get dequeued.
>> The window is actually very small. Does this make a difference in
>> your workload performance?
>>
>> Once all tasks are dequeued, the newidle balance should run on the CPU
>> going idle to help reduce any imbalance.
>>
>> --
>> Thanks and Regards,
>> Prateek
>>