linux-kernel - Re: [PATCH] sched/fair: select waker's cpu for wakee on sync wakeup

Message-ID: <3e248fab-9ee3-3041-f4d5-99d6313f018b@linux.alibaba.com>
Date:   Fri, 26 Aug 2022 10:43:04 +0800
From:   Peng Wang <rocking@...ux.alibaba.com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, bristot@...hat.com,
        vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: select waker's cpu for wakee on sync wakeup

On 25/08/2022 17:09, Mel Gorman wrote:
> On Thu, Aug 25, 2022 at 02:45:05PM +0800, Peng Wang wrote:
>> On 24/08/2022 16:46, Mel Gorman wrote:
>>> On Wed, Aug 24, 2022 at 12:37:50PM +0800, Peng Wang wrote:
>>>> On sync wakeup, waker is about to sleep, and if it is the only
>>>> running task, wakee can get warm data on waker's cpu.
>>>>
>>>> Unixbench, schbench, and hackbench are tested on
>>>> Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (192 logic CPUs)
>>>>
>>>> Unixbench gets a +20.7% improvement with full threads, mainly
>>>> because of the pipe-based context switch and fork tests.
>>>>
>>>> No obvious impact on schbench.
>>>>
>>>> This change harms hackbench at lower concurrency, but improves it
>>>> as concurrency increases.
>>>>
>>>
>>> Note that historically patches in this direction have been hazardous because
>>> it makes a key assumption "sync wakers always go to sleep in the near future"
>>> when the sync hint is not that reliable. Networking from a brief glance
>>> still uses sync wakeups where wakers could have a 1:N relationship between
>>> work producers and work consumers that would then stack multiple tasks on
>>> one CPU for multiple consumers. The workloads mentioned in the changelog
>>> are mostly strictly-synchronous wakeups (i.e. the waker definitely goes
>>> to sleep almost immediately) and benefit from this sort of patch but it's
>>> not necessarily a universal benefit.
>>
>> Hi, Mel
>>
>> Thanks for your clarification.
>>
>> Besides these benchmarks, I also find a similar strictly-synchronous wakeup
>> case [1].
>>
>> [1]https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1478754.html
>>
> 
> Yep, but it falls under the same heading, sometimes the caller knows it's
> a strict sync wakeup but not always.
> 
>>>
>>> Note that most of these hazards occurred *LONG* before I was paying much
>>> attention to how the scheduler behaved so I cannot state "sync is still
>>> unreliable" with absolute certainty. However, long ago there was logic
>>> that tried to track the accuracy of the sync hint that was ultimately
>>> abandoned by commit e12f31d3e5d3 ("sched: Remove avg_overlap"). AFAIK,
>>> the sync hint is still not 100% reliable and while stacking sync works
>>> for some workloads, it's likely to be a regression magnet for network
>>> intensive workloads or client/server workloads like databases where
>>> "synchronous wakeups are not always synchronous".
>>>
>> Yes, you are right. Perhaps in such a situation, a strong contract from the
>> user is a better alternative than struggling with the weak hint in the kernel.
>>
> 
> Even the kernel doesn't always know if a wakeup is really sync or not
> because it lacks valuable context and the number of tasks on the runqueue is
> insufficient if there are multiple wakeups in quick succession. At best,
> there could be two WF_SYNC hints and hope every caller gets it right
> (hint, they won't because even if it's right once, cargo cult copying
> will eventually get it wrong and there is an API explosion issue such as
> wake_up_interruptible_*). A user hint would be tricky. Core libraries
> couldn't use it because they have no idea if the linked application wants
> a strictly sync wakeup or not; a core library couldn't tell given just
> a pthread_mutex_t, for example. Even if it was true at one point in time,

OK, I get it now, thanks!

Passing more information along with a pthread_mutex_t would require too
many changes across core user-space libraries just to inform this one
kernel scheduling decision.

And the current weak sync-wakeup hint can, at most, give us a candidate
CPU in the same LLC domain.
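
To make the trade-off discussed above concrete, here is a toy user-space
model. It is not kernel code and not the patch itself: the toy_* names, the
4-CPU layout, and the "shortest runqueue" fallback are all assumptions made
up for illustration. It only shows why trusting the sync hint helps when the
waker really sleeps and hurts when a 1:N producer keeps running.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Toy per-CPU runqueue depth. Hypothetical, not a kernel structure. */
struct toy_rq {
	int nr_running;
};

/*
 * Pick a CPU for the wakee. If the sync hint is trusted and the waker is
 * the only runnable task on its CPU, reuse the waker's CPU; otherwise fall
 * back to the lowest-numbered CPU with the shortest runqueue.
 */
int toy_select_cpu(const struct toy_rq *rq, int waker_cpu, bool sync,
		   bool trust_sync)
{
	int best = 0;

	if (trust_sync && sync && rq[waker_cpu].nr_running == 1)
		return waker_cpu;

	for (int cpu = 1; cpu < TOY_NR_CPUS; cpu++)
		if (rq[cpu].nr_running < rq[best].nr_running)
			best = cpu;
	return best;
}

/*
 * A waker on CPU 0 issues n_wakees sync wakeups back to back, then sleeps
 * only if waker_sleeps is true. Returns the deepest runqueue afterwards.
 */
int toy_wake_burst(int n_wakees, bool trust_sync, bool waker_sleeps)
{
	struct toy_rq rq[TOY_NR_CPUS] = { { 1 }, { 0 }, { 0 }, { 0 } };
	int max = 0;

	for (int i = 0; i < n_wakees; i++)
		rq[toy_select_cpu(rq, 0, true, trust_sync)].nr_running++;

	if (waker_sleeps)
		rq[0].nr_running--;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (rq[cpu].nr_running > max)
			max = rq[cpu].nr_running;
	return max;
}
```

In this model, toy_wake_burst(1, true, true) ends with the wakee alone on
the waker's CPU, which is the strictly-synchronous case the benchmarks hit.
toy_wake_burst(3, true, false) leaves CPU 0 two tasks deep while CPU 3 is
still idle, because the producer never went to sleep; with trust_sync set
to false the same burst ends with one task per CPU.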

> it might not be true later if the application design changed leading to
> application bugs being blamed on the kernel for poor placement decisions.
>