Message-ID: <Y7vVqE0M/AoDoVbj@chenyu5-mobl1>
Date: Mon, 9 Jan 2023 16:51:52 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Connor O'Brien <connoro@...gle.com>
CC: <linux-kernel@...r.kernel.org>, <kernel-team@...roid.com>,
John Stultz <jstultz@...gle.com>,
Joel Fernandes <joelaf@...gle.com>,
Qais Yousef <qais.yousef@....com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
"Mel Gorman" <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
"Paul E . McKenney" <paulmck@...nel.org>
Subject: Re: [RFC PATCH 07/11] sched: Add proxy execution
On 2022-10-03 at 21:44:57 +0000, Connor O'Brien wrote:
> From: Peter Zijlstra <peterz@...radead.org>
>
> A task currently holding a mutex (executing a critical section) might
> find benefit in using the scheduling contexts of other tasks blocked on the
> same mutex if they happen to have higher priority than the current owner
> (e.g., to prevent priority inversions).
>
> Proxy execution lets a task do exactly that: if a mutex owner has
> waiters, it can use waiters' scheduling context to potentially continue
> running if preempted.
>
> The basic mechanism is implemented by this patch, the core of which
> resides in the proxy() function. Potential proxies (i.e., tasks blocked
> on a mutex) are not dequeued, so, if one of them is actually selected by
> schedule() as the next task to run on a CPU, proxy() is used
> to walk the blocked_on relation and find which task (mutex owner) might
> be able to use the proxy's scheduling context.
>
> Here come the tricky bits. In fact, the owner task might be in all sorts of
> states when a proxy is found (blocked, executing on a different CPU,
> etc.). Details on how the different situations are handled can be found in
> the proxy() code comments.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> [rebased, added comments and changelog]
> Signed-off-by: Juri Lelli <juri.lelli@...hat.com>
> [Fixed rebase conflicts]
> [squashed sched: Ensure blocked_on is always guarded by blocked_lock]
> Signed-off-by: Valentin Schneider <valentin.schneider@....com>
> [fix rebase conflicts, various fixes & tweaks commented inline]
> [squashed sched: Use rq->curr vs rq->proxy checks]
> Signed-off-by: Connor O'Brien <connoro@...gle.com>
> Link: https://lore.kernel.org/all/20181009092434.26221-6-juri.lelli@redhat.com/
>
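If I read the patch correctly, the selection step is roughly the following
(illustrative pseudo-C; the names below are not the actual proxy()
implementation in this patch):

	/*
	 * schedule() picks the highest-priority entity as usual; if that
	 * task is blocked on a mutex, walk the blocked_on chain and run
	 * the lock owner on the waiter's scheduling context instead.
	 */
	next = pick_next_task(rq);		/* may be a blocked waiter  */
	while (next->blocked_on) {		/* waiter -> mutex          */
		owner = mutex_owner(next->blocked_on);
		if (!owner)
			break;			/* lock released meanwhile  */
		next = owner;			/* owner borrows the waiter's
						 * scheduling context        */
	}
	/* 'next' is the task that actually runs on this CPU */
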
While evaluating Linux scalability on a high-core-count system, we found
that hackbench pipe mode is sensitive to mutex_lock(), so we looked at this
patch set to see whether it makes any difference.
Tested on a system with 112 online CPUs. hackbench launches 8 groups, each
with 14 pairs of senders/receivers. Every sender sends data via a pipe to each
receiver in the same group, so there is a 14 * 14 producer/consumer matrix in
every group, and 14 * 14 * 8 concurrent reads/writes in total.
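For reference, the test is roughly equivalent to the following invocation
(option names assumed from the rt-tests hackbench; our harness may differ):

	hackbench -g 8 -f 14 --pipe -l <loops>
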
With the proxy execution patches applied, we observed some throughput
regression. According to perf profiling, the system has higher idle time
(raised from 16% to 39%).
And one hot path is:
do_idle() -> schedule_idle() -> __schedule() -> pick_next_task_fair() -> newidle_balance()
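For reference, a simplified sketch of where that chain bottoms out, abridged
from my reading of kernel/sched/fair.c (not the exact code):

	struct task_struct *
	pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
	{
		if (!sched_fair_runnable(rq))
			goto idle;	/* no runnable CFS task on this rq */
		...
	idle:
		/*
		 * Try to pull work from other runqueues; walking the sched
		 * domains here is what gets expensive on a 112-CPU machine.
		 */
		new_tasks = newidle_balance(rq, rf);
		...
	}
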
It seems that the system spends quite some time in newidle_balance(), which
is costly on a 112-CPU system. My understanding is that, if a CPU
quits idle in do_idle(), it very likely has a pending ttwu request, so
when this CPU then reaches schedule_idle()->pick_next_task_fair(),
it should find a candidate task. But unfortunately, in our case it fails to find
any runnable task and has to pick the idle task, which triggers the costly
load balance. Is this caused by a race in which someone else migrates the runnable
task to another CPU? Besides, a more general question: if the mutex owners/blockers
all have the same priority, is proxy execution suitable for this scenario?
I'd be glad to test whatever you suggest; any suggestion is appreciated.
The flamegraph of vanilla 6.0-rc7:
https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/proxy_execution/flamegraph_vanilla_6.0.0-rc7.svg
The flamegraph of vanilla 6.0-rc7 + proxy execution:
https://raw.githubusercontent.com/yu-chen-surf/schedtests/master/results/proxy_execution/flamegraph_pe_6.0.0-rc7.svg
thanks,
Chenyu