Message-ID: <c6787831-f659-12cb-4954-fd13a05ed590@amd.com>
Date: Wed, 13 Dec 2023 12:07:26 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: John Stultz <jstultz@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
Joel Fernandes <joelaf@...gle.com>,
Qais Yousef <qyousef@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>,
Youssef Esmat <youssefesmat@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
"Paul E . McKenney" <paulmck@...nel.org>, kernel-team@...roid.com
Subject: Re: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority
Inheritance v6
Hello John,
I may have some data that might help you debug a potential performance
issue mentioned below.
On 11/7/2023 1:04 AM, John Stultz wrote:
> [..snip..]
>
> Performance:
> ------------
> This patch series switches mutexes to use handoff mode rather
> than optimistic spinning. This is a potential concern where locks
> are under high contention. However, earlier performance analysis
> (on both x86 and mobile devices) did not see major regressions.
> That said, Chenyu did report a regression[3], which I’ll need to
> look further into.
I too see this as the most notable regression. Some of the other
benchmarks I've tested (schbench, tbench, netperf, ycsb-mongodb,
DeathStarBench) show little to no difference when running with Proxy
Execution; however, sched-messaging sees a 10x blowup in the runtime.
(taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1)
While investigating, I captured the runqueue length when running
sched-messaging pinned to 1 CCX (CPUs 0-7,128-135 on my 3rd Generation
EPYC system) using the following bpftrace script that dumps it in CSV
format:
rqlen.bt
---
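// Print the CSV header: a timestamp column followed by one column
// per CPU of interest (CPUs 0-7 and 128-135).
BEGIN
{
	$i = 0;
	printf("[%12s],", "Timestamp");
	while ($i < 8)
	{
		printf("CPU%3d,", $i);
		$i = $i + 1;
	}
	$i = 128;
	while ($i < 136)
	{
		printf("CPU%3d,", $i);
		$i = $i + 1;
	}
	printf("\n");
}

// On every scheduler tick, record the number of runnable tasks on the
// current task's runqueue, keyed by curtask->wake_cpu.
kprobe:scheduler_tick
{
	@runqlen[curtask->wake_cpu] = curtask->se.cfs_rq->rq->nr_running;
}

// A CPU entering idle is recorded as having an empty runqueue.
tracepoint:power:cpu_idle
{
	@runqlen[curtask->wake_cpu] = 0;
}

// Emit one CSV row with the latest recorded values 50 times a second.
interval:hz:50
{
	$i = 0;
	printf("[%12lld],", elapsed);
	while ($i < 8)
	{
		printf("%3d,", @runqlen[$i]);
		$i = $i + 1;
	}
	$i = 128;
	while ($i < 136)
	{
		printf("%3d,", @runqlen[$i]);
		$i = $i + 1;
	}
	printf("\n");
}

// Clear the map on exit so bpftrace does not print it after the CSV.
END
{
	clear(@runqlen);
}
---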
I've attached the CSVs for tip (rqlen50-tip-pinned.csv) and proxy
execution (rqlen50-pe-pinned.csv) below.
The trend I see with hackbench is that the chain migration leads
to a single runqueue being completely overloaded, followed by some
amount of idling across the entire CCX and then a similar chain
appearing on a different CPU. The trace for tip shows a lot more CPUs
being utilized.
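One way to put numbers on this trend is to summarize each CSV: how many
CPUs have a non-empty runqueue per sample, the longest runqueue seen,
and how many samples show every CPU idle. The following is only a rough
sketch, assuming the column layout printed by rqlen.bt (a bracketed
timestamp, one integer per CPU, and a trailing comma on every row). The
helper name summarize_rqlen is hypothetical and not part of any
existing tool.

```python
import csv
import io


def summarize_rqlen(csv_text):
    """Summarize rqlen.bt-style CSV text.

    Returns (samples, avg_busy_cpus, peak_rq_len, all_idle_samples).
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row, if any

    samples = 0
    busy_total = 0
    peak = 0
    all_idle = 0
    for row in rows:
        # Field 0 is the "[timestamp]" column; the trailing comma on each
        # row produces an empty last field, which is filtered out here.
        lens = [int(field) for field in row[1:] if field.strip()]
        if not lens:
            continue
        samples += 1
        busy = sum(1 for n in lens if n > 0)
        busy_total += busy
        peak = max(peak, max(lens))
        if busy == 0:
            all_idle += 1

    avg_busy = busy_total / samples if samples else 0.0
    return samples, avg_busy, peak, all_idle
```

Running it over both attachments, for example
summarize_rqlen(open("rqlen50-pe-pinned.csv").read()), and comparing the
average number of busy CPUs and the peak runqueue length between tip and
proxy execution should show whether the overload-then-idle pattern
described above is visible in the data.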
Mathieu has been looking at hackbench and the effect of task migration
on the runtime, and it appears that lowering the number of migrations
improves hackbench performance significantly [1][2][3].
[1] https://lore.kernel.org/lkml/20230905072141.GA253439@ziqianlu-dell/
[2] https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
[3] https://lore.kernel.org/lkml/20231019160523.1582101-1-mathieu.desnoyers@efficios.com/
Since migration itself is not cheap, I believe the chain migration at
the current scale hampers the performance since sched-messaging
emulates a worst-case scenario for proxy-execution.
I'll update the thread once I have more information. I'll continue
testing and take a closer look at the implementation.
> I also briefly re-tested with this v5 series
> and saw some average latencies grow vs v4, suggesting the changes
> to return-migration (and extra locking) have some impact. With v6
> the extra overhead is reduced but still not as nice as v4. I’ll
> be digging more there, but my priority is still stability over
> speed at this point (it’s easier to validate correctness of
> optimizations if the baseline isn’t crashing).
>
>
> If folks find it easier to test/tinker with, this patch series
> can also be found here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
> https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6
P.S. I was using the above tree.
>
> Awhile back Sage Sharp had a nice blog post about types of
> reviews [4], and while any review and feedback would be greatly
> appreciated, those focusing on conceptual design or correctness
> issues would be *especially* valued.
I have skipped a few phases that Sage mentions in their blog but I'll
try my best to follow the order from here on forward :)
>
> Thanks so much!
> -john
>
> [1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
> [3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> [4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/
>
>
> [..snip..]
>
--
Thanks and Regards,
Prateek
Download attachment "rqlen50-pe-pinned.csv" of type "application/vnd.ms-excel" (198231 bytes)
Download attachment "rqlen50-tip-pinned.csv" of type "application/vnd.ms-excel" (23111 bytes)