Message-ID: <Zq7+G1YmKG71WIQ5@chenyu5-mobl2>
Date: Sun, 4 Aug 2024 12:05:47 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>, "Dietmar
Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Daniel
Bristot de Oliveira" <bristot@...hat.com>, Valentin Schneider
<vschneid@...hat.com>, "Paul E. McKenney" <paulmck@...nel.org>, Imran Khan
<imran.f.khan@...cle.com>, Leonardo Bras <leobras@...hat.com>, Guo Ren
<guoren@...nel.org>, Rik van Riel <riel@...riel.com>, Tejun Heo
<tj@...nel.org>, Cruz Zhao <CruzZhao@...ux.alibaba.com>, Lai Jiangshan
<jiangshanlai@...il.com>, Joel Fernandes <joel@...lfernandes.org>, Zqiang
<qiang.zhang1211@...il.com>, Julia Lawall <julia.lawall@...ia.fr>, "Gautham
R. Shenoy" <gautham.shenoy@....com>
Subject: Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry
fast-path in __schedule()
On 2024-07-31 at 00:13:40 +0800, Chen Yu wrote:
> On 2024-07-10 at 09:02:09 +0000, K Prateek Nayak wrote:
> > From: Peter Zijlstra <peterz@...radead.org>
> >
> > Since commit b2a02fc43a1f ("smp: Optimize
> > send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
> > can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
> > IPI without actually sending an interrupt. Even in cases where the IPI
> > handler does not queue a task on the idle CPU, do_idle() will call
> > __schedule() since need_resched() returns true in these cases.
> >
> > Introduce and use SM_IDLE to identify call to __schedule() from
> > schedule_idle() and shorten the idle re-entry time by skipping
> > pick_next_task() when nr_running is 0 and the previous task is the idle
> > task.
> >
> > With the SM_IDLE fast-path, the time taken to complete a fixed set of
> > IPIs using ipistorm improves significantly. Following are the numbers
> > from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
> > C2 disabled) running ipistorm between CPU8 and CPU16:
> >
> > cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
> >
> > ==================================================================
> > Test : ipistorm (modified)
> > Units : Normalized runtime
> > Interpretation: Lower is better
> > Statistic : AMean
> > ==================================================================
> > kernel: time [pct imp]
> > tip:sched/core 1.00 [baseline]
> > tip:sched/core + SM_IDLE 0.25 [75.11%]
> >
> > [ kprateek: Commit log and testing ]
> >
> > Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
> > Not-yet-signed-off-by: Peter Zijlstra <peterz@...radead.org>
> > Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> >
>
> Only with current patch applied on top of sched/core commit c793a62823d1,
> a significant throughput/run-to-run variance improvement is observed
> on an Intel 240 CPUs/ 2 Nodes server. C-states >= C1E are disabled,
> CPU frequency governor is set to performance and turbo-boost disabled.
>
> Without the patch(lower the better):
>
> 158490995
> 113086433
> 737869191
> 302454894
> 731262790
> 677283357
> 729767478
> 830949261
> 399824606
> 743681976
>
> (Amean): 542467098
> (Std): 257011706
>
>
> With the patch(lower the better):
> 128060992
> 115646768
> 132734621
> 150330954
> 113143538
> 169875051
> 145010400
> 151589193
> 162165800
> 159963320
>
> (Amean): 142852063
> (Std): 18646313
>
> I've launched full tests for schbench/hackbench/netperf/tbench
> to see if there is any difference.
>
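The Amean and Std figures in the quoted runs can be cross-checked from the ten samples listed for each case. The following is a minimal Python sketch, not part of the original test setup. It assumes that Std is the population standard deviation (dividing by N), which is inferred from the numbers rather than stated in the thread. The variable and function names are illustrative only.

```python
from statistics import fmean, pstdev

# Samples copied from the quoted runs above (lower is better).
without_patch = [
    158490995, 113086433, 737869191, 302454894, 731262790,
    677283357, 729767478, 830949261, 399824606, 743681976,
]
with_patch = [
    128060992, 115646768, 132734621, 150330954, 113143538,
    169875051, 145010400, 151589193, 162165800, 159963320,
]


def summarize(samples):
    """Return (arithmetic mean, population standard deviation)."""
    # Assumption: "Std" in the report divides by N, not N - 1.
    return fmean(samples), pstdev(samples)


base_mean, base_std = summarize(without_patch)
patch_mean, patch_std = summarize(with_patch)

# Relative reduction of the mean, as a fraction of the baseline mean.
mean_reduction = 1 - patch_mean / base_mean

print(f"without patch: amean={base_mean:.0f} std={base_std:.0f}")
print(f"with patch:    amean={patch_mean:.0f} std={patch_std:.0f}")
print(f"mean reduction: {mean_reduction:.1%}")
```

If the sample standard deviation (N - 1) was intended instead, swap `pstdev` for `stdev`; the means are unaffected.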
These tests were run without CONFIG_PREEMPT_RT, so the SM_RTLOCK_WAIT issue
mentioned by Vincent should not affect the results. No obvious difference
(regression) was detected by the tests in the 0day environment. Overall,
this patch looks good to me. Once you send out a refreshed version, I'll
re-launch the tests.
Tested on a Xeon server with 128 CPUs and 4 NUMA nodes, under different
load levels:

                                baseline              with-SM_IDLE
hackbench
load level (25% ~ 100%)
hackbench-pipe-process.throughput
  25%:                            846099         -0.3%      843217
  50%:                            972015         +0.0%      972185
  100%:                          1395650         -1.3%     1376963
hackbench-pipe-threads.throughput
  25%:                            746629         -0.0%      746345
  50%:                            885165         -0.4%      881602
  100%:                          1227790         +1.3%     1243757
hackbench-socket-process.throughput
  25%:                            395784         +1.2%      400717
  50%:                            441312         +0.3%      442783
  100%:                           324283 ± 2%    +6.0%      343826
hackbench-socket-threads.throughput
  25%:                            379700         -0.8%      376642
  50%:                            425315         -0.4%      423749
  100%:                           311937 ± 2%    +0.9%      314892

                                baseline              with-SM_IDLE
schbench.request_latency_90%_us
  1-mthread-1-worker:               4562         -0.0%        4560
  1-mthread-16-workers:             4564         -0.0%        4563
  12.5%-mthread-1:                  4565         +0.0%        4567
  12.5%-mthread-16-workers:        39204         +0.1%       39248
  25%-mthread-1-worker:             4574         +0.0%        4574
  25%-mthread-16-workers:         161944         +0.1%      162053
  50%-mthread-1-workers:            4784 ± 5%    +0.1%        4789 ± 5%
  50%-mthread-16-workers:         659156         +0.4%      661679
  100%-mthread-1-workers:           9328         +0.0%        9329
  100%-mthread-16-workers:       2489753         -0.7%     2472140

                      baseline        with-SM_IDLE
netperf.Throughput:
  25%-TCP_RR:          2449875   +0.0%     2450622       netperf.Throughput_total_tps
  25%-UDP_RR:          2746806   +0.1%     2748935       netperf.Throughput_total_tps
  25%-TCP_STREAM:      1352061   +0.7%     1361497       netperf.Throughput_total_Mbps
  25%-UDP_STREAM:      1815205   +0.1%     1816202       netperf.Throughput_total_Mbps
  50%-TCP_RR:          3981514   -0.3%     3970327       netperf.Throughput_total_tps
  50%-UDP_RR:          4496584   -1.3%     4438363       netperf.Throughput_total_tps
  50%-TCP_STREAM:      1478872   +0.4%     1484196       netperf.Throughput_total_Mbps
  50%-UDP_STREAM:      1739540   +0.3%     1744074       netperf.Throughput_total_Mbps
  75%-TCP_RR:          3696607   -0.5%     3677044       netperf.Throughput_total_tps
  75%-UDP_RR:          4161206   +1.3%     4217274 ± 2%  netperf.Throughput_total_tps
  75%-TCP_STREAM:       895874   +5.7%      946546 ± 5%  netperf.Throughput_total_Mbps
  75%-UDP_STREAM:      4100019   -0.3%     4088367       netperf.Throughput_total_Mbps
  100%-TCP_RR:         6724456   -1.7%     6610976       netperf.Throughput_total_tps
  100%-UDP_RR:         7329959   -0.5%     7294653       netperf.Throughput_total_tps
  100%-TCP_STREAM:      808165   +0.3%      810360       netperf.Throughput_total_Mbps
  100%-UDP_STREAM:     5562651   +0.0%     5564106       netperf.Throughput_total_Mbps
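
The percentage column in the results above is the relative change of the with-SM_IDLE value against the baseline. As a rough cross-check, the sketch below recomputes it for the rows with the largest deltas. It is only an illustration in Python: the `pct_change` helper and the `rows` list are hypothetical names, and the values are copied from the results above.

```python
def pct_change(baseline, patched):
    """Relative change of patched vs. baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


# (test case, baseline, with-SM_IDLE, change as reported)
rows = [
    ("hackbench-socket-process 100%", 324283, 343826, "+6.0%"),
    ("netperf 75%-TCP_STREAM", 895874, 946546, "+5.7%"),
    ("netperf 100%-TCP_RR", 6724456, 6610976, "-1.7%"),
    ("netperf 50%-UDP_RR", 4496584, 4438363, "-1.3%"),
]

for name, base, new, reported in rows:
    computed = f"{pct_change(base, new):+.1f}%"
    print(f"{name:32s} computed {computed:>6s}  reported {reported:>6s}")
```

Both of the larger positive deltas sit on rows that also carry a ± annotation, so they are best read together with that variation figure.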
thanks,
Chenyu