Message-ID: <8bcdee16-88d5-7ec5-7d88-1ac11566c28a@amd.com>
Date: Mon, 5 Aug 2024 09:33:28 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chen Yu <yu.c.chen@...el.com>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>, "Dietmar
 Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Daniel
 Bristot de Oliveira" <bristot@...hat.com>, Valentin Schneider
	<vschneid@...hat.com>, "Paul E. McKenney" <paulmck@...nel.org>, Imran Khan
	<imran.f.khan@...cle.com>, Leonardo Bras <leobras@...hat.com>, Guo Ren
	<guoren@...nel.org>, Rik van Riel <riel@...riel.com>, Tejun Heo
	<tj@...nel.org>, Cruz Zhao <CruzZhao@...ux.alibaba.com>, Lai Jiangshan
	<jiangshanlai@...il.com>, Joel Fernandes <joel@...lfernandes.org>, Zqiang
	<qiang.zhang1211@...il.com>, Julia Lawall <julia.lawall@...ia.fr>, "Gautham
 R. Shenoy" <gautham.shenoy@....com>
Subject: Re: [PATCH 2/3] sched/core: Introduce SM_IDLE and an idle re-entry
 fast-path in __schedule()

Hello Chenyu,

Thank you for testing the series. I'll have a second version out soon.

On 8/4/2024 9:35 AM, Chen Yu wrote:
> On 2024-07-31 at 00:13:40 +0800, Chen Yu wrote:
>> On 2024-07-10 at 09:02:09 +0000, K Prateek Nayak wrote:
>>> From: Peter Zijlstra <peterz@...radead.org>
>>>
>>> Since commit b2a02fc43a1f ("smp: Optimize
>>> send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode
>>> can be pulled out of idle by setting TIF_NEED_RESCHED flag to service an
>>> IPI without actually sending an interrupt. Even in cases where the IPI
>>> handler does not queue a task on the idle CPU, do_idle() will call
>>> __schedule() since need_resched() returns true in these cases.
>>>
>>> Introduce and use SM_IDLE to identify call to __schedule() from
>>> schedule_idle() and shorten the idle re-entry time by skipping
>>> pick_next_task() when nr_running is 0 and the previous task is the idle
>>> task.
>>>
>>> With the SM_IDLE fast-path, the time taken to complete a fixed set of
>>> IPIs using ipistorm improves significantly. Following are the numbers
>>> from a dual socket 3rd Generation EPYC system (2 x 64C/128T) (boost on,
>>> C2 disabled) running ipistorm between CPU8 and CPU16:
>>>
>>> cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1
>>>
>>>     ==================================================================
>>>     Test          : ipistorm (modified)
>>>     Units         : Normalized runtime
>>>     Interpretation: Lower is better
>>>     Statistic     : AMean
>>>     ==================================================================
>>>     kernel:				time [pct imp]
>>>     tip:sched/core			1.00 [baseline]
>>>     tip:sched/core + SM_IDLE		0.25 [75.11%]
>>>
>>> [ kprateek: Commit log and testing ]
>>>
>>> Link: https://lore.kernel.org/lkml/20240615012814.GP8774@noisy.programming.kicks-ass.net/
>>> Not-yet-signed-off-by: Peter Zijlstra <peterz@...radead.org>
>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
>>>
>>
>> With only the current patch applied on top of sched/core commit
>> c793a62823d1, a significant improvement in throughput and run-to-run
>> variance is observed on an Intel server with 240 CPUs / 2 nodes.
>> C-states >= C1E are disabled, the CPU frequency governor is set to
>> performance, and turbo boost is disabled.
>>
>> Without the patch (lower is better):
>>
>> 158490995
>> 113086433
>> 737869191
>> 302454894
>> 731262790
>> 677283357
>> 729767478
>> 830949261
>> 399824606
>> 743681976
>>
>> (Amean): 542467098
>> (Std):   257011706
>>
>>
>> With the patch (lower is better):
>> 128060992
>> 115646768
>> 132734621
>> 150330954
>> 113143538
>> 169875051
>> 145010400
>> 151589193
>> 162165800
>> 159963320
>>
>> (Amean): 142852063
>> (Std):    18646313
>>
>> I've launched full tests for schbench/hackbench/netperf/tbench
>> to see if there is any difference.
>>
> 
> Tested without CONFIG_PREEMPT_RT, so the SM_RTLOCK_WAIT issue mentioned
> by Vincent should have no impact here. No obvious difference
> (regression) was detected in the tests in the 0day environment. Overall
> this patch looks good to me. Once you send a refreshed version out I'll
> re-launch the tests.

Since SM_RTLOCK_WAIT is only used by schedule_rtlock(), which is only
defined for PREEMPT_RT kernels, non-RT builds should have no issue. I
could spot at least one case in rtlock_slowlock_locked() where
prev->__state is set to "TASK_RTLOCK_WAIT" and schedule_rtlock() is
called. With this patch, it would pass the "sched_mode > SM_NONE" check
and be treated as an involuntary context-switch, but on tip
(preempt & SM_MASK_PREEMPT) would evaluate to false and it would
eventually call deactivate_task() to dequeue the waiting task, so this
does need fixing.

From a brief look, all calls to schedule with "SM_RTLOCK_WAIT" already
set task->__state to a non-zero value. I'll look into this further
after the respin and see if some optimization is possible there, but
for the time being I'll respin this with the condition changed to:

	...
	} else if (preempt != SM_PREEMPT && prev_state) {
	...

just to keep it explicit.

Thank you again for testing this version.
-- 
Thanks and Regards,
Prateek

> 
> Tested on Xeon server with 128 CPUs, 4 Numa nodes, under different
> 
>        baseline                  with-SM_IDLE
> 
> hackbench
> load level (25% ~ 100%)
> 
> hackbench-pipe-process.throughput
> %25:
>      846099            -0.3%     843217
> %50:
>      972015            +0.0%     972185
> %100:
>     1395650            -1.3%    1376963
> 
> hackbench-pipe-threads.throughput
> %25:
>      746629            -0.0%     746345
> %50:
>      885165            -0.4%     881602
> %100:
>     1227790            +1.3%    1243757
> 
> hackbench-socket-process.throughput
> %25:
>      395784            +1.2%     400717
> %50:
>      441312            +0.3%     442783
> %100:
>      324283 ±  2%      +6.0%     343826
> 
> hackbench-socket-threads.throughput
> %25:
>      379700            -0.8%     376642
> %50:
>      425315            -0.4%     423749
> %100:
>      311937 ±  2%      +0.9%     314892
> 
> 
> 
>        baseline                  with-SM_IDLE
> 
> schbench.request_latency_90%_us
> 
> 1-mthread-1-worker:
>        4562            -0.0%       4560
> 1-mthread-16-workers:
>        4564            -0.0%       4563
> 12.5%-mthread-1:
>        4565            +0.0%       4567
> 12.5%-mthread-16-workers:
>       39204            +0.1%      39248
> 25%-mthread-1-worker:
>        4574            +0.0%       4574
> 25%-mthread-16-workers:
>      161944            +0.1%     162053
> 50%-mthread-1-workers:
>        4784 ±  5%      +0.1%       4789 ±  5%
> 50%-mthread-16-workers:
>      659156            +0.4%     661679
> 100%-mthread-1-workers:
>        9328            +0.0%       9329
> 100%-mthread-16-workers:
>     2489753            -0.7%    2472140
> 
> 
>        baseline                  with-SM_IDLE
> 
> netperf.Throughput:
> 
> 25%-TCP_RR:
>     2449875            +0.0%    2450622        netperf.Throughput_total_tps
> 25%-UDP_RR:
>     2746806            +0.1%    2748935        netperf.Throughput_total_tps
> 25%-TCP_STREAM:
>     1352061            +0.7%    1361497        netperf.Throughput_total_Mbps
> 25%-UDP_STREAM:
>     1815205            +0.1%    1816202        netperf.Throughput_total_Mbps
> 50%-TCP_RR:
>     3981514            -0.3%    3970327        netperf.Throughput_total_tps
> 50%-UDP_RR:
>     4496584            -1.3%    4438363        netperf.Throughput_total_tps
> 50%-TCP_STREAM:
>     1478872            +0.4%    1484196        netperf.Throughput_total_Mbps
> 50%-UDP_STREAM:
>     1739540            +0.3%    1744074        netperf.Throughput_total_Mbps
> 75%-TCP_RR:
>     3696607            -0.5%    3677044        netperf.Throughput_total_tps
> 75%-UDP_RR:
>     4161206            +1.3%    4217274 ±  2%  netperf.Throughput_total_tps
> 75%-TCP_STREAM:
>      895874            +5.7%     946546 ±  5%  netperf.Throughput_total_Mbps
> 75%-UDP_STREAM:
>     4100019            -0.3%    4088367        netperf.Throughput_total_Mbps
> 100%-TCP_RR:
>     6724456            -1.7%    6610976        netperf.Throughput_total_tps
> 100%-UDP_RR:
>     7329959            -0.5%    7294653        netperf.Throughput_total_tps
> 100%-TCP_STREAM:
>      808165            +0.3%     810360        netperf.Throughput_total_Mbps
> 100%-UDP_STREAM:
>     5562651            +0.0%    5564106        netperf.Throughput_total_Mbps
> 
> thanks,
> Chenyu