linux-kernel - Re: [PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup

Message-ID: <ZTIt/GGYtohNYx7f@chenyu5-mobl2.ccr.corp.intel.com>
Date:   Fri, 20 Oct 2023 15:36:28 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Yicong Yang <yangyicong@...wei.com>
CC:     <peterz@...radead.org>, <mingo@...hat.com>,
        <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
        <dietmar.eggemann@....com>, <tim.c.chen@...ux.intel.com>,
        <gautham.shenoy@....com>, <mgorman@...e.de>, <vschneid@...hat.com>,
        <linux-kernel@...r.kernel.org>,
        <linux-arm-kernel@...ts.infradead.org>, <rostedt@...dmis.org>,
        <bsegall@...gle.com>, <bristot@...hat.com>,
        <prime.zeng@...wei.com>, <yangyicong@...ilicon.com>,
        <jonathan.cameron@...wei.com>, <ego@...ux.vnet.ibm.com>,
        <srikar@...ux.vnet.ibm.com>, <linuxarm@...wei.com>,
        <21cnbao@...il.com>, <kprateek.nayak@....com>,
        <wuyun.abel@...edance.com>
Subject: Re: [PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU
 if scanning failed for cluster wakeup

On 2023-10-19 at 11:33:23 +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@...ilicon.com>
> 
> Chen Yu reports a hackbench regression with cluster wakeup when the
> number of hackbench threads equals the number of CPUs [1]. Analysis
> shows that we wake up tasks on the target CPU more often even when
> the prev_cpu is a good wakeup candidate, which decreases CPU
> utilization.
> 
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, when the prev_cpu is idle
> but not in the same cluster as the target CPU, we still scan for an
> idle CPU in the cluster. Since the system is busy, the scan is
> likely to fail and we use the target instead, even though the
> prev_cpu is idle. This leads to the regression.
> 
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
>   candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
>   they're recorded as idle
> 
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
> 
> Reported-by: Chen Yu <yu.c.chen@...el.com>
> Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
> Signed-off-by: Yicong Yang <yangyicong@...ilicon.com>
>
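
For anyone following the thread without the diff at hand, the two steps
in the quoted description can be sketched as a toy user-space model. This
is not the kernel code from the patch: toy_system, toy_select_cpu(),
scan_cluster_for_idle() and the fixed 4-CPU cluster layout are all
hypothetical names and assumptions made up for illustration, and the real
wakeup path has more conditions than are modelled here.

```c
#include <stdbool.h>

#define NR_CPUS      8
#define CLUSTER_SIZE 4	/* assumption: 4 CPUs share one cluster */

/* Hypothetical stand-in for per-CPU idle state. */
struct toy_system {
	bool idle[NR_CPUS];
};

static int cluster_of(int cpu)
{
	return cpu / CLUSTER_SIZE;
}

static bool share_cluster(int a, int b)
{
	return cluster_of(a) == cluster_of(b);
}

/* Scan the target's cluster for an idle CPU; return -1 if none is found. */
static int scan_cluster_for_idle(const struct toy_system *sys, int target)
{
	int first = cluster_of(target) * CLUSTER_SIZE;

	for (int cpu = first; cpu < first + CLUSTER_SIZE; cpu++)
		if (sys->idle[cpu])
			return cpu;
	return -1;
}

/* Toy wakeup CPU selection following the two steps described above. */
static int toy_select_cpu(const struct toy_system *sys,
			  int target, int prev, int recent)
{
	int candidate = -1;
	int cpu;

	if (sys->idle[target])
		return target;

	/* Step 1: record idle prev/recent CPUs that sit outside the cluster. */
	if (prev != target && sys->idle[prev]) {
		if (share_cluster(prev, target))
			return prev;
		candidate = prev;
	}
	if (recent != prev && recent != target && sys->idle[recent]) {
		if (share_cluster(recent, target))
			return recent;
		if (candidate < 0)
			candidate = recent;
	}

	cpu = scan_cluster_for_idle(sys, target);
	if (cpu >= 0)
		return cpu;

	/* Step 2: the scan failed, so fall back to a recorded candidate. */
	if (candidate >= 0)
		return candidate;

	return target;
}
```

In this model, a busy target cluster with an idle prev_cpu in another
cluster yields prev_cpu rather than the target, which is the case the
regression report was about. Preferring prev over recent when both are
recorded is a choice made for this sketch.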

Tested on a 24-CPU Jacobsville machine, with 4 CPUs per cluster sharing an
L2 cache. The baseline is sched/core on top of
commit a36e5741bdc5 ("sched: Fix stop_one_cpu_nowait() vs hotplug"),
compared against the same tree with the whole patch set applied. I did not
see any regression, and hackbench improved. Please feel free to add:

Tested-and-reviewed-by: Chen Yu <yu.c.chen@...el.com>


hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  0.26)	 +6.02 (  1.53)
process-pipe    	2-groups	 1.00 (  1.03)	 +1.97 (  0.70)
process-pipe    	4-groups	 1.00 (  0.26)	 +1.80 (  3.27)
process-sockets 	1-groups	 1.00 (  0.29)	 +1.63 (  0.86)
process-sockets 	2-groups	 1.00 (  1.17)	 +2.59 (  0.35)
process-sockets 	4-groups	 1.00 (  1.07)	 +3.86 (  0.51)
threads-pipe    	1-groups	 1.00 (  0.79)	 +8.17 (  0.48)
threads-pipe    	2-groups	 1.00 (  0.65)	 +6.34 (  0.23)
threads-pipe    	4-groups	 1.00 (  0.38)	 +4.61 (  1.04)
threads-sockets 	1-groups	 1.00 (  0.73)	 +0.80 (  0.35)
threads-sockets 	2-groups	 1.00 (  1.09)	 +2.81 (  1.18)
threads-sockets 	4-groups	 1.00 (  0.67)	 +2.30 (  0.20)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	6-threads	 1.00 (  0.48)	 +3.97 (  0.50)
TCP_RR          	12-threads	 1.00 (  0.11)	 +3.83 (  0.15)
TCP_RR          	18-threads	 1.00 (  0.18)	 +7.53 (  0.18)
TCP_RR          	24-threads	 1.00 (  0.34)	 +2.40 (  0.77)
TCP_RR          	30-threads	 1.00 ( 10.39)	 +2.22 ( 11.51)
TCP_RR          	36-threads	 1.00 ( 10.87)	 +2.06 ( 16.71)
TCP_RR          	42-threads	 1.00 ( 14.04)	 +2.10 ( 12.86)
TCP_RR          	48-threads	 1.00 (  5.89)	 +2.15 (  5.54)
UDP_RR          	6-threads	 1.00 (  0.20)	 +2.99 (  0.55)
UDP_RR          	12-threads	 1.00 (  0.18)	 +3.65 (  0.27)
UDP_RR          	18-threads	 1.00 (  0.34)	 +6.62 (  0.23)
UDP_RR          	24-threads	 1.00 (  0.60)	 -1.73 ( 12.54)
UDP_RR          	30-threads	 1.00 (  9.70)	 -0.62 ( 14.34)
UDP_RR          	36-threads	 1.00 ( 11.80)	 -0.05 ( 12.27)
UDP_RR          	42-threads	 1.00 ( 15.35)	 -0.05 ( 12.26)
UDP_RR          	48-threads	 1.00 (  5.58)	 -0.12 (  5.73)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	6-threads	 1.00 (  0.29)	 +2.51 (  0.24)
loopback        	12-threads	 1.00 (  0.08)	 +2.90 (  0.47)
loopback        	18-threads	 1.00 (  0.06)	 +6.85 (  0.07)
loopback        	24-threads	 1.00 (  0.20)	 +1.85 (  0.14)
loopback        	30-threads	 1.00 (  0.15)	 +1.37 (  0.07)
loopback        	36-threads	 1.00 (  0.12)	 +1.34 (  0.07)
loopback        	42-threads	 1.00 (  0.09)	 +0.91 (  0.04)
loopback        	48-threads	 1.00 (  0.11)	 +0.88 (  0.05)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  2.67)	 -1.89 (  0.00)
normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	4-mthreads	 1.00 (  8.08)	+12.86 (  2.32)

thanks,
Chenyu