[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZTIt/GGYtohNYx7f@chenyu5-mobl2.ccr.corp.intel.com>
Date: Fri, 20 Oct 2023 15:36:28 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Yicong Yang <yangyicong@...wei.com>
CC: <peterz@...radead.org>, <mingo@...hat.com>,
<juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <tim.c.chen@...ux.intel.com>,
<gautham.shenoy@....com>, <mgorman@...e.de>, <vschneid@...hat.com>,
<linux-kernel@...r.kernel.org>,
<linux-arm-kernel@...ts.infradead.org>, <rostedt@...dmis.org>,
<bsegall@...gle.com>, <bristot@...hat.com>,
<prime.zeng@...wei.com>, <yangyicong@...ilicon.com>,
<jonathan.cameron@...wei.com>, <ego@...ux.vnet.ibm.com>,
<srikar@...ux.vnet.ibm.com>, <linuxarm@...wei.com>,
<21cnbao@...il.com>, <kprateek.nayak@....com>,
<wuyun.abel@...edance.com>
Subject: Re: [PATCH v11 3/3] sched/fair: Use candidate prev/recent_used CPU
if scanning failed for cluster wakeup
On 2023-10-19 at 11:33:23 +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@...ilicon.com>
>
> Chen Yu reports a hackbench regression of cluster wakeup when
> hackbench threads equal to the CPU number [1]. Analysis shows
> it's because we wake up more on the target CPU even if the
> prev_cpu is a good wakeup candidate and leads to the decrease
> of the CPU utilization.
>
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, if the prev_cpu is idle
> but not in the cluster with the target CPU, we'll try to scan an
> idle one in the cluster. But since the system is busy, we're
> likely to fail the scanning and use target instead, even if
> the prev_cpu is idle. Then leads to the regression.
>
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
> candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
> they're recorded as idle
>
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
>
> Reported-by: Chen Yu <yu.c.chen@...el.com>
> Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
> Signed-off-by: Yicong Yang <yangyicong@...ilicon.com>
>
Tested on 24 CPUs Jacobsville machine, 4 CPUs in one cluster sharing L2 Cache.
The baseline is sched/core on top of
commit a36e5741bdc5 ("sched: Fix stop_one_cpu_nowait() vs hotplug"),
and compared with the whole patch set applied. I did not see any regression but
improvement on hackbench, please feel free to add:
Tested-and-reviewed-by: Chen Yu <yu.c.chen@...el.com>
hackbench
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.26) +6.02 ( 1.53)
process-pipe 2-groups 1.00 ( 1.03) +1.97 ( 0.70)
process-pipe 4-groups 1.00 ( 0.26) +1.80 ( 3.27)
process-sockets 1-groups 1.00 ( 0.29) +1.63 ( 0.86)
process-sockets 2-groups 1.00 ( 1.17) +2.59 ( 0.35)
process-sockets 4-groups 1.00 ( 1.07) +3.86 ( 0.51)
threads-pipe 1-groups 1.00 ( 0.79) +8.17 ( 0.48)
threads-pipe 2-groups 1.00 ( 0.65) +6.34 ( 0.23)
threads-pipe 4-groups 1.00 ( 0.38) +4.61 ( 1.04)
threads-sockets 1-groups 1.00 ( 0.73) +0.80 ( 0.35)
threads-sockets 2-groups 1.00 ( 1.09) +2.81 ( 1.18)
threads-sockets 4-groups 1.00 ( 0.67) +2.30 ( 0.20)
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 6-threads 1.00 ( 0.48) +3.97 ( 0.50)
TCP_RR 12-threads 1.00 ( 0.11) +3.83 ( 0.15)
TCP_RR 18-threads 1.00 ( 0.18) +7.53 ( 0.18)
TCP_RR 24-threads 1.00 ( 0.34) +2.40 ( 0.77)
TCP_RR 30-threads 1.00 ( 10.39) +2.22 ( 11.51)
TCP_RR 36-threads 1.00 ( 10.87) +2.06 ( 16.71)
TCP_RR 42-threads 1.00 ( 14.04) +2.10 ( 12.86)
TCP_RR 48-threads 1.00 ( 5.89) +2.15 ( 5.54)
UDP_RR 6-threads 1.00 ( 0.20) +2.99 ( 0.55)
UDP_RR 12-threads 1.00 ( 0.18) +3.65 ( 0.27)
UDP_RR 18-threads 1.00 ( 0.34) +6.62 ( 0.23)
UDP_RR 24-threads 1.00 ( 0.60) -1.73 ( 12.54)
UDP_RR 30-threads 1.00 ( 9.70) -0.62 ( 14.34)
UDP_RR 36-threads 1.00 ( 11.80) -0.05 ( 12.27)
UDP_RR 42-threads 1.00 ( 15.35) -0.05 ( 12.26)
UDP_RR 48-threads 1.00 ( 5.58) -0.12 ( 5.73)
tbench
======
case load baseline(std%) compare%( std%)
loopback 6-threads 1.00 ( 0.29) +2.51 ( 0.24)
loopback 12-threads 1.00 ( 0.08) +2.90 ( 0.47)
loopback 18-threads 1.00 ( 0.06) +6.85 ( 0.07)
loopback 24-threads 1.00 ( 0.20) +1.85 ( 0.14)
loopback 30-threads 1.00 ( 0.15) +1.37 ( 0.07)
loopback 36-threads 1.00 ( 0.12) +1.34 ( 0.07)
loopback 42-threads 1.00 ( 0.09) +0.91 ( 0.04)
loopback 48-threads 1.00 ( 0.11) +0.88 ( 0.05)
schbench
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 2.67) -1.89 ( 0.00)
normal 2-mthreads 1.00 ( 0.00) +0.00 ( 0.00)
normal 4-mthreads 1.00 ( 8.08) +12.86 ( 2.32)
thanks,
Chenyu
Powered by blists - more mailing lists