Message-ID: <cc582ddb-2f16-4c0b-be27-b9a1dedb646a@linux.ibm.com>
Date: Tue, 22 Jul 2025 01:07:09 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>, clm@...a.com
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com
Subject: Re: [PATCH v2 00/12] sched: Address schbench regression
On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
>
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
>
> Moo.. Are IPIs particularly expensive on your platform?
>
>
It seems like the cost of IPIs is likely what is hurting here.
IPI latency depends a lot on whether the target CPU is busy, in a shallow idle state, or in a deep idle state.
When the CPU is in a deep idle state, the numbers are close to 5-8us on average on this small system.
When the system is busy (the CPU could be running another schbench thread), it is around 1-2us.
I also measured the time it takes to acquire the remote rq lock in the baseline; that is only around 1-1.5us (a rough sketch of that kind of instrumentation is below).
Also, here the LLC is just one small core (an SMT4 core), so quite often the series would choose to send an IPI.
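
Something along these lines is what I mean by measuring the lock time; this is only a
sketch against ttwu_queue() in kernel/sched/core.c (the 1us threshold is arbitrary),
not the exact patch used:

        /*
         * Sketch: time the remote rq lock acquisition in ttwu_queue() and
         * report the slow acquisitions via trace_printk().
         */
        static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
        {
                struct rq *rq = cpu_rq(cpu);
                struct rq_flags rf;
                u64 t0, t1;

                if (ttwu_queue_wakelist(p, cpu, wake_flags))
                        return;

                t0 = sched_clock();
                rq_lock(rq, &rf);
                t1 = sched_clock();
                if (t1 - t0 > 1000)     /* report acquisitions slower than 1us */
                        trace_printk("remote rq lock: %llu ns\n",
                                     (unsigned long long)(t1 - t0));

                update_rq_clock(rq);
                ttwu_do_activate(rq, p, wake_flags, &rf);
                rq_unlock(rq, &rf);
        }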
I did one more experiment: pin the worker and message threads such that the wakeup always has to send an IPI.
NO_TTWU_QUEUE_DELAYED
./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
average rps: 1549224.72
./schbench -L -m 4 -M 0-3 -W 4-39 -t 64 -n 0 -r 5 -i 5
average rps: 1560839.00
TTWU_QUEUE_DELAYED
./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5 << IPI could be sent quite often ***
average rps: 959522.31
./schbench -L -m 4 -M 0-3 -W 4-39 -t 64 -n 0 -r 5 -i 5 << IPI are always sent. (M,W) don't share cache.
average rps: 470865.00 << rps goes even lower
=================================
*** issues/observations in schbench.
Chris,
When one does -W auto or -M auto, I think the code is meant to run n message threads on the first n CPUs and the worker threads
on the remaining CPUs?
I don't see that happening. The above behavior can be achieved only with an explicit -M <cpus> -W <cpus>.
        int i = 0;

        CPU_ZERO(m_cpus);
        for (int i = 0; i < m_threads; ++i) {
                CPU_SET(i, m_cpus);
                CPU_CLR(i, w_cpus);
        }
        for (; i < CPU_SETSIZE; i++) {  /* <-- here i refers to the outer one, which is still 0,
                                         * so w_cpus gets set for every CPU and workers end up
                                         * running on all CPUs even with -W auto */
                CPU_SET(i, w_cpus);
        }
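
If that is the intent, a minimal sketch of a fix (untested, just dropping the shadowing
declaration so the second loop continues from m_threads) would be:

        int i;

        CPU_ZERO(m_cpus);
        for (i = 0; i < m_threads; ++i) {       /* reuse the outer i ... */
                CPU_SET(i, m_cpus);
                CPU_CLR(i, w_cpus);
        }
        for (; i < CPU_SETSIZE; i++)            /* ... so workers only get the remaining CPUs */
                CPU_SET(i, w_cpus);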
Another issue is that if CPU0 is offline, then auto pinning fails. Maybe no one cares about that case?
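
In case it matters, a hypothetical sketch (the helper name and structure are mine, not
schbench's) that builds the sets from the CPUs the process is actually allowed to run on,
instead of assuming CPUs 0..m_threads-1 exist:

        #define _GNU_SOURCE
        #include <sched.h>

        /* Pick the first m_threads allowed CPUs for message threads and give
         * the remaining allowed CPUs to the workers. */
        static void auto_pin_from_allowed(cpu_set_t *m_cpus, cpu_set_t *w_cpus,
                                          int m_threads)
        {
                cpu_set_t allowed;
                int i, picked = 0;

                CPU_ZERO(&allowed);
                if (sched_getaffinity(0, sizeof(allowed), &allowed))
                        return;         /* fall back to whatever the caller had */

                CPU_ZERO(m_cpus);
                CPU_ZERO(w_cpus);
                for (i = 0; i < CPU_SETSIZE; i++) {
                        if (!CPU_ISSET(i, &allowed))
                                continue;
                        if (picked++ < m_threads)
                                CPU_SET(i, m_cpus);
                        else
                                CPU_SET(i, w_cpus);
                }
        }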