Message-ID: <c6c0c135-9d8f-4d9d-8fc5-bc703cac9bdb@linux.ibm.com>
Date: Mon, 14 Jul 2025 23:24:36 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, clm@...a.com
Subject: Re: [PATCH v2 00/12] sched: Address schbench regression
On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
>
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
>
> Moo.. Are IPIs particularly expensive on your platform?
>
> The 5 cores makes me think this is a partition of sorts, but IIRC the
> power LPAR stuff was fixed physical, so routing interrupts shouldn't be
> much more expensive vs native hardware.
>
Some more data on the regression. I am looking at rps numbers
while running ./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5.
All of the data is from an LPAR (VM) with 5 cores.
echo TTWU_QUEUE_DELAYED > features
average rps: 970491.00
echo NO_TTWU_QUEUE_DELAYED > features
current rps: 1555456.78
So the data points below are with the feature enabled vs disabled, with the series applied plus clm's patch.
-------------------------------------------------------
./hardirqs
TTWU_QUEUE_DELAYED
HARDIRQ TOTAL_usecs
env2 816
IPI-2 1421603 << much less time in IPIs with the feature enabled than without.
NO_TTWU_QUEUE_DELAYED
HARDIRQ TOTAL_usecs
ibmvscsi 8
env2 266
IPI-2 6489980
-------------------------------------------------------
Disabled all the idle states. The regression still exists.
-------------------------------------------------------
I see this warning every time I run schbench, and only with PATCH 12/12 applied.
It is this check that triggers. Is some clock update getting messed up?
1637 static inline void assert_clock_updated(struct rq *rq)
1638 {
1639 /*
1640 * The only reason for not seeing a clock update since the
1641 * last rq_pin_lock() is if we're currently skipping updates.
1642 */
1643 WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP);
1644 }
WARNING: kernel/sched/sched.h:1643 at update_load_avg+0x424/0x48c, CPU#6: swapper/6/0
CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Kdump: loaded Not tainted 6.16.0-rc4+ #276 PREEMPT(voluntary)
NIP: c0000000001cea60 LR: c0000000001d7254 CTR: c0000000001d77b0
REGS: c000000003a674c0 TRAP: 0700 Not tainted (6.16.0-rc4+)
MSR: 8000000000021033 <SF,ME,IR,DR,RI,LE> CR: 28008208 XER: 20040000
CFAR: c0000000001ce68c IRQMASK: 3
GPR00: c0000000001d7254 c000000003a67760 c000000001bc8100 c000000061915400
GPR04: c00000008c80f480 0000000000000005 c000000003a679b0 0000000000000000
GPR08: 0000000000000001 0000000000000000 c0000003ff14d480 0000000000004000
GPR12: c0000000001d77b0 c0000003ffff7880 0000000000000000 000000002eef18c0
GPR16: 0000000000000006 0000000000000006 0000000000000008 c000000002ca2468
GPR20: 0000000000000000 0000000000000004 0000000000000009 0000000000000001
GPR24: 0000000000000000 0000000000000001 0000000000000001 c0000003ff14d480
GPR28: 0000000000000001 0000000000000005 c00000008c80f480 c000000061915400
NIP [c0000000001cea60] update_load_avg+0x424/0x48c
LR [c0000000001d7254] enqueue_entity+0x5c/0x5b8
Call Trace:
[c000000003a67760] [c000000003a677d0] 0xc000000003a677d0 (unreliable)
[c000000003a677d0] [c0000000001d7254] enqueue_entity+0x5c/0x5b8
[c000000003a67880] [c0000000001d7918] enqueue_task_fair+0x168/0x7d8
[c000000003a678f0] [c0000000001b9554] enqueue_task+0x5c/0x1c8
[c000000003a67930] [c0000000001c3f40] ttwu_do_activate+0x98/0x2fc
[c000000003a67980] [c0000000001c4460] sched_ttwu_pending+0x2bc/0x72c
[c000000003a67a60] [c0000000002c16ac] __flush_smp_call_function_queue+0x1a0/0x750
[c000000003a67b10] [c00000000005e1c4] smp_ipi_demux_relaxed+0xec/0xf4
[c000000003a67b50] [c000000000057dd4] doorbell_exception+0xe0/0x25c
[c000000003a67b90] [c0000000000383d0] __replay_soft_interrupts+0xf0/0x154
[c000000003a67d40] [c000000000038684] arch_local_irq_restore.part.0+0x1cc/0x214
[c000000003a67d90] [c0000000001b6ec8] finish_task_switch.isra.0+0xb4/0x2f8
[c000000003a67e30] [c00000000110fb9c] __schedule+0x294/0x83c
[c000000003a67ee0] [c0000000011105f0] schedule_idle+0x3c/0x64
[c000000003a67f10] [c0000000001f27f0] do_idle+0x15c/0x1ac
[c000000003a67f60] [c0000000001f2b08] cpu_startup_entry+0x4c/0x50
[c000000003a67f90] [c00000000005ede0] start_secondary+0x284/0x288
[c000000003a67fe0] [c00000000000e058] start_secondary_prolog+0x10/0x14
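Trying to make sense of the warning: rq_pin_lock() clears RQCF_UPDATED, so any
section that pins the rq lock and then reads the clock (enqueue_entity() ->
update_load_avg() here) needs its own update_rq_clock(). Rough sketch of that
pattern below (the function name is made up; IIRC sched_ttwu_pending() already
does essentially this for the regular queued wakeups):

/*
 * Rough sketch, not actual kernel code: refresh the rq clock after
 * pinning the lock and before the enqueue, so clock_update_flags has
 * RQCF_UPDATED set by the time update_load_avg() asserts on it.
 */
static void ttwu_delayed_sketch(struct rq *rq, struct task_struct *p,
                                int wake_flags)
{
        struct rq_flags rf;

        rq_lock_irqsave(rq, &rf);       /* rq_pin_lock() clears RQCF_UPDATED */
        update_rq_clock(rq);            /* sets RQCF_UPDATED again */
        ttwu_do_activate(rq, p, wake_flags, &rf);
        rq_unlock_irqrestore(rq, &rf);
}

If the delayed-task path added in PATCH 12/12 reaches the enqueue without a
fresh update_rq_clock() since the last rq_pin_lock(), that would explain the
warning. Just a guess, though.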
----------------------------------------------------------------
perf stat -a (idle states enabled):
TTWU_QUEUE_DELAYED:
13,612,930 context-switches # 0.000 /sec
912,737 cpu-migrations # 0.000 /sec
1,245 page-faults # 0.000 /sec
449,817,741,085 cycles
137,051,199,092 instructions # 0.30 insn per cycle
25,789,965,217 branches # 0.000 /sec
286,202,628 branch-misses # 1.11% of all branches
NO_TTWU_QUEUE_DELAYED:
24,782,786 context-switches # 0.000 /sec
4,697,384 cpu-migrations # 0.000 /sec
1,250 page-faults # 0.000 /sec
701,934,506,023 cycles
220,728,025,829 instructions # 0.31 insn per cycle
40,271,327,989 branches # 0.000 /sec
474,496,395 branch-misses # 1.18% of all branches
Both cycles and instructions are lower with the feature enabled, roughly in proportion to the lower rps.
-------------------------------------------------------------------
perf stat -a (idle states disabled):
TTWU_QUEUE_DELAYED:
15,402,193 context-switches # 0.000 /sec
1,237,128 cpu-migrations # 0.000 /sec
1,245 page-faults # 0.000 /sec
781,215,992,865 cycles
149,112,303,840 instructions # 0.19 insn per cycle
28,240,010,182 branches # 0.000 /sec
294,485,795 branch-misses # 1.04% of all branches
NO_TTWU_QUEUE_DELAYED:
25,332,898 context-switches # 0.000 /sec
4,756,682 cpu-migrations # 0.000 /sec
1,256 page-faults # 0.000 /sec
781,318,730,494 cycles
220,536,732,094 instructions # 0.28 insn per cycle
40,424,495,545 branches # 0.000 /sec
446,724,952 branch-misses # 1.11% of all branches
Since idle states are disabled, cycles are always being spent on the CPUs, so the cycle counts are nearly
identical, while the instruction counts differ (IPC drops from 0.28 to 0.19 with the feature enabled).
Does that mean that with the feature enabled a lock (maybe the rq lock) is held for too long?
--------------------------------------------------------------------
Will try to gather more data on why this is happening.