Message-ID: <c6c0c135-9d8f-4d9d-8fc5-bc703cac9bdb@linux.ibm.com>
Date: Mon, 14 Jul 2025 23:24:36 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, clm@...a.com
Subject: Re: [PATCH v2 00/12] sched: Address schbench regression
On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
>
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
>
> Moo.. Are IPIs particularly expensive on your platform?
>
> The 5 cores makes me think this is a partition of sorts, but IIRC the
> power LPAR stuff was fixed physical, so routing interrupts shouldn't be
> much more expensive vs native hardware.
>
Some more data on the regression. I am looking at rps numbers
while running ./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5.
All of the data is from an LPAR (VM) with 5 cores.
echo TTWU_QUEUE_DELAYED > features
average rps: 970491.00
echo NO_TTWU_QUEUE_DELAYED > features
current rps: 1555456.78
So the data points below are with the feature enabled vs disabled, with the series applied plus clm's patch.
-------------------------------------------------------
./hardirqs
TTWU_QUEUE_DELAYED
HARDIRQ TOTAL_usecs
env2 816
IPI-2 1421603 << much less time in IPIs with the feature enabled than without.
NO_TTWU_QUEUE_DELAYED
HARDIRQ TOTAL_usecs
ibmvscsi 8
env2 266
IPI-2 6489980
-------------------------------------------------------
Disabled all the idle states. The regression still exists.
-------------------------------------------------------
I see this warning every time I run schbench, and only with PATCH 12/12 applied.
It is this check that triggers. Is some clock update getting messed up?
1637 static inline void assert_clock_updated(struct rq *rq)
1638 {
1639 /*
1640 * The only reason for not seeing a clock update since the
1641 * last rq_pin_lock() is if we're currently skipping updates.
1642 */
1643 WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP);
1644 }
WARNING: kernel/sched/sched.h:1643 at update_load_avg+0x424/0x48c, CPU#6: swapper/6/0
CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Kdump: loaded Not tainted 6.16.0-rc4+ #276 PREEMPT(voluntary)
NIP: c0000000001cea60 LR: c0000000001d7254 CTR: c0000000001d77b0
REGS: c000000003a674c0 TRAP: 0700 Not tainted (6.16.0-rc4+)
MSR: 8000000000021033 <SF,ME,IR,DR,RI,LE> CR: 28008208 XER: 20040000
CFAR: c0000000001ce68c IRQMASK: 3
GPR00: c0000000001d7254 c000000003a67760 c000000001bc8100 c000000061915400
GPR04: c00000008c80f480 0000000000000005 c000000003a679b0 0000000000000000
GPR08: 0000000000000001 0000000000000000 c0000003ff14d480 0000000000004000
GPR12: c0000000001d77b0 c0000003ffff7880 0000000000000000 000000002eef18c0
GPR16: 0000000000000006 0000000000000006 0000000000000008 c000000002ca2468
GPR20: 0000000000000000 0000000000000004 0000000000000009 0000000000000001
GPR24: 0000000000000000 0000000000000001 0000000000000001 c0000003ff14d480
GPR28: 0000000000000001 0000000000000005 c00000008c80f480 c000000061915400
NIP [c0000000001cea60] update_load_avg+0x424/0x48c
LR [c0000000001d7254] enqueue_entity+0x5c/0x5b8
Call Trace:
[c000000003a67760] [c000000003a677d0] 0xc000000003a677d0 (unreliable)
[c000000003a677d0] [c0000000001d7254] enqueue_entity+0x5c/0x5b8
[c000000003a67880] [c0000000001d7918] enqueue_task_fair+0x168/0x7d8
[c000000003a678f0] [c0000000001b9554] enqueue_task+0x5c/0x1c8
[c000000003a67930] [c0000000001c3f40] ttwu_do_activate+0x98/0x2fc
[c000000003a67980] [c0000000001c4460] sched_ttwu_pending+0x2bc/0x72c
[c000000003a67a60] [c0000000002c16ac] __flush_smp_call_function_queue+0x1a0/0x750
[c000000003a67b10] [c00000000005e1c4] smp_ipi_demux_relaxed+0xec/0xf4
[c000000003a67b50] [c000000000057dd4] doorbell_exception+0xe0/0x25c
[c000000003a67b90] [c0000000000383d0] __replay_soft_interrupts+0xf0/0x154
[c000000003a67d40] [c000000000038684] arch_local_irq_restore.part.0+0x1cc/0x214
[c000000003a67d90] [c0000000001b6ec8] finish_task_switch.isra.0+0xb4/0x2f8
[c000000003a67e30] [c00000000110fb9c] __schedule+0x294/0x83c
[c000000003a67ee0] [c0000000011105f0] schedule_idle+0x3c/0x64
[c000000003a67f10] [c0000000001f27f0] do_idle+0x15c/0x1ac
[c000000003a67f60] [c0000000001f2b08] cpu_startup_entry+0x4c/0x50
[c000000003a67f90] [c00000000005ede0] start_secondary+0x284/0x288
[c000000003a67fe0] [c00000000000e058] start_secondary_prolog+0x10/0x14
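Trying to make sense of the warning: rq_pin_lock() clears RQCF_UPDATED, so any
section that pins the rq lock and then reads the clock (enqueue_entity() ->
update_load_avg() here) needs its own update_rq_clock(). Rough sketch of that
pattern below (the function name is made up; IIRC sched_ttwu_pending() already
does essentially this for the regular queued wakeups):

/*
 * Rough sketch, not actual kernel code: refresh the rq clock after
 * pinning the lock and before the enqueue, so clock_update_flags has
 * RQCF_UPDATED set by the time update_load_avg() asserts on it.
 */
static void ttwu_delayed_sketch(struct rq *rq, struct task_struct *p,
                                int wake_flags)
{
        struct rq_flags rf;

        rq_lock_irqsave(rq, &rf);       /* rq_pin_lock() clears RQCF_UPDATED */
        update_rq_clock(rq);            /* sets RQCF_UPDATED again */
        ttwu_do_activate(rq, p, wake_flags, &rf);
        rq_unlock_irqrestore(rq, &rf);
}

If the delayed-task path added in PATCH 12/12 reaches the enqueue without a
fresh update_rq_clock() since the last rq_pin_lock(), that would explain the
warning. Just a guess, though.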
----------------------------------------------------------------
perf stat -a (idle states enabled):
TTWU_QUEUE_DELAYED:
13,612,930 context-switches # 0.000 /sec
912,737 cpu-migrations # 0.000 /sec
1,245 page-faults # 0.000 /sec
449,817,741,085 cycles
137,051,199,092 instructions # 0.30 insn per cycle
25,789,965,217 branches # 0.000 /sec
286,202,628 branch-misses # 1.11% of all branches
NO_TTWU_QUEUE_DELAYED:
24,782,786 context-switches # 0.000 /sec
4,697,384 cpu-migrations # 0.000 /sec
1,250 page-faults # 0.000 /sec
701,934,506,023 cycles
220,728,025,829 instructions # 0.31 insn per cycle
40,271,327,989 branches # 0.000 /sec
474,496,395 branch-misses # 1.18% of all branches
Both cycles and instructions are lower with the feature enabled, roughly in proportion to the lower rps.
-------------------------------------------------------------------
perf stat -a (idle states disabled):
TTWU_QUEUE_DELAYED:
15,402,193 context-switches # 0.000 /sec
1,237,128 cpu-migrations # 0.000 /sec
1,245 page-faults # 0.000 /sec
781,215,992,865 cycles
149,112,303,840 instructions # 0.19 insn per cycle
28,240,010,182 branches # 0.000 /sec
294,485,795 branch-misses # 1.04% of all branches
NO_TTWU_QUEUE_DELAYED:
25,332,898 context-switches # 0.000 /sec
4,756,682 cpu-migrations # 0.000 /sec
1,256 page-faults # 0.000 /sec
781,318,730,494 cycles
220,536,732,094 instructions # 0.28 insn per cycle
40,424,495,545 branches # 0.000 /sec
446,724,952 branch-misses # 1.11% of all branches
Since idle states are disabled, cycles are always being spent on the CPUs, so the cycle counts are nearly
identical, while the instruction counts differ (IPC drops from 0.28 to 0.19 with the feature enabled).
Does that mean that with the feature enabled a lock (maybe the rq lock) is held for too long?
--------------------------------------------------------------------
Will try to gather more data on why this is happening.