linux-kernel - Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS

Message-ID: <c9bdc8611db65d47041ffc485c8da0b7fbd12184.camel@linux.vnet.ibm.com>
Date:   Thu, 07 Dec 2023 11:30:13 +0530
From:   Aboorva Devarajan <aboorvad@...ux.vnet.ibm.com>
To:     David Vernet <void@...ifault.com>
Cc:     peterz@...radead.org, mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
        roman.gushchin@...ux.dev, gautham.shenoy@....com,
        kprateek.nayak@....com, aaron.lu@...el.com,
        wuyun.abel@...edance.com, kernel-team@...a.com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS

On Mon, 2023-11-27 at 13:49 -0600, David Vernet wrote:
> On Mon, Nov 27, 2023 at 01:58:34PM +0530, Aboorva Devarajan wrote:
> > On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:
> > 
> > Hi David,
> > 
> > I have been benchmarking the patch-set on a POWER9 machine to understand
> > its impact. However, I've run into recurring hard lockups in
> > newidle_balance, specifically when the SHARED_RUNQ feature is enabled. It
> > doesn't happen all the time, but it's something worth noting. I wanted
> > to inform you about this, and I can provide more details if needed.
> 
> Hello Aboorva,
> 
> Thank you for testing out this patch set and for the report. One issue
> that v4 will correct is that the shared_runq list could become corrupted
> if you enable and disable the feature, as a stale task could remain in
> the list after the feature has been disabled. I'll be including a fix
> for that in v4, which I'm currently benchmarking, but other stuff keeps
> seeming to preempt it.

Hi David,

Thank you for your response. While testing, I did observe the
shared_runq list becoming corrupted when enabling and disabling the
feature. 

Please find the logs below with CONFIG_DEBUG_LIST enabled:
------------------------------------------

[ 4952.270819] list_add corruption. prev->next should be next (c0000003fae87a80), but was c0000000ba027ec8. (prev=c0000000ba027ec8).
[ 4952.270926] ------------[ cut here ]------------
[ 4952.270935] kernel BUG at lib/list_debug.c:30!
[ 4952.270947] Oops: Exception in kernel mode, sig: 5 [#1]
[ 4952.270956] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 4952.271029] CPU: 10 PID: 31426 Comm: cc1 Kdump: loaded Not tainted 6.5.0-rc2+ #1
[ 4952.271042] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
[ 4952.271054] NIP:  c000000000872f88 LR: c000000000872f84 CTR: 00000000006d1a1c
[ 4952.271070] REGS: c00000006e1b34e0 TRAP: 0700   Not tainted  (6.5.0-rc2+)
[ 4952.271079] MSR:  8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 28048222  XER: 00000006
[ 4952.271102] CFAR: c0000000001ffa24 IRQMASK: 1 
[ 4952.271102] GPR00: c000000000872f84 c00000006e1b3780 c0000000019a3b00 0000000000000075 
[ 4952.271102] GPR04: c0000003faff2c08 c0000003fb077e80 c00000006e1b35c8 00000003f8e70000 
[ 4952.271102] GPR08: 0000000000000027 c000000002185f30 00000003f8e70000 0000000000000001 
[ 4952.271102] GPR12: 0000000000000000 c0000003fffe2c80 c000000068ecb100 0000000000000000 
[ 4952.271102] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271102] GPR20: 0000000000000000 0000000000000000 0000000000000041 c00000006e1b3bb0 
[ 4952.271102] GPR24: c000000002c72058 00000003f8e70000 0000000000000001 c00000000e919948 
[ 4952.271102] GPR28: c0000000ba027ec8 c0000003fae87a80 c000000080ce6c00 c00000000e919980 
[ 4952.271212] NIP [c000000000872f88] __list_add_valid+0xb8/0x100
[ 4952.271236] LR [c000000000872f84] __list_add_valid+0xb4/0x100
[ 4952.271248] Call Trace:
[ 4952.271254] [c00000006e1b3780] [c000000000872f84] __list_add_valid+0xb4/0x100 (unreliable)
[ 4952.271270] [c00000006e1b37e0] [c0000000001b8f50] __enqueue_entity+0x110/0x1c0
[ 4952.271288] [c00000006e1b3830] [c0000000001bec9c] enqueue_entity+0x16c/0x690
[ 4952.271301] [c00000006e1b38e0] [c0000000001bf280] enqueue_task_fair+0xc0/0x490
[ 4952.271315] [c00000006e1b3980] [c0000000001ada0c] ttwu_do_activate+0xac/0x410
[ 4952.271328] [c00000006e1b3a10] [c0000000001ae59c] try_to_wake_up+0x5fc/0x8b0
[ 4952.271341] [c00000006e1b3ae0] [c0000000001df6dc] autoremove_wake_function+0x2c/0xc0
[ 4952.271359] [c00000006e1b3b20] [c0000000001e1018] __wake_up_common+0xc8/0x240
[ 4952.271372] [c00000006e1b3b90] [c0000000001e123c] __wake_up_common_lock+0xac/0x120
[ 4952.271385] [c00000006e1b3c20] [c0000000005bd4a4] pipe_write+0xd4/0x980
[ 4952.271401] [c00000006e1b3d00] [c0000000005ad720] vfs_write+0x350/0x4b0
[ 4952.271420] [c00000006e1b3dc0] [c0000000005adc24] ksys_write+0xf4/0x140
[ 4952.271433] [c00000006e1b3e10] [c000000000031108] system_call_exception+0x128/0x340
[ 4952.271449] [c00000006e1b3e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
[ 4952.271470] --- interrupt: 3000 at 0x7fff8df3aa34
[ 4952.271482] NIP:  00007fff8df3aa34 LR: 0000000000000000 CTR: 0000000000000000
[ 4952.271492] REGS: c00000006e1b3e80 TRAP: 3000   Not tainted  (6.5.0-rc2+)
[ 4952.271502] MSR:  800000000000f033 <SF,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44002822  XER: 00000000
[ 4952.271526] IRQMASK: 0 
[ 4952.271526] GPR00: 0000000000000004 00007fffea094d00 0000000112467a00 0000000000000001 
[ 4952.271526] GPR04: 0000000132c6a810 0000000000002000 00000000000004e4 0000000000000036 
[ 4952.271526] GPR08: 0000000132c6c810 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271526] GPR12: 0000000000000000 00007fff8e71cac0 0000000000000000 0000000000000000 
[ 4952.271526] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271526] GPR20: 00007fffea09c76f 00000001123b6898 0000000000000003 0000000132c6c820 
[ 4952.271526] GPR24: 0000000112469d88 00000001124686b8 0000000132c6a810 0000000000002000 
[ 4952.271526] GPR28: 0000000000002000 00007fff8e0418e0 0000000132c6a810 0000000000002000 
[ 4952.271627] NIP [00007fff8df3aa34] 0x7fff8df3aa34
[ 4952.271635] LR [0000000000000000] 0x0
[ 4952.271642] --- interrupt: 3000
[ 4952.271648] Code: f8010070 4b98ca81 60000000 0fe00000 7c0802a6 3c62ffa6 7d064378 7d244b78 38637f68 f8010070 4b98ca5d 60000000 <0fe00000> 7c0802a6 3c62ffa6 7ca62b78 
[ 4952.271685] ---[ end trace 0000000000000000 ]---
[ 4952.282562] pstore: backend (nvram) writing error (-1)
------------------------------------------
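
For readers unfamiliar with the check that fired above, here is a minimal
userspace sketch of the kind of consistency test CONFIG_DEBUG_LIST performs
before linking a node. It is a model only, not the kernel code: the class
ListHead, the helper names and the scenario in demo_stale_node() are all
hypothetical, and the "node reset while still linked" step is just one
assumed way to reach a state where prev->next points back at prev, as in
the log. It is not a diagnosis of the actual shared_runq bug.

```python
class ListHead:
    """Node of a circular doubly linked list, modelled on struct list_head."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self

    def reinit(self):
        # Reset the node to the empty state WITHOUT unlinking it first.
        self.next = self
        self.prev = self


def list_add_valid(new, prev, nxt):
    """Return an error string if linking `new` between prev and nxt is unsafe."""
    if nxt.prev is not prev:
        return (f"list_add corruption. next->prev should be prev "
                f"({prev.name}), but was {nxt.prev.name}. (next={nxt.name}).")
    if prev.next is not nxt:
        return (f"list_add corruption. prev->next should be next "
                f"({nxt.name}), but was {prev.next.name}. (prev={prev.name}).")
    if new is prev or new is nxt:
        return f"list_add double add: new={new.name}."
    return None


def list_add_tail(new, head):
    """Append `new` before `head`, refusing to touch an inconsistent list."""
    prev, nxt = head.prev, head
    error = list_add_valid(new, prev, nxt)
    if error is not None:
        raise RuntimeError(error)
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new


def demo_stale_node():
    """Assumed scenario: a queued node is reset while the list still holds it."""
    queue, stale, fresh = ListHead("queue"), ListHead("stale"), ListHead("fresh")
    list_add_tail(stale, queue)      # normal enqueue
    stale.reinit()                   # node reset, but queue still points at it
    try:
        list_add_tail(fresh, queue)  # the next enqueue trips the check
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo_stale_node())
```

In this model the second enqueue is rejected because the tail node's next
pointer refers to itself rather than to the list head, which has the same
shape as the "prev->next should be next ..., but was ... (prev=...)" line in
the log above.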

> 
> By any chance, did you run into this when you were enabling / disabling
> the feature? Or did you just enable it once and then hit this issue
> after some time, which would indicate a different issue? I'm trying to
> repro using ab, but haven't been successful thus far. If you're able to
> repro consistently, it might be useful to run with CONFIG_DEBUG_LIST=y.
> 

Additionally, I noticed a sporadic issue that occurred even when the
feature was enabled only once, without toggling it; it surfaced only
after some time. However, it occurred only on one particular system, and
my attempts to reproduce it were unsuccessful. I will provide more
details if I can reproduce it with debugging enabled. It looks like the
primary issue is the shared_runq list getting corrupted when the feature
is toggled on and off repeatedly, as you pointed out.
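
For anyone who wants to exercise the enable/disable path, a small helper
along the following lines could be used. It is a sketch under stated
assumptions: the function names, iteration count and delay are
hypothetical, it relies on the usual sched features debugfs convention of
writing NAME to enable and NO_NAME to disable a feature, it needs root with
debugfs mounted, and SHARED_RUNQ only exists on a kernel carrying the
patch-set. It is not necessarily how the corruption above was triggered.

```python
import time
from pathlib import Path

# Same file that is read with `cat` in the system configuration section.
FEATURES_PATH = Path("/sys/kernel/debug/sched/features")


def set_sched_feature(name, enabled, path=FEATURES_PATH):
    """Write NAME (enable) or NO_NAME (disable) to the features file."""
    token = name if enabled else f"NO_{name}"
    Path(path).write_text(token + "\n")
    return token


def toggle_repeatedly(name, iterations, path=FEATURES_PATH, delay=0.0):
    """Enable then disable `name` `iterations` times; return the tokens written."""
    written = []
    for _ in range(iterations):
        for enabled in (True, False):
            written.append(set_sched_feature(name, enabled, path))
            if delay:
                time.sleep(delay)  # give the workload time to run in each state
    return written


if __name__ == "__main__":
    # Hypothetical values; run a benchmark in parallel while this loops.
    toggle_repeatedly("SHARED_RUNQ", iterations=100, delay=1.0)
```

The path argument is only there so the helper can be pointed at an ordinary
file for a dry run before using it on a test machine.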

I will keep an eye out for v4 and test it once it is available.

Thanks,
Aboorva


> Thanks,
> David
> 
> > -----------------------------------------
> > 
> > Some initial information regarding the hard-lockup:
> > 
> > Base Kernel:
> > -----------
> > 
> > Base kernel is up to commit 88c56cfeaec4 ("sched/fair: Block nohz
> > tick_stop when cfs bandwidth in use").
> > 
> > Patched Kernel:
> > -------------
> > 
> > Base Kernel + v3 (shared runqueue patch-set):
> > https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/
> > 
> > The hard-lockup mostly occurs when running the Apache2 benchmarks with
> > ab (Apache HTTP benchmarking tool) on the patched kernel. However, this
> > problem is not exclusive to the mentioned benchmark and only occurs
> > while the SHARED_RUNQ feature is enabled. Disabling the SHARED_RUNQ
> > feature prevents the occurrence of the lockup.
> > 
> > ab (Apache HTTP benchmarking tool): 
> > https://httpd.apache.org/docs/2.4/programs/ab.html
> > 
> > Hardlockup with Patched Kernel:
> > ------------------------------
> > 
> > [ 3289.727912][  C123] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > [ 3289.727943][  C123] rcu: 	124-...0: (1 GPs behind) idle=f174/1/0x4000000000000000 softirq=12283/12289 fqs=732
> > [ 3289.727976][  C123] rcu: 	(detected by 123, t=2103 jiffies, g=127061, q=5517 ncpus=128)
> > [ 3289.728008][  C123] Sending NMI from CPU 123 to CPUs 124:
> > [ 3295.182378][  C123] CPU 124 didn't respond to backtrace IPI, inspecting paca.
> > [ 3295.182403][  C123] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 15 (ksoftirqd/124)
> > [ 3295.182421][  C123] Back trace of paca->saved_r1 (0xc000000de13e79b0) (possibly stale):
> > [ 3295.182437][  C123] Call Trace:
> > [ 3295.182456][  C123] [c000000de13e79b0] [c000000de13e7a70] 0xc000000de13e7a70 (unreliable)
> > [ 3295.182477][  C123] [c000000de13e7ac0] [0000000000000008] 0x8
> > [ 3295.182500][  C123] [c000000de13e7b70] [c000000de13e7c98] 0xc000000de13e7c98
> > [ 3295.182519][  C123] [c000000de13e7ba0] [c0000000001da8bc] move_queued_task+0x14c/0x280
> > [ 3295.182557][  C123] [c000000de13e7c30] [c0000000001f22d8] newidle_balance+0x648/0x940
> > [ 3295.182602][  C123] [c000000de13e7d30] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> > [ 3295.182647][  C123] [c000000de13e7dd0] [c0000000010f175c] __schedule+0x15c/0x1040
> > [ 3295.182675][  C123] [c000000de13e7ec0] [c0000000010f26b4] schedule+0x74/0x140
> > [ 3295.182694][  C123] [c000000de13e7f30] [c0000000001c4994] smpboot_thread_fn+0x244/0x250
> > [ 3295.182731][  C123] [c000000de13e7f90] [c0000000001bc6e8] kthread+0x138/0x140
> > [ 3295.182769][  C123] [c000000de13e7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> > [ 3295.182806][  C123] rcu: rcu_sched kthread starved for 544 jiffies! g127061 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=66
> > [ 3295.182845][  C123] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> > [ 3295.182878][  C123] rcu: RCU grace-period kthread stack dump:
> > 
> > -----------------------------------------
> > 
> > [ 3943.438625][  C112] watchdog: CPU 112 self-detected hard LOCKUP @ _raw_spin_lock_irqsave+0x4c/0xc0
> > [ 3943.438631][  C112] watchdog: CPU 112 TB:115060212303626, last heartbeat TB:115054309631589 (11528ms ago)
> > [ 3943.438673][  C112] CPU: 112 PID: 2090 Comm: kworker/112:2 Tainted: G        W    L     6.5.0-rc2-00028-g7475adccd76b #51
> > [ 3943.438676][  C112] Hardware name: 8335-GTW POWER9 (raw) 0x4e1203 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
> > [ 3943.438678][  C112] Workqueue:  0x0 (events)
> > [ 3943.438682][  C112] NIP:  c0000000010ff01c LR: c0000000001d1064 CTR: c0000000001e8580
> > [ 3943.438684][  C112] REGS: c000007fffb6bd60 TRAP: 0900   Tainted: G        W    L      (6.5.0-rc2-00028-g7475adccd76b)
> > [ 3943.438686][  C112] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24082222  XER: 00000000
> > [ 3943.438693][  C112] CFAR: 0000000000000000 IRQMASK: 1 
> > [ 3943.438693][  C112] GPR00: c0000000001d1064 c000000e16d1fb20 c0000000014e8200 c000000e092fed3c 
> > [ 3943.438693][  C112] GPR04: c000000e16d1fc58 c000000e092fe3c8 00000000000000e1 fffffffffffe0000 
> > [ 3943.438693][  C112] GPR08: 0000000000000000 00000000000000e1 0000000000000000 c00000000299ccd8 
> > [ 3943.438693][  C112] GPR12: 0000000024088222 c000007ffffb8300 c0000000001bc5b8 c000000deb46f740 
> > [ 3943.438693][  C112] GPR16: 0000000000000008 c000000e092fe280 0000000000000001 c000007ffedd7b00 
> > [ 3943.438693][  C112] GPR20: 0000000000000001 c0000000029a1280 0000000000000000 0000000000000001 
> > [ 3943.438693][  C112] GPR24: 0000000000000000 c000000e092fed3c c000000e16d1fdf0 c00000000299ccd8 
> > [ 3943.438693][  C112] GPR28: c000000e16d1fc58 c0000000021fbf00 c000007ffee6bf00 0000000000000001 
> > [ 3943.438722][  C112] NIP [c0000000010ff01c] _raw_spin_lock_irqsave+0x4c/0xc0
> > [ 3943.438725][  C112] LR [c0000000001d1064] task_rq_lock+0x64/0x1b0
> > [ 3943.438727][  C112] Call Trace:
> > [ 3943.438728][  C112] [c000000e16d1fb20] [c000000e16d1fb60] 0xc000000e16d1fb60 (unreliable)
> > [ 3943.438731][  C112] [c000000e16d1fb50] [c000000e16d1fbf0] 0xc000000e16d1fbf0
> > [ 3943.438733][  C112] [c000000e16d1fbf0] [c0000000001f214c] newidle_balance+0x4bc/0x940
> > [ 3943.438737][  C112] [c000000e16d1fcf0] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> > [ 3943.438739][  C112] [c000000e16d1fd90] [c0000000010f175c] __schedule+0x15c/0x1040
> > [ 3943.438743][  C112] [c000000e16d1fe80] [c0000000010f26b4] schedule+0x74/0x140
> > [ 3943.438747][  C112] [c000000e16d1fef0] [c0000000001afd44] worker_thread+0x134/0x580
> > [ 3943.438749][  C112] [c000000e16d1ff90] [c0000000001bc6e8] kthread+0x138/0x140
> > [ 3943.438753][  C112] [c000000e16d1ffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> > [ 3943.438756][  C112] Code: 63e90001 992d0932 a12d0008 3ce0fffe 5529083c 61290001 7d001
> > 
> > -----------------------------------------
> > 
> > System configuration:
> > --------------------
> > 
> > # lscpu
> > Architecture:                    ppc64le
> > Byte Order:                      Little Endian
> > CPU(s):                          128
> > On-line CPU(s) list:             0-127
> > Thread(s) per core:              4
> > Core(s) per socket:              16
> > Socket(s):                       2
> > NUMA node(s):                    8
> > Model:                           2.3 (pvr 004e 1203)
> > Model name:                      POWER9 (raw), altivec supported
> > Frequency boost:                 enabled
> > CPU max MHz:                     3800.0000
> > CPU min MHz:                     2300.0000
> > L1d cache:                       1 MiB
> > L1i cache:                       1 MiB
> > NUMA node0 CPU(s):               64-127
> > NUMA node8 CPU(s):               0-63
> > NUMA node250 CPU(s):             
> > NUMA node251 CPU(s):             
> > NUMA node252 CPU(s):             
> > NUMA node253 CPU(s):             
> > NUMA node254 CPU(s):             
> > NUMA node255 CPU(s):             
> > 
> > # uname -r
> > 6.5.0-rc2-00028-g7475adccd76b
> > 
> > # cat /sys/kernel/debug/sched/features
> > GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
> > CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK
> > NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK
> > RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE
> > WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD
> > BASE_SLICE HZ_BW SHARED_RUNQ
> > 
> > -----------------------------------------
> > 
> > Please let me know if I've missed anything here. I'll continue
> > investigating and share any additional information I find.
> > 
> > Thanks and Regards,
> > Aboorva
> >