Message-ID: <02811bd7-b401-4e16-bb7d-4edeb0b89ffd@arm.com>
Date: Thu, 18 Sep 2025 16:15:45 +0100
From: Christian Loehle <christian.loehle@....com>
To: Peter Zijlstra <peterz@...radead.org>, tj@...nel.org
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
longman@...hat.com, hannes@...xchg.org, mkoutny@...e.com,
void@...ifault.com, arighi@...dia.com, changwoo@...lia.com,
cgroups@...r.kernel.org, sched-ext@...ts.linux.dev, liuwenfang@...or.com,
tglx@...utronix.de
Subject: Re: [PATCH 00/14] sched: Support shared runqueue locking
On 9/10/25 16:44, Peter Zijlstra wrote:
> Hi,
>
> As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
>
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
>
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally it should be possible to fix that RT issue,
> and the dl_server issue without further propagating lock breaks.
>
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
>
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
>
> Patches are also available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
>
>
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net
>
>
> ---
> include/linux/cleanup.h | 5 +
> include/linux/sched.h | 6 +-
> kernel/cgroup/cpuset.c | 2 +-
> kernel/kthread.c | 15 +-
> kernel/sched/core.c | 370 +++++++++++++++++++++--------------------------
> kernel/sched/deadline.c | 26 ++--
> kernel/sched/ext.c | 104 +++++++------
> kernel/sched/fair.c | 23 ++-
> kernel/sched/idle.c | 14 +-
> kernel/sched/rt.c | 13 +-
> kernel/sched/sched.h | 225 ++++++++++++++++++++--------
> kernel/sched/stats.h | 2 +-
> kernel/sched/stop_task.c | 14 +-
> kernel/sched/syscalls.c | 80 ++++------
> 14 files changed, 495 insertions(+), 404 deletions(-)
>
>
Hi Peter,

A couple of issues popped up when testing this [1] that don't trigger on [2]:
When booting (arm64, Radxa Orion O6) I get:
[ 1.298020] sched: DL replenish lagged too much
[ 1.298364] ------------[ cut here ]------------
[ 1.298377] WARNING: CPU: 4 PID: 0 at kernel/sched/deadline.c:239 inactive_task_timer+0x3d0/0x474
[ 1.298413] Modules linked in:
[ 1.298436] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G S 6.17.0-rc4-cix-build+ #56 PREEMPT
[ 1.298455] Tainted: [S]=CPU_OUT_OF_SPEC
[ 1.298463] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
[ 1.298473] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1.298486] pc : inactive_task_timer+0x3d0/0x474
[ 1.298505] lr : inactive_task_timer+0x394/0x474
[ 1.298522] sp : ffff800083d4be00
[ 1.298530] x29: ffff800083d4be20 x28: ffff00008362d888 x27: ffff800082ab1f88
[ 1.298561] x26: ffff800082ab4a98 x25: ffff0001fef50c18 x24: 0000000000019999
[ 1.298589] x23: 000000000000cccc x22: ffff0001fef51708 x21: ffff00008362d640
[ 1.298616] x20: ffff0001fef50c00 x19: ffff00008362d7f0 x18: fffffffffff0b580
[ 1.298642] x17: ffff80017c966000 x16: ffff800083d48000 x15: 0000000000000028
[ 1.298669] x14: 0000000000000000 x13: 00000000000c4000 x12: 00000000000000c5
[ 1.298695] x11: 0000000000004bb8 x10: 0000000000004bb8 x9 : 0000000000000000
[ 1.298722] x8 : 0000000000000000 x7 : 0000000000000011 x6 : ffff0001fef51bc0
[ 1.298747] x5 : ffff0001fef50c00 x4 : 00000000000000cc x3 : 0000000000000000
[ 1.298773] x2 : ffff80017c966000 x1 : 0000000000000000 x0 : ffffffffffff3333
[ 1.298800] Call trace:
[ 1.298808] inactive_task_timer+0x3d0/0x474 (P)
[ 1.298830] __hrtimer_run_queues+0x3c4/0x440
[ 1.298852] hrtimer_interrupt+0xe4/0x244
[ 1.298871] arch_timer_handler_phys+0x2c/0x44
[ 1.298893] handle_percpu_devid_irq+0xa8/0x1f0
[ 1.298916] handle_irq_desc+0x40/0x58
[ 1.298933] generic_handle_domain_irq+0x1c/0x28
[ 1.298950] gic_handle_irq+0x4c/0x11c
[ 1.298965] call_on_irq_stack+0x30/0x48
[ 1.298982] do_interrupt_handler+0x80/0x84
[ 1.299001] el1_interrupt+0x34/0x64
[ 1.299022] el1h_64_irq_handler+0x18/0x24
[ 1.299037] el1h_64_irq+0x6c/0x70
[ 1.299052] finish_task_switch.isra.0+0xac/0x2bc (P)
[ 1.299070] __schedule+0x45c/0xffc
[ 1.299088] schedule_idle+0x28/0x48
[ 1.299104] do_idle+0x184/0x27c
[ 1.299121] cpu_startup_entry+0x34/0x3c
[ 1.299137] secondary_start_kernel+0x134/0x154
[ 1.299158] __secondary_switched+0xc0/0xc4
[ 1.299179] irq event stamp: 1634
[ 1.299189] hardirqs last enabled at (1633): [<ffff800081486354>] el1_interrupt+0x54/0x64
[ 1.299210] hardirqs last disabled at (1634): [<ffff800081486324>] el1_interrupt+0x24/0x64
[ 1.299229] softirqs last enabled at (1614): [<ffff8000800bf7b0>] handle_softirqs+0x4a0/0x4b8
[ 1.299248] softirqs last disabled at (1609): [<ffff800080010600>] __do_softirq+0x14/0x20
[ 1.299262] ---[ end trace 0000000000000000 ]---
and when running actual tests (e.g. iterating through all scx schedulers under load):
[ 146.532691] ================================
[ 146.536947] WARNING: inconsistent lock state
[ 146.541204] 6.17.0-rc4-cix-build+ #58 Tainted: G S W
[ 146.547457] --------------------------------
[ 146.551713] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 146.557705] rcu_tasks_trace/79 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 146.563438] ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[ 146.571179] {IN-HARDIRQ-W} state was registered at:
[ 146.576042] lock_acquire+0x1c8/0x338
[ 146.579788] _raw_spin_lock+0x48/0x60
[ 146.583536] dispatch_enqueue+0x130/0x3e8
[ 146.587632] do_enqueue_task+0x2f0/0x464
[ 146.591629] enqueue_task_scx+0x1b0/0x290
[ 146.595712] enqueue_task+0x84/0x18c
[ 146.599360] ttwu_do_activate+0x84/0x25c
[ 146.603361] try_to_wake_up+0x310/0x5f8
[ 146.607271] wake_up_process+0x18/0x24
[ 146.611094] kick_pool+0x9c/0x17c
[ 146.614483] __queue_work+0x544/0x7a8
[ 146.618223] __queue_delayed_work+0x118/0x15c
[ 146.622653] mod_delayed_work_on+0xcc/0xe0
[ 146.626823] kblockd_mod_delayed_work_on+0x20/0x30
[ 146.631696] blk_mq_kick_requeue_list+0x1c/0x28
[ 146.636307] blk_flush_complete_seq+0xd4/0x2a4
[ 146.640824] flush_end_io+0x1e0/0x3f4
[ 146.644559] blk_mq_end_request+0x60/0x154
[ 146.648733] nvme_end_req+0x30/0x78
[ 146.652306] nvme_complete_rq+0x7c/0x218
[ 146.656302] nvme_pci_complete_rq+0x98/0x110
[ 146.660650] nvme_poll_cq+0x1cc/0x3b4
[ 146.664385] nvme_irq+0x34/0x88
[ 146.667600] __handle_irq_event_percpu+0x88/0x304
[ 146.672384] handle_irq_event+0x4c/0xa8
[ 146.676293] handle_fasteoi_irq+0x108/0x20c
[ 146.680555] handle_irq_desc+0x40/0x58
[ 146.684378] generic_handle_domain_irq+0x1c/0x28
[ 146.689068] gic_handle_irq+0x4c/0x11c
[ 146.692891] call_on_irq_stack+0x30/0x48
[ 146.696891] do_interrupt_handler+0x80/0x84
[ 146.701151] el1_interrupt+0x34/0x64
[ 146.704810] el1h_64_irq_handler+0x18/0x24
[ 146.708979] el1h_64_irq+0x6c/0x70
[ 146.712453] cpuidle_enter_state+0x12c/0x53c
[ 146.716796] cpuidle_enter+0x38/0x50
[ 146.720458] do_idle+0x204/0x27c
[ 146.723759] cpu_startup_entry+0x38/0x3c
[ 146.727755] secondary_start_kernel+0x134/0x154
[ 146.732370] __secondary_switched+0xc0/0xc4
[ 146.736638] irq event stamp: 1754
[ 146.739938] hardirqs last enabled at (1753): [<ffff800081497184>] _raw_spin_unlock_irqrestore+0x6c/0x70
[ 146.749405] hardirqs last disabled at (1754): [<ffff8000814965e4>] _raw_spin_lock_irqsave+0x84/0x88
[ 146.758437] softirqs last enabled at (1664): [<ffff800080195598>] rcu_tasks_invoke_cbs+0x100/0x394
[ 146.767476] softirqs last disabled at (1660): [<ffff800080195598>] rcu_tasks_invoke_cbs+0x100/0x394
[ 146.776506]
[ 146.776506] other info that might help us debug this:
[ 146.783019] Possible unsafe locking scenario:
[ 146.783019]
[ 146.788923] CPU0
[ 146.791356] ----
[ 146.793788] lock(&dsq->lock);
[ 146.796915] <Interrupt>
[ 146.799521] lock(&dsq->lock);
[ 146.802821]
[ 146.802821] *** DEADLOCK ***
[ 146.802821]
[ 146.808725] 3 locks held by rcu_tasks_trace/79:
[ 146.813242] #0: ffff800082e6e650 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x328/0x570
[ 146.823403] #1: ffff800082adc1f0 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x10/0x1c
[ 146.832014] #2: ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[ 146.840178]
[ 146.840178] stack backtrace:
[ 146.844521] CPU: 10 UID: 0 PID: 79 Comm: rcu_tasks_trace Tainted: G S W 6.17.0-rc4-cix-build+ #58 PREEMPT
[ 146.855463] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 146.860240] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
[ 146.872136] Sched_ext: simple (enabled+all), task: runnable_at=-4ms
[ 146.872138] Call trace:
[ 146.880822] show_stack+0x18/0x24 (C)
[ 146.884471] dump_stack_lvl+0x90/0xd0
[ 146.888131] dump_stack+0x18/0x24
[ 146.891432] print_usage_bug.part.0+0x29c/0x364
[ 146.895950] mark_lock+0x778/0x978
[ 146.899338] mark_held_locks+0x58/0x90
[ 146.903074] lockdep_hardirqs_on_prepare+0x100/0x210
[ 146.908025] trace_hardirqs_on+0x5c/0x1cc
[ 146.912025] _raw_spin_unlock_irqrestore+0x6c/0x70
[ 146.916803] task_call_func+0x110/0x164
[ 146.920625] trc_wait_for_one_reader.part.0+0x5c/0x3b8
[ 146.925750] check_all_holdout_tasks_trace+0x124/0x480
[ 146.930874] rcu_tasks_wait_gp+0x1f0/0x3b4
[ 146.934957] rcu_tasks_one_gp+0x4a4/0x570
[ 146.938953] rcu_tasks_kthread+0xd4/0xe0
[ 146.942862] kthread+0x148/0x208
[ 146.946079] ret_from_fork+0x10/0x20
(FWIW, this actually locks up the system without any further output.)
I'll keep testing and start debugging now, but if there's anything I can help with
immediately, please do shout.
[1]
This refers to sched/cleanup at the time of writing:
e127838bf8f9 sched: Cleanup NOCLOCK
ce024feefe1c sched/ext: Implement p->srq_lock support
6ef342071dd7 sched: Add {DE,EN}QUEUE_LOCKED
ed738ce6f9fb sched: Add shared runqueue locking to __task_rq_lock()
94f197f28834 sched: Add flags to {put_prev,set_next}_task() methods
254d43c94105 sched: Add locking comments to sched_class methods
f8864b505a17 sched: Make __do_set_cpus_allowed() use the sched_change pattern
d0e9cfb835d3 sched: Rename do_set_cpus_allowed()
cfcabf45249d sched: Fix do_set_cpus_allowed() locking
f7b9b39041fb sched: Fix migrate_disable_switch() locking
91128b33456a sched: Move sched_class::prio_changed() into the change pattern
c59dc6ce071b sched: Cleanup sched_delayed handling for class switches
13ea43940095 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
f0b336327a1b sched: Re-arrange the {EN,DE}QUEUE flags
b55442cb4ec1 sched: Employ sched_change guards
[2]
5b726e9bf954 sched/fair: Get rid of throttled_lb_pair()