Message-ID: <Z5gLmsbE73PYPd-Q@slm.duckdns.org>
Date: Mon, 27 Jan 2025 12:41:30 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5] sched_ext: Fix lock imbalance in
dispatch_to_local_dsq()
On Mon, Jan 27, 2025 at 11:06:16PM +0100, Andrea Righi wrote:
> While performing the rq locking dance in dispatch_to_local_dsq(), we may
> trigger the following lock imbalance condition, in particular when
> multiple tasks are rapidly changing CPU affinity (i.e., running a
> `stress-ng --race-sched 0`):
>
> [ 13.413579] =====================================
> [ 13.413660] WARNING: bad unlock balance detected!
> [ 13.413729] 6.13.0-virtme #15 Not tainted
> [ 13.413792] -------------------------------------
> [ 13.413859] kworker/1:1/80 is trying to release lock (&rq->__lock) at:
> [ 13.413954] [<ffffffff873c6c48>] dispatch_to_local_dsq+0x108/0x1a0
> [ 13.414111] but there are no more locks to release!
> [ 13.414176]
> [ 13.414176] other info that might help us debug this:
> [ 13.414258] 1 lock held by kworker/1:1/80:
> [ 13.414318] #0: ffff8b66feb41698 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
> [ 13.414612]
> [ 13.414612] stack backtrace:
> [ 13.415255] CPU: 1 UID: 0 PID: 80 Comm: kworker/1:1 Not tainted 6.13.0-virtme #15
> [ 13.415505] Workqueue: 0x0 (events)
> [ 13.415567] Sched_ext: dsp_local_on (enabled+all), task: runnable_at=-2ms
> [ 13.415570] Call Trace:
> [ 13.415700] <TASK>
> [ 13.415744] dump_stack_lvl+0x78/0xe0
> [ 13.415806] ? dispatch_to_local_dsq+0x108/0x1a0
> [ 13.415884] print_unlock_imbalance_bug+0x11b/0x130
> [ 13.415965] ? dispatch_to_local_dsq+0x108/0x1a0
> [ 13.416226] lock_release+0x231/0x2c0
> [ 13.416326] _raw_spin_unlock+0x1b/0x40
> [ 13.416422] dispatch_to_local_dsq+0x108/0x1a0
> [ 13.416554] flush_dispatch_buf+0x199/0x1d0
> [ 13.416652] balance_one+0x194/0x370
> [ 13.416751] balance_scx+0x61/0x1e0
> [ 13.416848] prev_balance+0x43/0xb0
> [ 13.416947] __pick_next_task+0x6b/0x1b0
> [ 13.417052] __schedule+0x20d/0x1740
>
> This happens because dispatch_to_local_dsq() is racing with
> dispatch_dequeue() and, when the latter wins, we incorrectly assume that
> the task has been moved to dst_rq.
>
> Fix by properly tracking the currently locked rq.
>
> Fixes: 4d3ca89bdd31 ("sched_ext: Refactor consume_remote_task()")
> Signed-off-by: Andrea Righi <arighi@...dia.com>
Applied to sched_ext/for-6.14-fixes.
Thanks.
--
tejun