Message-ID: <20230503132051.GB1676736@hirez.programming.kicks-ass.net>
Date: Wed, 3 May 2023 15:20:51 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Ben Segall <bsegall@...gle.com>,
Boqun Feng <boqun.feng@...il.com>,
Crystal Wood <swood@...hat.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
John Stultz <jstultz@...gle.com>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...e.de>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Valentin Schneider <vschneid@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Waiman Long <longman@...hat.com>, Will Deacon <will@...nel.org>
Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose
sched work helpers
On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@...utronix.de>
>
> schedule() invokes sched_submit_work() before scheduling and
> sched_update_worker() afterwards to ensure that queued block requests are
> flushed and the (IO)worker machineries can instantiate new workers if
> required. This avoids deadlocks and starvation.
>
> With rt_mutexes this can lead to a subtle problem:
>
> When a task blocks on an rtmutex, current::pi_blocked_on points to the
> rtmutex it blocks on. When one of the functions in
> sched_submit/resume_work() contends on an rtmutex based lock, that
> would corrupt current::pi_blocked_on.
>
> Make it possible to let rtmutex issue the calls outside of the slowpath,
> i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
>
> - Exposing sched_submit_work() and moving the task_running() condition
> into schedule()
>
> - Renaming sched_update_worker() to sched_resume_work() and exposing it
> too.
>
> - Providing sched_rtmutex() which just does the inner loop of scheduling
> until need_resched() is no longer set. Split out the loop so this does
> not create yet another copy.
>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Urgh, so I really don't like this.
The end result is something like:
  rt_mutex_lock()
    sched_submit_work();
      // a nested rt_mutex_lock() here will not clobber
      // ->pi_blocked_on because it's not set yet.
    task_blocks_on_rt_mutex();
      tsk->pi_blocked_on = waiter;
      rt_mutex_enqueue(lock, waiter); <-- the real problem
    rt_mutex_slowlock_block();
      schedule_rtmutex();
    sched_resume_work();
And all of this is not just because of tsk->pi_blocked_on, but mostly
because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
enqueue thing is what makes the 'simple' solution of saving/restoring
tsk->pi_blocked_on not work.
Basically the pi_blocked_on corruption is a side effect, not the
fundamental issue. One task having two waiters registered is the bigger
issue.
Now, sched_submit_work() could also use a (regular) mutex -- after all
it's a fully preemptible context. And then we're subject to the 'same'
problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
This means that strictly speaking we should litter mutex with the same
thing :/
This all feels fragile to me. Too many things spread out in too many
places. An alternative is something like:
void __sched schedule_pi(void)
{
	struct task_struct *tsk = current;
	void *waiter = tsk->pi_blocked_on;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		if (rt_mutex_blocks(tsk, waiter))
			schedule();
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);
}
And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
thing. However, this is going to be a massive reorg of the rt_mutex code
and I'm not entirely sure the end result will actually be better... it
might just make a mess elsewhere :/
> +void sched_submit_work(void)
>  {
> -	unsigned int task_flags;
> +	struct task_struct *tsk = current;
> +	unsigned int task_flags = tsk->flags;
>  
> -	if (task_is_running(tsk))
> -		return;
> -
> -	task_flags = tsk->flags;
>  	/*
>  	 * If a worker goes to sleep, notify and ask workqueue whether it
>  	 * wants to wake up a task to maintain concurrency.
> @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  		blk_flush_plug(tsk->plug, true);
>  }
>  
> +asmlinkage __visible void __sched schedule(void)
> +{
> +	if (!task_is_running(current))
> +		sched_submit_work();
> +	schedule_loop(SM_NONE);
> +	sched_resume_work();
> +}
>  EXPORT_SYMBOL(schedule);
Pulling out task_is_running() like this is going to get into conflict
with TJ's patches here:
https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org
That makes sched_submit_work() do things even if task_is_running().