Message-ID: <20230503132051.GB1676736@hirez.programming.kicks-ass.net>
Date: Wed, 3 May 2023 15:20:51 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Ben Segall <bsegall@...gle.com>,
Boqun Feng <boqun.feng@...il.com>,
Crystal Wood <swood@...hat.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
John Stultz <jstultz@...gle.com>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...e.de>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Valentin Schneider <vschneid@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Waiman Long <longman@...hat.com>, Will Deacon <will@...nel.org>
Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose
sched work helpers
On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@...utronix.de>
>
> schedule() invokes sched_submit_work() before scheduling and
> sched_update_worker() afterwards to ensure that queued block requests are
> flushed and the (IO)worker machineries can instantiate new workers if
> required. This avoids deadlocks and starvation.
>
> With rt_mutexes this can lead to a subtle problem:
>
> When a task blocks on an rtmutex, current::pi_blocked_on points to the
> rtmutex it blocks on. When one of the functions in
> sched_submit/resume_work() contends on an rtmutex based lock, that
> would corrupt current::pi_blocked_on.
>
> Make it possible to let rtmutex issue the calls outside of the slowpath,
> i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
>
> - Exposing sched_submit_work() and moving the task_running() condition
> into schedule()
>
> - Renaming sched_update_worker() to sched_resume_work() and exposing it
> too.
>
> - Providing sched_rtmutex() which just does the inner loop of scheduling
> until need_resched() is no longer set. Split out the loop so this does
> not create yet another copy.
>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Urgh, so I really don't like this.
The end result is something like:
  rt_mutex_lock()
    sched_submit_work();
      // a nested rt_mutex_lock() here will not clobber
      // ->pi_blocked_on because it's not set yet.
    task_blocks_on_rt_mutex();
      tsk->pi_blocked_on = waiter;
      rt_mutex_enqueue(lock, waiter); <-- the real problem
    rt_mutex_slowlock_block();
      schedule_rtmutex();
    sched_resume_work();
And all of this is not just because of tsk->pi_blocked_on, but mostly
because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
enqueue thing is what makes the 'simple' solution of saving/restoring
tsk->pi_blocked_on not work.
Basically the pi_blocked_on corruption is a side effect, not the
fundamental issue. One task having two waiters registered is the bigger
issue.
Now, sched_submit_work() could also use a (regular) mutex -- after all
it's a fully preemptible context. And then we're subject to the 'same'
problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
This means that strictly speaking we should litter mutex with the same
thing :/
This all feels fragile to me. Too many things spread out in too many
places. An alternative is something like:
void __sched schedule_pi(void)
{
	struct task_struct *tsk = current;
	void *waiter = tsk->pi_blocked_on;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		if (rt_mutex_blocks(tsk, waiter))
			schedule();
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);
}
And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
thing. However, this is going to be a massive reorg of the rt_mutex code
and I'm not entirely sure the end result will actually be better... it
might just make a mess elsewhere :/
> +void sched_submit_work(void)
>  {
> -	unsigned int task_flags;
> +	struct task_struct *tsk = current;
> +	unsigned int task_flags = tsk->flags;
>  
> -	if (task_is_running(tsk))
> -		return;
> -
> -	task_flags = tsk->flags;
>  	/*
>  	 * If a worker goes to sleep, notify and ask workqueue whether it
>  	 * wants to wake up a task to maintain concurrency.
> @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  		blk_flush_plug(tsk->plug, true);
>  }
>  
> +asmlinkage __visible void __sched schedule(void)
> +{
> +	if (!task_is_running(current))
> +		sched_submit_work();
> +	schedule_loop(SM_NONE);
> +	sched_resume_work();
> +}
>  EXPORT_SYMBOL(schedule);
Pulling out task_is_running() like this is going to get into conflict
with TJ's patches here:
https://lkml.kernel.org/r/20230418205159.724789-1-tj@kernel.org
That makes sched_submit_work() do things even if task_is_running().