Message-ID: <20241216165419.GE35539@noisy.programming.kicks-ass.net>
Date: Mon, 16 Dec 2024 17:54:19 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: John Stultz <jstultz@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelaf@...gle.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>,
Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>,
Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Daniel Lezcano <daniel.lezcano@...aro.org>, kernel-team@...roid.com,
Connor O'Brien <connoro@...gle.com>
Subject: Re: [RFC][PATCH v14 2/7] locking/mutex: Rework
task_struct::blocked_on
On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> >
> > > Also add a blocked_on_state value so we can distinguish when a
> > > task is blocked_on a mutex, but is either blocked, waking up, or
> > > runnable (such that it can try to acquire the lock it's blocked
> > > on).
> > >
> > > This avoids some of the subtle & racy games where the blocked_on
> > > state gets cleared, only to have it re-added by the
> > > mutex_lock_slowpath call when it tries to acquire the lock on
> > > wakeup.
> >
> > If you can remember those subtle cases, I'm sure our future selves
> > would've loved it if you wrote a comment to go with these states :-)
>
> Thanks so much for the review feedback! I really appreciate it!
>
> So yes, the description can use improvement here. I at one time had
> 3-4 separate very fine grained patches (see the top 4 patches here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> that I rolled into one when sending out (mostly to avoid overwhelming
> folks), but the squished commit description isn't as clear.
> So if it's helpful I can split this back out?
>
> I'll also add some better comments as well.
Not sure yet about splitting back out -- let me try and figure out what
all is actually done / needed.
So blocked_lock started out as another lock around ttwu(), in order to
serialize the task wakeup vs reading a remote ->blocked_on relation.
Since we do this with rq->lock held, it can't be ->pi_lock, and hence
->blocked_lock was born.
Later patches appear to have moved it into mutex, mirroring the
->wait_lock -- this is probably better.
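IIRC the read side looked something like this (sketch from memory, not
your actual code, helper names approximate):

	/*
	 * Scheduler side: rq->lock is already held, so taking ->pi_lock
	 * here is not an option; ->blocked_lock serializes against ttwu()
	 * and the mutex code modifying ->blocked_on.
	 */
	lockdep_assert_rq_held(rq);

	raw_spin_lock(&p->blocked_lock);
	mutex = p->blocked_on;		/* stable while we hold blocked_lock */
	owner = mutex ? __mutex_owner(mutex) : NULL;
	raw_spin_unlock(&p->blocked_lock);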
/me goes chase that state thing for a bit..
So, this all seems to have started with:
https://github.com/johnstultz-work/linux-dev/commit/c122552e07b75bb39d0297431799801de30f147f
> > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > > goto err_early_kill;
> > > }
> > >
> > > + set_task_blocked_on(current, lock);
> > > set_current_state(state);
> >
> > blocked_on_state mirrors task-state
>
> Yea. I know I always move a little fast in my talks, but at OSPM and
> LPC I've raised the point that the blocked_on_state very much
> seems like it aliases the task->__state.
> (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
>
> I've not quite worked out how to integrate it though.
>
> My initial introduction of the blocked_on_state was mostly to help
> detangle the code, as there were a number of cases originally where in
> order to let the task be selected to try to acquire the mutex, we
> cleared the task->blocked_on pointer. But this often ran into races
> with ttwu, as if it cleared the blocked_on pointer, the task could get
> selected to run before the return migration happened. So having the
> tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
> enforced the rules properly so we didn't run on the wrong cpu.
Right, so we already have a TASK_WAKING state, which is that
intermediate. But let me try and get a feel for how things relate:
TASK_WAKING is after we've determined it is eligible for wakeup, but
before it is enqueued -- e.g. it sits on the wakelist.
This BO_WAKING is not quite that. It happens before doing the actual
wakeup.
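For reference, the tri-state I have in my head; BO_WAKING is from your
diff below, the other names and the annotations are my reading of the
series, so correct me if I'm off:

	enum blocked_on_state {
		BO_RUNNABLE,	/* not blocked, or allowed to (re)try the lock */
		BO_BLOCKED,	/* blocked on ->blocked_on; only runs as a
				 * donor, boosting the lock owner */
		BO_WAKING,	/* wakeup issued / lock handed over, but the
				 * return migration hasn't happened yet, so it
				 * must not run where it currently sits */
	};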
Also, when I pull your proxy-exec-v7-6.7-rc6-fine-grained and put that
next to your slides, I'm a little confused.
Specifically, slide 27 talks about a modification to try_to_wake_up() in
order to force the task into ttwu_runnable() such that it then hits
proxy_needs_return(). This latter part I can find, but not the former.
/me puzzles
Hmm, since a blocked task is on the runqueue and all that, you should
always walk that path; anything else would be buggy.
So fundamentally the problem is that you want to do a reverse migration
-- it needs to go back to a CPU where it is allowed to run. The way it
does this is to dequeue itself and let ttwu continue and clean up the
mess.
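Something with this shape on the ttwu_runnable() side, I imagine (not
your actual proxy_needs_return(), just the idea as I understand it):

	static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
	{
		if (!sched_proxy_exec())
			return false;

		if (p->blocked_on_state != BO_WAKING)
			return false;

		/*
		 * The task only ever ran here as a donor; dequeue it so the
		 * rest of ttwu() can place it on a CPU it is actually
		 * allowed to run on.
		 */
		if (!is_cpu_allowed(p, cpu_of(rq))) {
			deactivate_task(rq, p, DEQUEUE_SLEEP);
			return true;
		}

		return false;
	}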
This all makes sense, but why can't you do this on the waking side? That
is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
deactivate_task() thing right there, before issuing the wakeup.
Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():
- probably doesn't need current->blocked_lock, current isn't blocked
- probably doesn't need lock->wait_lock, donor is stable under
preempt_disable().
Something like the completely untested below (also, ttwu path needs
surgery now):
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2711af8c0052..47cfa01cd066 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
}
preempt_disable();
+ next = proxy_handoff(lock);
+
raw_spin_lock_irqsave(&lock->wait_lock, flags);
debug_mutex_unlock(lock);
- if (sched_proxy_exec()) {
- raw_spin_lock(&current->blocked_lock);
- /*
- * If we have a task boosting current, and that task was boosting
- * current through this lock, hand the lock to that task, as that
- * is the highest waiter, as selected by the scheduling function.
- */
- donor = current->blocked_donor;
- if (donor) {
- struct mutex *next_lock;
-
- raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
- next_lock = get_task_blocked_on(donor);
- if (next_lock == lock) {
- next = donor;
- donor->blocked_on_state = BO_WAKING;
- wake_q_add(&wake_q, donor);
- current->blocked_donor = NULL;
- }
- raw_spin_unlock(&donor->blocked_lock);
- }
- }
-
/*
* Failing that, pick any on the wait list.
*/
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6eaffa913495..30d7371bb5c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
}
#endif /* CONFIG_SCHED_PROXY_EXEC */
+struct task_struct *proxy_handoff(struct mutex *lock)
+{
+ struct task_struct *next;
+
+ if (!sched_proxy_exec())
+ return NULL;
+
+ /*
+ * current->blocked_donor can't change while we can't schedule;
+ * the caller must disable preemption, since it needs stability of the
+ * return value
+ */
+ lockdep_assert_preemption_disabled();
+ next = current->blocked_donor;
+ if (!next)
+ return NULL;
+
+ scoped_guard (task_rq_lock, next) {
+ /*
+ * current->blocked_donor had better be on the same CPU as current
+ */
+ WARN_ON_ONCE(scope.rq != this_rq());
+
+ scoped_guard (raw_spinlock, &next->blocked_lock) {
+ /*
+ * WARN_ON on this? How can this happen?
+ */
+ if (next->blocked_on != lock)
+ return NULL;
+ }
+
+ /*
+ * blocked_on relation is stable, since we hold both
+ * next->pi_lock and its rq->lock
+ *
+ * OK -- we have a donor, it is blocked on the lock we're about
+ * to release and it cannot run on this CPU -- fixes are
+ * required.
+ *
+ * Dequeue the task, such that ttwu() can fix up the placement thing.
+ */
+ if (!is_cpu_allowed(next, cpu_of(scope.rq)))
+ deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
+ }
+
+ return next;
+}
+
#ifdef CONFIG_SMP
static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
{