Message-ID: <CANDhNCpTfZFOkUkB4f4iQwXA3wnsDuUA_1ZLuseGYunnpgO9Rw@mail.gmail.com>
Date: Mon, 16 Dec 2024 21:01:24 -0800
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelaf@...gle.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, kernel-team@...roid.com,
"Connor O'Brien" <connoro@...gle.com>
Subject: Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
On Mon, Dec 16, 2024 at 8:54 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> > On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@...radead.org> wrote:
> > > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> > So yes, the description can use improvement here. I at one time had
> > 3-4 separate very fine-grained patches (see the top 4 patches here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> > that I rolled into one when sending out (mostly to avoid overwhelming
> > folks), but the squashed commit description isn't as clear.
> > So if it's helpful I can split this back out?
> >
> > I'll also add some better comments as well.
>
> Not sure yet about splitting back out -- let me try and figure out what
> all is actually done / needed.
>
> So blocked_lock started out as another lock around ttwu(), in order to
> serialize the task wakeup vs reading a remote ->blocked_on relation.
I think of it primarily as serializing the task->blocked* state (there
gets to be quite a bit of it by the end of the proxy series).
> Since we do this with rq->lock held, it can't be ->pi_lock, and hence
> ->blocked_lock was born.
Yeah, we needed to use something other than task->pi_lock to
serialize it, as it has to nest under the mutex->wait_lock.
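Roughly the shape of it (a from-memory sketch, not the exact helper
from the series):

	static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
	{
		/* ->blocked_lock nests under the mutex's wait_lock */
		lockdep_assert_held(&m->wait_lock);

		raw_spin_lock(&p->blocked_lock);
		p->blocked_on = m;	/* the blocked* state this lock serializes */
		raw_spin_unlock(&p->blocked_lock);
	}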
> Later patches appear to have moved it into mutex, mirroring the
> ->wait_lock -- this is probably better.
>
> /me goes chase that state thing for a bit..
? I'm not sure I followed this. The blocked_lock continues to
serialize the task->blocked* state through the patch series.
> > > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > > > goto err_early_kill;
> > > > }
> > > >
> > > > + set_task_blocked_on(current, lock);
> > > > set_current_state(state);
> > >
> > > blocked_on_state mirrors task-state
> >
> > Yeah. I know I always move a little fast in my talks, but at OSPM and
> > LPC I've raised the point that the blocked_on_state very much seems
> > to alias the task->__state.
> > (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
> >
> > I've not quite worked out how to integrate it though.
> >
> > My initial introduction of the blocked_on_state was mostly to help
> > untangle the code: originally there were a number of cases where, in
> > order to let the task be selected to try to acquire the mutex, we
> > cleared the task->blocked_on pointer. But this often raced with ttwu,
> > since once the blocked_on pointer was cleared, the task could get
> > selected to run before the return migration happened. So making the
> > tri-state BLOCKED->WAKING->RUNNABLE explicit helped ensure we
> > enforced the rules properly and didn't run on the wrong cpu.
>
> Right, so we already have a TASK_WAKING state, that is that
> intermediate. But let me try and get a feel for how things relate;
> TASK_WAKING is after we've determined it is eligible for wakeup, but
> before it is enqueued -- e.g. it sits on the wakelist.
>
> This BO_WAKING is not quite that. It happens before doing the actual
> wakeup.
Right. So BO_WAKING is set from the point that the mutex unlock (or
the ww_mutex wound/die code) tries to wake up a task. The task may have
been proxy-migrated, so we need to migrate it back before it runs.
So it's sort of a guard: it ensures __schedule() doesn't just run the
task (as it could if it were BO_RUNNABLE), but also that we don't try
to proxy for it (as we would if it were BO_BLOCKED), since it first
needs to be migrated back.
My suspicion is that reframing BO_WAKING as TASK_NEEDS_RETURN might be
doable, but I couldn't be sure we wouldn't hit an edge case where
TASK_RUNNING gets written over everything - effectively clearing the
guard.
Maybe the NEEDS_RETURN could just be a side state similar to the
blocked_on_state, and then BO_BLOCKED vs BO_RUNNABLE would just be
distinguishable by the !!task->blocked_on value.
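For reference, roughly the shape of the tri-state and how the proxy
selection treats it (simplified from memory, helper names approximate,
not verbatim from the series):

	enum blocked_on_state {
		BO_RUNNABLE,	/* free to run / retry the mutex acquire */
		BO_BLOCKED,	/* blocked on ->blocked_on; ok to proxy for the owner */
		BO_WAKING,	/* mid-wakeup; needs return migration first */
	};

	/* simplified find_proxy_task()-style decision: */
	if (p->blocked_on_state == BO_WAKING)
		return NULL;	/* neither run nor proxy; wait for the return migration */
	if (p->blocked_on_state == BO_BLOCKED)
		return __mutex_owner(p->blocked_on);	/* run the owner on p's behalf */
	return p;		/* BO_RUNNABLE: safe to run directly */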
> Also, when I pull your proxy-exec-v7-6.7-rc6-fine-grained and put that
> next to your slides, I'm a little confused.
So, apologies for the confusion. The link I sent you for the
fine-grained changes is a bit old (v7), since I stopped maintaining
the fine-grained patches around v11 (and the most recently shared
version is v14).
I just shared the v7 link as an example to see if it would be helpful
to split the patches here back out in a similar fashion (and I don't
think I published later versions of the fine-grained series).
> Specifically, slide 27 talks about a modification to try_to_wake_up() in
> order to force the task into ttwu_runnable() such that it then hits
> proxy_needs_return(). This latter part I can find, but not the former.
>
> /me puzzles
So the slides actually have links to the code at that time, which
should be the v12 series:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v12-6.11-rc5
And the try_to_wake_up() change from slide 27 is here:
https://github.com/johnstultz-work/linux-dev/commit/cc828a6bac87dcd5734902021973659fe52ef7e6#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4158
(and here's the link to the v14 version of the same:
https://github.com/johnstultz-work/linux-dev/commit/602a4197c83b83cff08c7adda31324739c7938c7#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4314
)
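In rough terms, that change makes the ttwu path do something like the
below (a from-memory sketch to illustrate the idea, not the actual code
at those links):

	/* in ttwu_runnable(), with the task's rq lock held: */
	if (proxy_needs_return(rq, p)) {
		/*
		 * The task was left enqueued while blocked and may have
		 * been proxy-migrated to a CPU it shouldn't run on.
		 * Dequeue it here and let the rest of try_to_wake_up()
		 * select an allowed CPU and re-enqueue it there.
		 */
		deactivate_task(rq, p, DEQUEUE_SLEEP);
		return 0;	/* not handled here; continue the normal wakeup */
	}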
> Hmm, since a blocked task is on the runqueue and all that, you should
> always walk that path, anything else would be buggy.
>
> So fundamentally the problem is that you want to do a reverse migration
> -- it needs to go back to a CPU where it is allowed to run. The way it
> does this is to dequeue itself and let ttwu continue and clean up the
> mess.
Yep.
> This all makes sense, but why can't you do this on the waking side? That
> is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
> deactivate_task() thing right there, before issuing the wakeup.
>
> Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():
>
> - probably doesn't need current->blocked_lock, current isn't blocked
>
> - probably doesn't need lock->wait_lock, donor is stable under
> preempt_disable().
>
>
> Something like the completely untested below (also, ttwu path needs
> surgery now):
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 2711af8c0052..47cfa01cd066 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> }
>
> preempt_disable();
> + next = proxy_handoff(lock);
> +
> raw_spin_lock_irqsave(&lock->wait_lock, flags);
> debug_mutex_unlock(lock);
>
> - if (sched_proxy_exec()) {
> - raw_spin_lock(&current->blocked_lock);
> - /*
> - * If we have a task boosting current, and that task was boosting
> - * current through this lock, hand the lock to that task, as that
> - * is the highest waiter, as selected by the scheduling function.
> - */
> - donor = current->blocked_donor;
> - if (donor) {
> - struct mutex *next_lock;
> -
> - raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
> - next_lock = get_task_blocked_on(donor);
> - if (next_lock == lock) {
> - next = donor;
> - donor->blocked_on_state = BO_WAKING;
> - wake_q_add(&wake_q, donor);
> - current->blocked_donor = NULL;
> - }
> - raw_spin_unlock(&donor->blocked_lock);
> - }
> - }
> -
> /*
> * Failing that, pick any on the wait list.
> */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6eaffa913495..30d7371bb5c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
> }
> #endif /* CONFIG_SCHED_PROXY_EXEC */
>
> +struct task_struct *proxy_handoff(struct mutex *lock)
> +{
> + struct task_struct *next;
> +
> + if (!sched_proxy_exec())
> + return NULL;
> +
> + /*
> + * current->blocked_donor can't change if we can't schedule
> + * caller needs to do this, since it needs stability of the return value
> + */
> + lockdep_assert_preemption_disabled();
> + next = current->blocked_donor;
> + if (!next)
> + return NULL;
> +
> + scoped_guard (task_rq_lock, next) {
> + /*
> + * current->blocked_donor had better be on the same CPU as current
> + */
> + WARN_ON_ONCE(scope.rq != this_rq());
> +
> + scoped_guard (raw_spin_lock, next->blocked_lock) {
> + /*
> + * WARN_ON on this? How can this happen?
> + */
> + if (next->blocked_on != lock)
> + return NULL;
> + }
> +
> + /*
> + * blocked_on relation is stable, since we hold both
> + * next->pi_lock and it's rq->lock
> + *
> + * OK -- we have a donor, it is blocked on the lock we're about
> + * to release and it cannot run on this CPU -- fixes are
> + * required.
> + *
> + * Dequeue the task, such that ttwu() can fix up the placement thing.
> + */
> + if (!is_cpu_allowed(next, cpu_of(scope.rq)))
nit: we'd also want to check that it's not on its wake_cpu, so we
return the task there and proxy migrations don't upset its original
placement (rough sketch of the check after the diff below).
> + deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> + }
> +
> + return next;
> +}
> +
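(For the wake_cpu nit above, roughly the check I'm picturing, untested:

	/* also return-migrate if proxying left the task off its wake_cpu */
	if (cpu_of(scope.rq) != next->wake_cpu ||
	    !is_cpu_allowed(next, cpu_of(scope.rq)))
		deactivate_task(scope.rq, next, DEQUEUE_SLEEP);

so the task gets put back on its original CPU, not just any allowed one.)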
Ok. I'll stare at all this a bit and see if I can give it a try. I
fret that this doesn't handle the case where wakeups on the task occur
through other code paths (so we'd still need BO_WAKING/NEEDS_RETURN to
prevent the task from running until we migrate it back). I don't have
a specific case I can articulate, but my gut tells me the problem will
be with ww_mutexes, as those were a major source of problems with the
early versions of the patches, which I believe tried logic similar to
this.
Again, I really appreciate your insight and suggestions here!
thanks
-john