Message-ID: <CANDhNCpTfZFOkUkB4f4iQwXA3wnsDuUA_1ZLuseGYunnpgO9Rw@mail.gmail.com>
Date: Mon, 16 Dec 2024 21:01:24 -0800
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelaf@...gle.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, 
	Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>, 
	Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, kernel-team@...roid.com, 
	"Connor O'Brien" <connoro@...gle.com>
Subject: Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on

On Mon, Dec 16, 2024 at 8:54 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> > On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@...radead.org> wrote:
> > > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> > So yes, the description can use improvement here. I at one time had
> > 3-4 separate very fine grained patches (see the top 4 patches here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> > that I rolled into one when sending out (mostly to avoid overwhelming
> > folks), but the squished commit description isn't as clear.
> > So if it's helpful I can split this back out?
> >
> > I'll also add some better comments as well.
>
> Not sure yet about splitting back out -- let me try and figure out what
> all is actually done / needed.
>
> So blocked_lock started out as another lock around ttwu(), in order to
> serialize the task wakeup vs reading a remote ->blocked_on relation.

I think of it primarily as serializing the task->blocked* state (there
gets to be quite a bit of it by the end of the proxy series).

> Since we do this with rq->lock held, it can't be ->pi_lock, and hence
> ->blocked_lock was born.

Yeah, we needed to use something other than task->pi_lock to serialize
it, since it has to nest under the mutex->wait_lock.
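To make the nesting concrete, this is roughly how I think of the
helper (a simplified sketch, not the exact code in the series):

	/* Sketch: record what 'p' is blocked on. The mutex's wait_lock
	 * is the outer lock; blocked_lock nests underneath it. */
	static void set_task_blocked_on(struct task_struct *p, struct mutex *m)
	{
		lockdep_assert_held(&m->wait_lock);

		raw_spin_lock(&p->blocked_lock);
		p->blocked_on = m;
		p->blocked_on_state = BO_BLOCKED;
		raw_spin_unlock(&p->blocked_lock);
	}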

> Later patches appear to have moved it into mutex, mirroring the
> ->wait_lock -- this is probably better.
>
> /me goes chase that state thing for a bit..

? I'm not sure I followed this.  The blocked_lock continues to
serialize the task->blocked* state through the patch series.


> > > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > > >                       goto err_early_kill;
> > > >       }
> > > >
> > > > +     set_task_blocked_on(current, lock);
> > > >       set_current_state(state);
> > >
> > > blocked_on_state mirrors task-state
> >
> > Yeah. I know I always move a little fast in my talks, but at both OSPM
> > and LPC I've raised the point that the blocked_on_state very much
> > seems like it aliases the task->__state.
> > (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
> >
> > I've not quite worked out how to integrate it though.
> >
> > My initial introduction of the blocked_on_state was mostly to help
> > untangle the code, as there were a number of cases originally where,
> > in order to let the task be selected to try to acquire the mutex, we
> > cleared the task->blocked_on pointer.  But this often ran into races
> > with ttwu, since once the blocked_on pointer was cleared, the task
> > could get selected to run before the return migration happened. So
> > having the tri-state BLOCKED->WAKING->RUNNABLE be explicit helped
> > ensure we enforced the rules properly so we didn't run on the wrong CPU.
>
> Right, so we already have a TASK_WAKING state, that is that
> intermediate. But let me try and get a feel for how things relate;
> TASK_WAKING is after we've determined it is eligible for wakeup, but
> before it is enqueued -- eg. it sits on the wakelist.
>
> This BO_WAKING is not quite that. It happens before doing the actual
> wakeup.

Right. So BO_WAKING is set from the point where the mutex unlock (or
the ww_mutex wound/die code) tries to wake up a task: the task may
have been proxy-migrated, so we need to migrate it back before it runs.
So it's sort of a guard to make sure __schedule() doesn't just run the
task (as it could if it were BO_RUNNABLE), but also that we don't try
to proxy for it (as we would if it were BO_BLOCKED), since it first
needs to be migrated back.

My suspicion is reframing BO_WAKING as TASK_NEEDS_RETURN might be
doable, but I couldn't be sure we wouldn't hit an edge case where
TASK_RUNNING gets written over everything - effectively clearing the
guard.
Maybe the NEEDS_RETURN could just be a side state similar to the
blocked_on_state, and then BO_BLOCKED vs BO_RUNNABLE would just be
distinguishable by the !!task->blocked_on value.
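
To make that guard concrete, here's roughly how I picture the three
states and the check on the __schedule()/pick side (illustrative only;
proxy_task_usable() is a made-up name, not something in the series):

	enum blocked_on_state {
		BO_RUNNABLE,	/* fine for __schedule() to run directly  */
		BO_BLOCKED,	/* blocked on a mutex; ok to proxy for    */
		BO_WAKING,	/* mid-wakeup; may need return migration, */
				/* so neither run it nor proxy for it     */
	};

	/* Hypothetical helper for the pick path (not actual series code) */
	static bool proxy_task_usable(struct task_struct *p)
	{
		/* BO_WAKING means: wait until ttwu has migrated the task
		 * back to a CPU it is allowed to run on. */
		return p->blocked_on_state != BO_WAKING;
	}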

> Also, when I pull your proxy-exec-v7-6.7-rc6-fine-gained and put that
> next to your slides, I'm a little confused.

So, apologies for the confusion. The link I sent you to the
fine-grained changes is a bit old (v7), since I stopped maintaining
the fine-grained patches around v11 (and the most recently shared
version is v14).
I just shared the v7 link as an example, to see if it would be helpful
to split the patches here out again in a similar fashion (and I don't
think I published later versions of the fine-grained series).

> Specifically, slide 27 talks about a modification to try_to_wake_up() in
> order to force the task into ttwu_runnable() such that it then hits
> proxy_needs_return(). This latter part I can find, but not the former.
>
> /me puzzles

So the slides actually have links to the code at that time, which
should be the v12 series:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v12-6.11-rc5

And the try_to_wake_up() change from slide 27 is here:
https://github.com/johnstultz-work/linux-dev/commit/cc828a6bac87dcd5734902021973659fe52ef7e6#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4158

(and here's the link to the v14 version of the same:
https://github.com/johnstultz-work/linux-dev/commit/602a4197c83b83cff08c7adda31324739c7938c7#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4314
)

> Hmm, since a blocked task is on the runqueue and all that, you should
> always walk that path, anything else would be buggy.
>
> So fundamentally the problem is that you want to do a reverse migration
> -- it needs to go back to a CPU where it is allowed to run. The way it
> does this is to dequeue itself and let ttwu continue and clean up the
> mess.

Yep.
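For reference, this is how I currently think of that
dequeue-and-let-ttwu-clean-up flow, as a very rough sketch (not the
actual code in the series):

	/* Rough sketch of the ttwu_runnable() side: if the task was
	 * proxy-migrated and is mid-wakeup, dequeue it here so that
	 * ttwu continues through normal CPU selection and enqueues it
	 * on a CPU it is actually allowed to run on. */
	static bool proxy_needs_return(struct rq *rq, struct task_struct *p)
	{
		if (!sched_proxy_exec())
			return false;

		if (p->blocked_on_state != BO_WAKING)
			return false;

		deactivate_task(rq, p, DEQUEUE_SLEEP);
		return true;
	}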

> This all makes sense, but why can't you do this on the waking side? That
> is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
> deactivate_task() thing right there, before issuing the wakeup.
>
> Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():
>
>  - probably doesn't need current->blocked_lock, current isn't blocked
>
>  - probably doesn't need lock->wait_lock, donor is stable under
>    preempt_disable().
>
>
> Something like the completely untested below (also, ttwu path needs
> surgery now):
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 2711af8c0052..47cfa01cd066 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>         }
>
>         preempt_disable();
> +       next = proxy_handoff(lock);
> +
>         raw_spin_lock_irqsave(&lock->wait_lock, flags);
>         debug_mutex_unlock(lock);
>
> -       if (sched_proxy_exec()) {
> -               raw_spin_lock(&current->blocked_lock);
> -               /*
> -                * If we have a task boosting current, and that task was boosting
> -                * current through this lock, hand the lock to that task, as that
> -                * is the highest waiter, as selected by the scheduling function.
> -                */
> -               donor = current->blocked_donor;
> -               if (donor) {
> -                       struct mutex *next_lock;
> -
> -                       raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
> -                       next_lock = get_task_blocked_on(donor);
> -                       if (next_lock == lock) {
> -                               next = donor;
> -                               donor->blocked_on_state = BO_WAKING;
> -                               wake_q_add(&wake_q, donor);
> -                               current->blocked_donor = NULL;
> -                       }
> -                       raw_spin_unlock(&donor->blocked_lock);
> -               }
> -       }
> -
>         /*
>          * Failing that, pick any on the wait list.
>          */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6eaffa913495..30d7371bb5c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
>  }
>  #endif /* CONFIG_SCHED_PROXY_EXEC */
>
> +struct task_struct *proxy_handoff(struct mutex *lock)
> +{
> +       struct task_struct *next;
> +
> +       if (!sched_proxy_exec())
> +               return NULL;
> +
> +       /*
> +        * current->blocked_donor can't change if we can't schedule
> +        * the caller needs to do this, since it needs stability of the return value
> +        */
> +       lockdep_assert_preemption_disabled();
> +       next = current->blocked_donor;
> +       if (!next)
> +               return NULL;
> +
> +       scoped_guard (task_rq_lock, next) {
> +               /*
> +                * current->blocked_donor had better be on the same CPU as current
> +                */
> +               WARN_ON_ONCE(scope.rq != this_rq());
> +
> +               scoped_guard (raw_spin_lock, next->blocked_lock) {
> +                       /*
> +                        * WARN_ON on this? How can this happen
> +                        */
> +                       if (next->blocked_on != lock)
> +                               return NULL;
> +               }
> +
> +               /*
> +                * blocked_on relation is stable, since we hold both
> +                * next->pi_lock and it's rq->lock
> +                *
> +                * OK -- we have a donor, it is blocked on the lock we're about
> +                * to release and it cannot run on this CPU -- fixups are
> +                * required.
> +                *
> +                * Dequeue the task, such that ttwu() can fix up the placement thing.
> +                */
> +               if (!is_cpu_allowed(next, cpu_of(scope.rq)))

nit: we'd want to check that it's not on the wake_cpu (not just that
the current CPU is allowed), so we try to return it there and proxy
migrations don't upset the task's original placement. (Rough sketch of
what I mean below, after the quoted diff.)

> +                       deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> +       }
> +
> +       return next;
> +}
> +
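
Something like this for the check I was hinting at above (illustrative
only; reusing the names from your sketch, plus the existing
task_struct::wake_cpu field):

	/* Return the donor to its pre-proxy-migration CPU (wake_cpu),
	 * not just any allowed CPU, so proxy migrations don't disturb
	 * the task's original placement. */
	if (cpu_of(scope.rq) != next->wake_cpu ||
	    !is_cpu_allowed(next, cpu_of(scope.rq)))
		deactivate_task(scope.rq, next, DEQUEUE_SLEEP);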

Ok. I'll stare at all this a bit and see if I can give it a try. My
worry is that this doesn't handle the case where wakeups on the task
occur through other code paths (so we would still need
BO_WAKING/NEEDS_RETURN to prevent the task from running until we
migrate it back). I don't have a specific case I can articulate, but my
gut says the problem will be with ww_mutexes, as those were a major
source of problems with the early versions of the patches, which I
believe tried to use logic similar to this.

Again, I really appreciate your insight and suggestions here!

thanks
-john
