Message-ID: <CANDhNCraMepXyQPs1q-aNa+Gh745WpaFPkngA9Eohi9vXRpe+w@mail.gmail.com>
Date: Fri, 13 Dec 2024 19:39:57 -0800
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelaf@...gle.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, 
	Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>, 
	Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, kernel-team@...roid.com, 
	"Connor O'Brien" <connoro@...gle.com>
Subject: Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on

On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
>
> > Also add a blocked_on_state value so we can distinguish when a
> > task is blocked_on a mutex, but is either blocked, waking up, or
> > runnable (such that it can try to acquire the lock it's blocked
> > on).
> >
> > This avoids some of the subtle & racy games where the blocked_on
> > state gets cleared, only to have it re-added by the
> > mutex_lock_slowpath call when it tries to acquire the lock on
> > wakeup.
>
> If you can remember those subtle cases, I'm sure our future selves
> would've loved it if you wrote a comment to go with these states :-)

Thanks so much for the review feedback! I really appreciate it!

So yes, the description could use improvement here. At one time I had
3-4 separate, very fine-grained patches (see the top 4 patches here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
that I rolled into one when sending out (mostly to avoid overwhelming
folks), but the squashed commit description isn't as clear.
So if it's helpful, I can split this back out?

I'll also add some better comments.


> > +static inline void set_blocked_on_runnable(struct task_struct *p)
> > +{
> > +     unsigned long flags;
> > +
> > +     if (!sched_proxy_exec())
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&p->blocked_lock, flags);
>
> Do we want to make this:
>
>         guard(raw_spinlock_irqsave)(&p->blocked_lock);
>
> ?

Sure, sorry: while I've seen the guard bits, I haven't really used them yet.
I'll take a stab at this for v15.
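
For v15 I'm imagining it would end up looking something like this
(untested sketch, using the cleanup.h guard infrastructure):

	static inline void set_blocked_on_runnable(struct task_struct *p)
	{
		if (!sched_proxy_exec())
			return;

		/* the lock is dropped automatically at end of scope */
		guard(raw_spinlock_irqsave)(&p->blocked_lock);
		__set_blocked_on_runnable(p);
	}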

>
> > +     __set_blocked_on_runnable(p);
> > +     raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> > +}
> > +
> > +static inline void set_blocked_on_waking(struct task_struct *p)
>
> consistent naming would be __set_blocked_on_waking()

Yeah, I started with the unlocked versions and then added the one case
that takes the lock, but that does make it inconsistent. I'll rework
all these. Thanks for pointing this out!
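
So to keep the convention consistent, something like the following
(untested, just to illustrate the locked/unlocked naming):

	/* Lockless variant; callers must hold p->blocked_lock */
	static inline void __set_blocked_on_waking(struct task_struct *p)
	{
		lockdep_assert_held(&p->blocked_lock);
		p->blocked_on_state = BO_WAKING;
	}

	static inline void set_blocked_on_waking(struct task_struct *p)
	{
		guard(raw_spinlock_irqsave)(&p->blocked_lock);
		__set_blocked_on_waking(p);
	}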


> > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       goto err_early_kill;
> >       }
> >
> > +     set_task_blocked_on(current, lock);
> >       set_current_state(state);
>
> blocked_on_state mirrors task-state

Yeah. I know I always move a little fast in my talks, but at both OSPM
and LPC I've raised the point that blocked_on_state very much seems to
alias the task->__state.
(See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)

I've not quite worked out how to integrate it though.

My initial introduction of the blocked_on_state was mostly to help
untangle the code, as there were originally a number of places where,
in order to let the task be selected so it could try to acquire the
mutex, we cleared the task->blocked_on pointer. But this often raced
with ttwu: once the blocked_on pointer was cleared, the task could be
selected to run before the return migration happened. Making the
tri-state BLOCKED->WAKING->RUNNABLE explicit helped ensure we enforced
the rules properly and didn't run on the wrong cpu.
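
To spell the tri-state out (roughly; the aliasing to task->__state in
the comments is my reading, not something the patch encodes):

	enum blocked_on_state {
		BO_RUNNABLE,	/* can run & retry the lock (~TASK_RUNNING) */
		BO_BLOCKED,	/* blocked on the mutex (~sleeping) */
		BO_WAKING,	/* woken, may still need return migration */
	};

	/*
	 * Transitions:
	 *   BO_RUNNABLE -> BO_BLOCKED   in the mutex_lock() slowpath
	 *   BO_BLOCKED  -> BO_WAKING    at wakeup (ttwu / mutex unlock)
	 *   BO_WAKING   -> BO_RUNNABLE  once any return migration is done
	 */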

Squeezing this tri-state into task->__state has eluded me a bit,
mostly because the lock-free manipulations are a little subtle:
defaulting to RUNNABLE is usually safe, since the task will just loop
on its wait condition, but here we really don't want to lose the
signal that we need to do a return migration before the task runs. So
I'd love your insight here.

One thing I've been thinking about is reworking blocked_on_state into
just a "needs_return" flag. This might simplify things, as we would
only set needs_return when we proxy-migrate away from task->wake_cpu,
so we wouldn't have to handle the WAKING->RUNNABLE transition for
tasks that were never migrated. And once it's down to a single bit, it
might work better as a task->__state flag (though again, it has to
prevent the transition to TASK_RUNNING until we migrate back).
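
Very roughly, something like this (hypothetical, all names made up):

	/*
	 * Hypothetical sketch: a single bit replacing the tri-state,
	 * set only when we proxy-migrate off the task's wake_cpu.
	 */
	static inline void proxy_mark_needs_return(struct task_struct *p,
						   int dest_cpu)
	{
		lockdep_assert_held(&p->blocked_lock);
		if (dest_cpu != p->wake_cpu)
			p->needs_return = true;
	}

	/* ttwu would then hold the task off TASK_RUNNING while true */
	static inline bool proxy_needs_return(struct task_struct *p)
	{
		return sched_proxy_exec() && p->needs_return;
	}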

Thoughts?


> > @@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       break;
> >
> >               if (first) {
> > +                     bool opt_acquired;
> > +
> > +                     /*
> > +                      * mutex_optimistic_spin() can schedule, so  we need to
> > +                      * release these locks before calling it.
> > +                      */
> > +                     current->blocked_on_state = BO_RUNNABLE;
> > +                     raw_spin_unlock(&current->blocked_lock);
> > +                     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> >                       trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> > -                     if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> > +                     opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
>
> I'm confused -- didn't I kill optimistic spinning for proxy? IIRC it
> fundamentally doesn't make sense since we do a hand-off to the donor
> thread.

So, the trouble was that killing optimistic spinning with proxy-exec
had a significant performance impact. So around v8 (this time last
year, I think), I worked to re-enable it.

The idea is that as long as the lock owner is running, we do the
optimistic spin, since there's no benefit to proxy-boosting it (the
owner is already running; mission accomplished). But if it's not
on_cpu, we consider ourselves blocked and go into __schedule(), where
we can be selected and boost whomever the chain is waiting on. Then,
if we are proxy-boosting, on release we can hand the lock over to the
boosting donor.

This feels like it nicely preserves the normal non-proxy performance
where possible, switching to proxy-boosting only when the lock is
going to take longer to be released.
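
In other words, the gate is roughly this (a simplified sketch, not
the exact patch logic):

	static inline bool proxy_should_spin(struct mutex *lock)
	{
		struct task_struct *owner = __mutex_owner(lock);

		/*
		 * If the owner is on_cpu, spinning beats boosting a task
		 * that is already running. If it's off-cpu, report
		 * blocked so __schedule() can proxy-boost the chain.
		 */
		return owner && READ_ONCE(owner->on_cpu);
	}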

> All in all I'm properly confused by this patch... please write
> more/better comments changelog.

Sure, I think it's probably worth splitting it out, so I'll take a
swing at that before v15.

I'd definitely appreciate your thoughts on the idea of a
TASK_NEEDS_RETURN __state flag, or on whether you see a more elegant
way to put an intermediate step into the wakeup path, to ensure we
don't wake proxy-migrated tasks on the wrong cpu.

Thanks so much again for the input! Really appreciate the feedback!
-john
