Message-ID: <CANDhNCpOdnmC+dB+uwt82XrxFuzx72d+5w1j21eFovSeFr8pDw@mail.gmail.com>
Date: Fri, 7 Nov 2025 15:18:32 -0800
From: John Stultz <jstultz@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Valentin Schneider <vschneid@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Daniel Lezcano <daniel.lezcano@...aro.org>, Suleiman Souhlal <suleiman@...gle.com>, 
	kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, kernel-team@...roid.com
Subject: Re: [PATCH v23 6/9] sched: Handle blocked-waiter migration (and
 return migration)

On Thu, Oct 30, 2025 at 2:33 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
> On 10/30/2025 5:48 AM, John Stultz wrote:
> > -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > +/*
> > + * If the blocked-on relationship crosses CPUs, migrate @p to the
> > + * owner's CPU.
> > + *
> > + * This is because we must respect the CPU affinity of execution
> > + * contexts (owner) but we can ignore affinity for scheduling
> > + * contexts (@p). So we have to move scheduling contexts towards
> > + * potential execution contexts.
> > + *
> > + * Note: The owner can disappear, but simply migrate to @target_cpu
> > + * and leave that CPU to sort things out.
> > + */
> > +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p, int target_cpu)
> >  {
> > -     if (!__proxy_deactivate(rq, donor)) {
> > -             /*
> > -              * XXX: For now, if deactivation failed, set donor
> > -              * as unblocked, as we aren't doing proxy-migrations
> > -              * yet (more logic will be needed then).
> > -              */
> > -             clear_task_blocked_on(donor, NULL);
> > +     struct rq *target_rq = cpu_rq(target_cpu);
> > +
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     /*
> > +      * Since we're going to drop @rq, we have to put(@rq->donor) first,
> > +      * otherwise we have a reference that no longer belongs to us.
> > +      *
> > +      * Additionally, as we put_prev_task(prev) earlier, it's possible that
> > +      * prev will migrate away as soon as we drop the rq lock; however, we
> > +      * still have it marked as rq->curr, as we've not yet switched tasks.
> > +      *
> > +      * So call proxy_resched_idle() to let go of the references before
> > +      * we release the lock.
> > +      */
> > +     proxy_resched_idle(rq);
> > +
> > +     WARN_ON(p == rq->curr);
> > +
> > +     deactivate_task(rq, p, 0);
>
> DEQUEUE_NOCLOCK since we arrive here with the clock updated from
> schedule().

Ah, good point.
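
Something like this, then (untested sketch; just reusing the clock
update already done on the way in from __schedule()):

	/* rq clock was already updated in __schedule(), so skip it here */
	deactivate_task(rq, p, DEQUEUE_NOCLOCK);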

> > +     proxy_set_task_cpu(p, target_cpu);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +     raw_spin_rq_lock(target_rq);
> > +
> > +     activate_task(target_rq, p, 0);
> > +     wakeup_preempt(target_rq, p, 0);
> > +
> > +     raw_spin_rq_unlock(target_rq);
> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
>
> We should perhaps update the clock once we've reacquired the rq_lock
> given we are going into schedule() again for another pick.

No objection to adding this, though I wonder why I'm not hitting the
usual warnings. I'll have to think through that a bit.
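
For reference, I'd imagine roughly the following (untested sketch,
exact placement is my guess):

	raw_spin_rq_unlock(target_rq);
	raw_spin_rq_lock(rq);
	rq_repin_lock(rq, rf);
	/* refresh the clock before heading back into pick-again */
	update_rq_clock(rq);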


> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     struct rq *this_rq, *target_rq;
> > +     struct rq_flags this_rf;
> > +     int cpu, wake_flag = 0;
> > +
> > +     lockdep_assert_rq_held(rq);
> > +     WARN_ON(p == rq->curr);
> > +
> > +     get_task_struct(p);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +
> > +     /*
> > +      * We drop the rq lock, and re-grab task_rq_lock to get
> > +      * the pi_lock (needed for select_task_rq) as well.
> > +      */
> > +     this_rq = task_rq_lock(p, &this_rf);
> > +     update_rq_clock(this_rq);
>
> I think we can delay the clock update until proxy_resched_idle().

You're thinking just to avoid it if we trip into the error paths?  Sounds good.
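
So roughly (untested sketch):

	this_rq = task_rq_lock(p, &this_rf);

	/* the bail-out checks below can goto err_out without the clock update */
	...

	update_rq_clock(this_rq);
	proxy_resched_idle(this_rq);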


> > +
> > +     /*
> > +      * Since we let go of the rq lock, the task may have been
> > +      * woken or migrated to another rq before we got the
> > +      * task_rq_lock. So re-check we're on the same RQ. If
> > +      * not, the task has already been migrated and that CPU
> > +      * will handle any further migrations.
> > +      */
> > +     if (this_rq != rq)
> > +             goto err_out;
> > +
> > +     /* Similarly, if we've been dequeued, someone else will wake us */
> > +     if (!task_on_rq_queued(p))
> > +             goto err_out;
> > +
> > +     /*
> > +      * Since we should only be calling here from __schedule()
> > +      * -> find_proxy_task(), no one else should have
> > +      * assigned current out from under us. But check and warn
> > +      * if we see this, then bail.
> > +      */
> > +     if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> > +             WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> > +                       __func__, cpu_of(this_rq),
> > +                       p->comm, p->pid, p->on_cpu);
> > +             goto err_out;
> >       }
> > -     return NULL;
> > +
> > +     proxy_resched_idle(this_rq);
> > +     deactivate_task(this_rq, p, 0);
>
> This should add DEQUEUE_NOCLOCK since we've already updated the rq clock
> before the call.

Ack.


> > +     cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> > +     set_task_cpu(p, cpu);
> > +     target_rq = cpu_rq(cpu);
> > +     clear_task_blocked_on(p, NULL);
> > +     task_rq_unlock(this_rq, p, &this_rf);
> > +
> > +     /* Drop this_rq and grab target_rq for activation */
> > +     raw_spin_rq_lock(target_rq);
> > +     activate_task(target_rq, p, 0);
> > +     wakeup_preempt(target_rq, p, 0);
> > +     put_task_struct(p);
> > +     raw_spin_rq_unlock(target_rq);
> > +
> > +     /* Finally, re-grab the original rq lock and return to pick-again */
> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
> > +     return;
> > +
> > +err_out:
> > +     put_task_struct(p);
> > +     task_rq_unlock(this_rq, p, &this_rf);
>
> I believe as long as we have the task_rq_lock(), the task cannot
> disappear under us, but is there any concern with doing a
> put_task_struct() and then using the same task reference for
> task_rq_unlock()?

Yeah. Seems proper to do it the way you're suggesting.
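
i.e. something like (sketch):

err_out:
	task_rq_unlock(this_rq, p, &this_rf);
	/* drop the reference only after we're done using p */
	put_task_struct(p);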


> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
>
> Probably a clock update once we reacquire the rq_lock since we
> go into schedule() again to retry pick?
>

Ack.


> > @@ -6689,26 +6834,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >                       return NULL;
> >               }
> >
> > +             if (task_current(rq, p))
> > +                     curr_in_chain = true;
> > +
> >               owner = __mutex_owner(mutex);
> >               if (!owner) {
> >                       /*
> > -                      * If there is no owner, clear blocked_on
> > -                      * and return p so it can run and try to
> > -                      * acquire the lock
> > +                      * If there is no owner, either clear blocked_on
> > +                      * and return p (if it is current and safe to
> > +                      * just run on this rq), or return-migrate the task.
> >                        */
> > -                     __clear_task_blocked_on(p, mutex);
> > -                     return p;
> > +                     if (task_current(rq, p)) {
> > +                             __clear_task_blocked_on(p, NULL);
> > +                             return p;
> > +                     }
> > +                     action = NEEDS_RETURN;
> > +                     break;
> >               }
> >
> >               if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>
> Should we handle task_on_rq_migrating() in a similar way?
> Wait for the owner to finish migrating and look at the
> task_cpu(owner) once it is reliable?

Hrm. I'm not quite sure I understand your suggestion here. Could you
expand on it a bit? Are you thinking we should deactivate the donor
when the owner is migrating? What would then return the donor to the
runqueue? Just rescheduling idle so that we drop the rq lock
momentarily should be sufficient to make sure the owner can finish
migration.

As always, I really appreciate your review and feedback!
Thanks so much!
-john
