lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250625203110.2299275-1-jstultz@google.com>
Date: Wed, 25 Jun 2025 20:30:53 +0000
From: John Stultz <jstultz@...gle.com>
To: LKML <linux-kernel@...r.kernel.org>
Cc: John Stultz <jstultz@...gle.com>, Joel Fernandes <joelagnelf@...dia.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Valentin Schneider <vschneid@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, 
	Suleiman Souhlal <suleiman@...gle.com>, kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, 
	kernel-team@...roid.com
Subject: [PATCH v18 0/8] Single RunQueue Proxy Execution (v18)

Hey All,

After not getting much response from the v17 series (and
resending it), I was going to continue to just iterate resending
the v17 single runqueue focused series. However, Suleiman had a
very good suggestion for improving the larger patch series and a
few of the tweaks for those changes trickled back into the set
I’m submitting here.

Unfortunately those later changes also uncovered some stability
problems with the full proxy-exec patch series, which took a
painfully long time (stress testing taking 30-60 hours to trip
the problem) to resolve. However, after finally sorting those
issues out it has been running well, so I can now send out the
next revision (v18) of the set.

So here is v18 of the Proxy Execution series, a generalized form
of priority inheritance.

As I’m trying to submit this work in smallish digestible pieces,
in this series, I’m only submitting for review the logic that
allows us to do the proxying if the lock owner is on the same
runqueue as the blocked waiter: Introducing the
CONFIG_SCHED_PROXY_EXEC option and boot-argument, reworking the
task_struct::blocked_on pointer and wrapper functions, the
initial sketch of the find_proxy_task() logic, some fixes for
using split contexts, and finally same-runqueue proxying. 

As I mentioned above, for the series I’m submitting here, it has
only barely changed from v17. With the main difference being
slightly different order of checks for cases where we don’t
actually do anything yet (more on why below), and use of
READ_ONCE for the on_rq reads to avoid the compiler fusing
loads, which I was bitten by with the full series.

However for the full proxy-exec series, there are a few
differences:
* Suleiman Souhlal noticed an inefficiency in that we evaluate
  if the lock owner’s task_cpu() is the current cpu, before we
  look to see if the lock owner is on_rq at all. With v17 this
  would result in us proxy-migrating a donor to a remote cpu,
  only to then realize the task wasn’t even on the runqueue,
  and doing the sleeping owner enqueuing. Suleiman suggested
  instead that we evaluate on_rq first, so we can immediately do
  sleeping owner enqueueing. Then only if the owner is on a
  runqueue do we proxy-migrate the donor (which requires the
  more costly lock juggling). While not a huge logical change,
  it did uncover other problems, which needed to be resolved.

* One issue found was there was a race where if
  do_activate_blocked_waiter() from the sleeping owner wakeup
  was delayed and the task had already been woken up elsewhere.
  It’s possible if that task was running and called into
  schedule() to be blocked, it would be dequeued from the
  runqueue, but before we switched to the new task,
  do_activate_blocked_waiter() might try to activate it on a
  different cpu. Clearly the do_activate_blocked_waiter() needed
  to check the task on_cpu value as well.

* I found that we still can hit wakeups that end up skipping the
  BO_WAKING -> BO_RUNNALBE transition (causing find_proxy_task()
  to end up spinning waiting for that transition), so I re-added
  the logic to handle doing return migrations from
  find_proxy_task() if we hit that case.

* Hupu suggested a tweak in ttwu_runnable() to evaluate
  proxy_needs_return() slightly earlier.

* Kuyo Chang reported and isolated a fix for a problem with
  __task_is_pushable() in the !sched_proxy_exec case, which was
  folded into the “sched: Fix rt/dl load balancing via chain
  level balance” patch

* Reworked some of the logic around releasing the rq->donor
  reference on migrations, using rq->idle directly.

* Sueliman also pointed out that some added task_struct elements
  were not being initialized in the init_task code path, so that
  was good to fix.

You can find the full series here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v18-6.16-rc3/
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v18-6.16-rc3

Issues still to address with the full series (not the patches
submitted here):
* Peter suggested an idea that instead of when tasks become
  unblocked, using (blocked_on_state == BO_WAKING) to protect
  against running proxy-migrated tasks on cpu’s they are not
  affined to, we could dequeue tasks first and then wake them.
  This does look to be cleaner in many ways, but the locking
  rework is significant and I’ve not worked out all the kinks
  with it yet. I am also a little worried that we may trip
  other wakeup paths that might not do the dequeue first.
  However, I have adopted this approach for the
  find_proxy_task() forced return migration, and it’s working
  well.

* Need to sort out what is needed for sched_ext to be ok with
  proxy-execution enabled.

* K Prateek Nayak did some testing about a bit over a year ago
  with an earlier version of the series and saw ~3-5%
  regressions in some cases. Need to re-evaluate this with the
  proxy-migration avoidance optimization Suleiman suggested now
  implemented.

* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the series, I’m all ears.

Credit/Disclaimer:
—--------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit: 

First described in a paper[1] by Watkins, Straub, Niehaus, then
from patches from Peter Zijlstra, extended with lots of work by
Juri Lelli, Valentin Schneider, and Connor O'Brien. (and thank
you to Steven Rostedt for providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@...dia.com>
Cc: Qais Yousef <qyousef@...alina.io>  
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Will Deacon <will@...nel.org>
Cc: Waiman Long <longman@...hat.com>
Cc: Boqun Feng <boqun.feng@...il.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Metin Kaya <Metin.Kaya@....com>
Cc: Xuewen Yan <xuewen.yan94@...il.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Daniel Lezcano <daniel.lezcano@...aro.org>
Cc: Suleiman Souhlal <suleiman@...gle.com>
Cc: kuyo chang <kuyo.chang@...iatek.com>
Cc: hupu <hupu.gm@...il.com>
Cc: kernel-team@...roid.com

John Stultz (4):
  sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  sched: Move update_curr_task logic into update_curr_se
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Add an initial sketch of the find_proxy_task() function

Peter Zijlstra (2):
  locking/mutex: Rework task_struct::blocked_on
  sched: Start blocked_on chain processing in find_proxy_task()

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt         |   5 +
 include/linux/sched.h                         |  72 ++++-
 init/Kconfig                                  |  12 +
 kernel/fork.c                                 |   3 +-
 kernel/locking/mutex-debug.c                  |   9 +-
 kernel/locking/mutex.c                        |  18 ++
 kernel/locking/mutex.h                        |   3 +-
 kernel/locking/ww_mutex.h                     |  16 +-
 kernel/sched/core.c                           | 257 +++++++++++++++++-
 kernel/sched/deadline.c                       |   7 +
 kernel/sched/fair.c                           |  65 +++--
 kernel/sched/rt.c                             |   5 +
 kernel/sched/sched.h                          |  22 +-
 13 files changed, 448 insertions(+), 46 deletions(-)

-- 
2.50.0.727.gbf7dc18ff4-goog


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ