lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230601055846.2349566-1-jstultz@google.com>
Date:   Thu,  1 Jun 2023 05:58:03 +0000
From:   John Stultz <jstultz@...gle.com>
To:     LKML <linux-kernel@...r.kernel.org>
Cc:     John Stultz <jstultz@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Qais Yousef <qyousef@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Zimuzo Ezeozue <zezeozue@...gle.com>,
        Youssef Esmat <youssefesmat@...gle.com>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>,
        Boqun Feng <boqun.feng@...il.com>,
        "Paul E . McKenney" <paulmck@...nel.org>, kernel-team@...roid.com
Subject: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v3

After having to catch up on other work after OSPM[1], I've finally
gotten back to focusing on Proxy Execution and wanted to send out this
next iteration of the patch series for review, testing, and feedback.
(Many thanks to folks who provided feedback on the last revision!)

As mentioned previously, this Proxy Execution series has a long history:
First described in a paper[2] by Watkins, Straub, Niehaus, then from
patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
Valentin Schneider, and Connor O'Brien. (and thank you to Steven Rostedt
for providing additional details here!)

So again, many thanks to those above, as all the credit for this series
really is due to them - while the mistakes are likely mine.

Overview:
—----------
Proxy Execution is a generalized form of priority inheritance. Classic
priority inheritance works well for real-time tasks where there is a
straight forward priority order to how things are run. But it breaks
down when used between CFS or DEADLINE tasks, as there are lots
of parameters involved outside of just the task’s nice value when
selecting the next task to run (via pick_next_task()).  So ideally we
want to imbue the mutex holder with all the scheduler attributes of 
the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and
  which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so that we
  instead run the mutex holder. 

The first of these is the most difficult to grasp (I do get the mental
friction here: blocked tasks on the *run*queue sounds like nonsense!
Personally I like to think of the runqueue in this model more like a
“task-selection queue”).

By leaving blocked tasks on the runqueue, we allow pick_next_task() to
choose the task that should run next (even if it’s blocked waiting on a
mutex). If we do select a blocked task, we look at the task’s blocked_on
mutex and from there look at the mutex’s owner task. And in the simple
case, the task which owns the mutex is what we then choose to run,
allowing it to release the mutex.

This means that instead of just tracking “curr”, the scheduler needs to
track both the scheduler context (what was picked and all the state used
for scheduling decisions), and the execution context (what we’re
running)

In this way, the mutex owner is run “on behalf” of the blocked task
that was picked to run, essentially inheriting the scheduler context of
the blocked task.

As Connor outlined in a previous submission of this patch series,  this
raises a number of complicated situations:  The mutex owner might itself
be blocked on another mutex, or it could be sleeping, running on a
different CPU, in the process of migrating between CPUs, etc.

But the functionality provided by Proxy Execution is useful, as in
Android we have a number of cases where we are seeing priority inversion
(not unbounded, but longer than we’d like) between “foreground” and
“background” SCHED_NORMAL applications, so having a generalized solution
would be very useful.

New in v4:
—------
* Fixed deadlock that was caused by wait/wound mutexes having circular
  blocked_on references by clearing the blocked_on pointer on the task
  we are waking to wound/die. 

* Tried to resolve an issue Dietmar raised with RT balancing where the
  proxy migration and push_rt_task() were fighting ping-ponging tasks
  back and forth, caused by push_rt_task() migrating tasks even if they
  were in the chain that ends with the current running task. Though this
  likely needs more work, as this change resulted in different migration
  quirks (see below).

* Fixed a number of null-pointer traversals that the changes were
  occasionally tripping on

* Reworked patch that exposes __mutex_owner() to the scheduler to ensure
  it doesn’t expose it any more than necessary, as suggested by Peter. 

* To address some of Peter’s complaints, backed out the rq_curr()
  wrapper, and reworked rq_selected() to be a macro to avoid needing
  multiple accessors for READ_ONCE/rcu_dereference() type accesses. 

* Removed verbose legacy comments from previous developers of the series
  as Dietmar was finding them distracting when reviewing the diffs
  (Though, to ensure I heed the warnings from previous experienced
  travelers, I’ve preserved the comments/questions in a separate patch
  in my own development tree).

* Dropped patch that added *_task_blocked_on() wrappers to check locking
  correctness. Mostly as Peter didn’t seem happy with the wrappers in
  other patches, but I still think it's useful for testing (and the
  blocked_on locking still needs some work), so I’m carrying it in my
  personal development tree.

Issues still to address:
—----------
* Occasional getting null scheduler entities from pick_next_entity() in
  CFS. I’m a little stumped as to where this is going awry just yet, and
  delayed sending this out, but figured it was worth getting it out for
  review on the other issues while I chase this down.

* Better deadlock handling in proxy(): With the ww_mutex issues
  resolved, we shouldn’t see circular blocked_on references, but a
  number of the bugs I’ve been chasing recently come from getting stuck
  with proxy() returning null forcing a reselection over and over. These
  are still bugs to address, but my current thinking is that if we get
  stuck like this, we can start to remove the selected mutex blocked
  tasks from the rq, and let them be woken from the mutex waiters list
  as is done currently? Thoughts here would be appreciated.

* More work on migration correctness (RT/DL load balancing,etc). I’m
  still seeing occasional trouble as cpu counts go up which seems to be
  due to a bunch of tasks being proxy migrated to a cpu, then having to
  migrate them all away at once (seeing lots of pick again iterations).
  This may actually be correct, due to chain migration, but it ends up
  looking similar to a deadlock.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought of
  a better name. Open to suggestions.

* As discussed at OSPM, I want to split pick_next_task() up into two
  phases selecting and setting the next tasks, as currently
  pick_next_task() assumes the returned task will be run which results
  in various side-effects in sched class logic when it’s run. This
  causes trouble should proxy() require us to re-select a task due to
  migration or other edge cases.

* CFS load balancing. Blocked tasks may carry forward load (PELT) to the
  lock owner's CPU, so CPU may look like it is overloaded.

* I still want to push down the split scheduler and execution context
  awareness further through the scheduling code, as lots of logic still
  assumes there’s only a single “rq->curr” task.

* Optimization to avoid migrating blocked tasks (allowing for optimistic
  spinning) if the runnable lock-owner at the end of the blocked_on chain
  is already running.  


Performance:
—----------
This patch series switches mutexes to use handoff mode rather than
optimistic spinning. This is a potential concern where locks are under
high contention. However, so far in our initial performance analysis (on
both x86 and mobile devices) we’ve not seen major regressions. That
said, Chenyu did report a regression[3], which we’ll need to look
further into. As mentioned above, there may be some optimizations that
can help here, but my focus is on getting the code working well before I
spend time optimizing.

Review and feedback would be greatly appreciated!

If folks find it easier to test/tinker with, this patch series can also
be found here (along with some debug patches):
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v4-6.4-rc3

Thanks so much!
-john

[1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/

Cc: Joel Fernandes <joelaf@...gle.com>
Cc: Qais Yousef <qyousef@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
Cc: Youssef Esmat <youssefesmat@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
Cc: Will Deacon <will@...nel.org>
Cc: Waiman Long <longman@...hat.com>
Cc: Boqun Feng <boqun.feng@...il.com>
Cc: "Paul E . McKenney" <paulmck@...nel.org>
Cc: kernel-team@...roid.com

Connor O'Brien (1):
  sched: Attempt to fix rt/dl load balancing via chain level balance

John Stultz (3):
  sched: Unnest ttwu_runnable in prep for proxy-execution
  sched: Fix runtime accounting w/ proxy-execution
  sched: Fixups to find_exec_ctx

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (6):
  sched: Unify runtime accounting across classes
  locking/ww_mutex: Remove wakeups from under mutex::wait_lock
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  sched: Split scheduler execution context
  sched: Add proxy execution

Valentin Schneider (1):
  sched/rt: Fix proxy/current (push,pull)ability

 include/linux/sched.h        |  10 +-
 include/linux/ww_mutex.h     |   3 +
 init/Kconfig                 |   7 +
 init/init_task.c             |   1 +
 kernel/Kconfig.locks         |   2 +-
 kernel/fork.c                |   6 +-
 kernel/locking/mutex-debug.c |   9 +-
 kernel/locking/mutex.c       | 113 ++++--
 kernel/locking/mutex.h       |  25 ++
 kernel/locking/ww_mutex.h    |  54 ++-
 kernel/sched/core.c          | 719 +++++++++++++++++++++++++++++++++--
 kernel/sched/cpudeadline.c   |  12 +-
 kernel/sched/cpudeadline.h   |   3 +-
 kernel/sched/cpupri.c        |  28 +-
 kernel/sched/cpupri.h        |   6 +-
 kernel/sched/deadline.c      | 187 +++++----
 kernel/sched/fair.c          |  99 +++--
 kernel/sched/rt.c            | 242 +++++++-----
 kernel/sched/sched.h         |  75 +++-
 kernel/sched/stop_task.c     |  13 +-
 20 files changed, 1284 insertions(+), 330 deletions(-)

-- 
2.41.0.rc0.172.g3f132b7071-goog

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ