lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231106193524.866104-1-jstultz@google.com>
Date:   Mon,  6 Nov 2023 19:34:43 +0000
From:   John Stultz <jstultz@...gle.com>
To:     LKML <linux-kernel@...r.kernel.org>
Cc:     John Stultz <jstultz@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Qais Yousef <qyousef@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Zimuzo Ezeozue <zezeozue@...gle.com>,
        Youssef Esmat <youssefesmat@...gle.com>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>,
        Boqun Feng <boqun.feng@...il.com>,
        "Paul E . McKenney" <paulmck@...nel.org>, kernel-team@...roid.com
Subject: [PATCH v6 00/20] Proxy Execution: A generalized form of Priority
 Inheritance v6

Stabilizing this Proxy Execution series has unfortunately
continued to be a challenging task. Since the v5 release, I’ve
been focused on getting the deactivated/sleeping owner enqueuing
functionality, which I left out of v5, stabilized. I’ve managed
to rework enough to avoid the crashes previously tripped with the
locktorture & ww_mutex selftests, so I feel it’s much improved,
but I do still see some issues (blocked waitqueues and hung task
watchdogs firing) after stressing the system for many hours in a
64 cpu qemu environment (though this issue seems to be introduced
earlier in the series with proxy-migration/return-migration).

I still haven’t had time to focus on testing and debugging the
chain migration patches. So I’ve left that patch out of this
submission for now, but will try to get it included in the next
revision.

This patch series is actually coarser than what I’ve been
developing with, as there are a number of small “test” steps to
help validate behavior I changed, which would then be replaced by
the real logic afterwards. Including those here would just cause
more work for reviewers, so I’ve folded them together, but if
you’re interested you can find the fine-grained tree here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6-fine-grained
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6-fine-grained

As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[1] by Watkins, Straub,
Niehaus, then from patches from Peter Zijlstra, extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (and thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
—----------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straight forward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()).  So ideally we want to imbue the mutex holder
with all the scheduler attributes of  the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
  on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
  that we instead run the mutex holder. 

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations:  The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

The complexity from this is imposing, but currently in Android we
have a large number of cases where we are seeing priority
inversion (not unbounded, but much longer than we’d like) between
“foreground” and “background” SCHED_NORMAL applications. As a
result, currently we cannot usefully limit background activity
without introducing inconsistent behavior. So Proxy Execution is
a very attractive approach to resolving this critical issue.


New in v6:
---------
* Big rework of blocked owner enqueuing, re-adding the
  functionality to the series.

* Added a boot option proxy_exec= and additional logic to allow
  the functionality to be boot time toggled.

* Minor tweaks to wake_q logic suggested by Waiman Long

* Minor fixups Reported-by: kernel test robot <lkp@...el.com>

* Various cleanups, optimizations as well as fixes for bugs found in v5

Also, I know Peter didn’t like the blocked_on wrappers, so I
previously dropped them, but I found them (and the debug checks
within) crucial to working out issues in this patch series. I’m
fine to consider dropping them later if they are still
objectionable, but their utility was too high at this point to
get rid of them.


Issues still to address:
—----------
* As already mentioned, the sleeping owner handling (where we
  deactivate waiting tasks and enqueue them onto a list, then
  reactivate them when the owner wakes up) has been very
  difficult to get right. This is in part because when we want
  to activate tasks, we’re already holding a task.pi_lock and a
  rq_lock, just not the locks for the task we’re activating, nor
  the rq we’re  enqueuing it onto. So there has to be a bit of
  lock juggling to drop and acquire the right locks (in the right
  order).

* Similarly as we’re often dealing with lists of tasks or chains
  of tasks and mutexes, iterating across these chains of objects
  can be done safely while holding the rq lock, but as these
  chains can cross runqueues our ability to traverse the chains
  safely is somewhat limited. 

* Additionally, in the sleeping owner enqueueing as well as in
  proxy return migration, we seem to be constantly working
  against the proper locking order (task.pi_lock->rq lock
  ->mutex.waitlock -> task.blocked_lock), having to repeatedly
  “swim upstream” where we need a higher order lock then what
  we’re already holding. Suggestions for different approaches to
  the serialization or ways to move the logic outside of where
  locks are already held would be appreciated.

* Still need to validate and re-add the chain migration patches
  to ensure they resolve the RT/DL load balancing invariants.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
  thought of a better name. Open to suggestions.

* As discussed at OSPM[2], I like to split pick_next_task() up
  into two phases selecting and setting the next tasks, as
  currently pick_next_task() assumes the returned task will be
  run which results in various side-effects in sched class logic
  when it’s run. I tried to take a pass at this earlier, but it’s
  hairy and lower on the priority list for now.

* CFS load balancing. Blocked tasks may carry forward load (PELT)
  to the lock owner's CPU, so CPU may look like it is overloaded.

* Optimization to avoid migrating blocked tasks (to preserve
  optimistic mutex spinning) if the runnable lock-owner at the
  end of the blocked_on chain is already running (though this is
  difficult due to the limitations from locking rules needed to
  traverse the blocked on chain across run queues).


Performance:
—----------
This patch series switches mutexes to use handoff mode rather
than optimistic spinning. This is a potential concern where locks
are under high contention. However, earlier performance analysis
(on both x86 and mobile devices) did not see major regressions.
That said, Chenyu did report a regression[3], which I’ll need to
look further into. I also briefly re-tested with this v5 series
and saw some average latencies grow vs v4, suggesting the changes
to return-migration (and extra locking) have some impact. With v6
the extra overhead is reduced but still not as nice as v4. I’ll
be digging more there, but my priority is still stability over
speed at this point (it’s easier to validate correctness of
optimizations if the baseline isn’t crashing).


If folks find it easier to test/tinker with, this patch series
can also be found here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v6-6.6
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v6-6.6

Awhile back Sage Sharp had a nice blog post about types of
reviews [4], and while any review and feedback would be greatly
appreciated, those focusing on conceptual design or correctness
issues would be *especially* valued.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[2] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
[4] https://sage.thesharps.us/2014/09/01/the-gentle-art-of-patch-review/


Cc: Joel Fernandes <joelaf@...gle.com>
Cc: Qais Yousef <qyousef@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
Cc: Youssef Esmat <youssefesmat@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
Cc: Will Deacon <will@...nel.org>
Cc: Waiman Long <longman@...hat.com>
Cc: Boqun Feng <boqun.feng@...il.com>
Cc: "Paul E . McKenney" <paulmck@...nel.org>
Cc: kernel-team@...roid.com


John Stultz (8):
  locking/mutex: Removes wakeups from under mutex::wait_lock
  sched: Add CONFIG_PROXY_EXEC & boot argument to enable/disable
  locking/mutex: Split blocked_on logic into two states (blocked_on and
    blocked_on_waking)
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Split out __sched() deactivate task logic into a helper
  sched: Add a very simple proxy() function
  sched: Add proxy deactivate helper
  sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (8):
  sched: Unify runtime accounting across classes
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
  sched: Split scheduler execution context
  sched: Start blocked_on chain processing in proxy()
  sched: Add blocked_donor link to task for smarter mutex handoffs
  sched: Add deactivated (sleeping) owner handling to proxy()

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt         |   4 +
 include/linux/sched.h                         |  49 +-
 init/Kconfig                                  |   7 +
 init/init_task.c                              |   1 +
 kernel/Kconfig.locks                          |   2 +-
 kernel/fork.c                                 |   8 +-
 kernel/locking/mutex-debug.c                  |   9 +-
 kernel/locking/mutex.c                        | 163 ++--
 kernel/locking/mutex.h                        |  25 +
 kernel/locking/ww_mutex.h                     |  72 +-
 kernel/sched/core.c                           | 723 ++++++++++++++++--
 kernel/sched/deadline.c                       |  50 +-
 kernel/sched/fair.c                           | 104 ++-
 kernel/sched/rt.c                             |  77 +-
 kernel/sched/sched.h                          |  61 +-
 kernel/sched/stop_task.c                      |  13 +-
 16 files changed, 1117 insertions(+), 251 deletions(-)

-- 
2.42.0.869.gea05f2083d-goog

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ