Message-ID: <20230601055846.2349566-1-jstultz@google.com>
Date: Thu, 1 Jun 2023 05:58:03 +0000
From: John Stultz <jstultz@...gle.com>
To: LKML <linux-kernel@...r.kernel.org>
Cc: John Stultz <jstultz@...gle.com>,
Joel Fernandes <joelaf@...gle.com>,
Qais Yousef <qyousef@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>,
Youssef Esmat <youssefesmat@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
"Paul E . McKenney" <paulmck@...nel.org>, kernel-team@...roid.com
Subject: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy Execution v4
After having to catch up on other work after OSPM[1], I've finally
gotten back to focusing on Proxy Execution and wanted to send out this
next iteration of the patch series for review, testing, and feedback.
(Many thanks to folks who provided feedback on the last revision!)
As mentioned previously, this Proxy Execution series has a long history:
First described in a paper[2] by Watkins, Straub, and Niehaus, then
developed in patches from Peter Zijlstra, and extended with lots of work
by Juri Lelli, Valentin Schneider, and Connor O'Brien. (And thank you to
Steven Rostedt for providing additional details here!)
So again, many thanks to those above, as all the credit for this series
really is due to them - while the mistakes are likely mine.
Overview:
---------
Proxy Execution is a generalized form of priority inheritance. Classic
priority inheritance works well for real-time tasks where there is a
straightforward priority order to how things are run. But it breaks
down when used between CFS or DEADLINE tasks, as there are lots
of parameters involved outside of just the task’s nice value when
selecting the next task to run (via pick_next_task()). So ideally we
want to imbue the mutex holder with all the scheduler attributes of
the blocked waiting task.
Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked on, and
which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so that we
instead run the mutex holder.
The first of these is the most difficult to grasp (I do get the mental
friction here: blocked tasks on the *run*queue sounds like nonsense!
Personally I like to think of the runqueue in this model more like a
“task-selection queue”).
By leaving blocked tasks on the runqueue, we allow pick_next_task() to
choose the task that should run next (even if it’s blocked waiting on a
mutex). If we do select a blocked task, we look at the task’s blocked_on
mutex and from there look at the mutex’s owner task. And in the simple
case, the task which owns the mutex is what we then choose to run,
allowing it to release the mutex.
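As a rough illustration of the chain walk described above, here is a minimal userspace sketch. It is not code from the patches: the toy_task and toy_mutex structures and toy_find_runnable_owner() are hypothetical names, and all locking, migration, and sleeping-owner handling is left out.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel structures; all names are hypothetical. */
struct toy_mutex;

struct toy_task {
	const char *name;
	struct toy_mutex *blocked_on;	/* mutex this task waits on, or NULL */
};

struct toy_mutex {
	struct toy_task *owner;		/* task holding the mutex, or NULL */
};

/*
 * Starting from the task the scheduler picked, follow
 * blocked_on -> owner links until we reach a task that is not blocked.
 * Returns NULL if a mutex in the chain has no owner (for example, it was
 * just released), in which case the caller would have to pick again.
 * Assumes the chain has no cycles.
 */
struct toy_task *toy_find_runnable_owner(struct toy_task *picked)
{
	struct toy_task *p = picked;

	while (p->blocked_on) {
		struct toy_task *owner = p->blocked_on->owner;

		if (!owner)
			return NULL;
		p = owner;
	}
	return p;
}
```

With tasks A blocked on a mutex owned by B, and B blocked on a mutex owned by C, picking A in this model ends up returning C as the task to execute.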
This means that instead of just tracking “curr”, the scheduler needs to
track both the scheduler context (what was picked and all the state used
for scheduling decisions), and the execution context (what we’re
running).
In this way, the mutex owner is run “on behalf” of the blocked task
that was picked to run, essentially inheriting the scheduler context of
the blocked task.
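The split between the two contexts can be sketched in a toy model as follows. This is only an illustration under stated assumptions: ctx_task, ctx_rq, and ctx_account() are made-up names, the vruntime update is unweighted for simplicity, and the way the actual patches divide accounting between the two tasks may differ.

```c
#include <stddef.h>

struct ctx_task {
	const char *name;
	unsigned long long vruntime;	/* scheduling state used for picking */
	unsigned long long exec_ns;	/* CPU time this task actually executed */
};

struct ctx_rq {
	struct ctx_task *selected;	/* scheduler context: what was picked */
	struct ctx_task *curr;		/* execution context: what is running */
};

/*
 * Charge a slice of CPU time. In this sketch the scheduling cost lands on
 * the picked (possibly blocked) task, while the raw execution time is
 * recorded against the task that really ran. When nothing is blocked,
 * selected and curr point at the same task and this collapses to the
 * usual single-task accounting.
 */
void ctx_account(struct ctx_rq *rq, unsigned long long delta_ns)
{
	rq->selected->vruntime += delta_ns;
	rq->curr->exec_ns += delta_ns;
}
```

The point of the model is only that two pointers are needed per runqueue: one for the task whose scheduling attributes won the pick, and one for the task whose code is on the CPU.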
As Connor outlined in a previous submission of this patch series, this
raises a number of complicated situations: The mutex owner might itself
be blocked on another mutex, or it could be sleeping, running on a
different CPU, in the process of migrating between CPUs, etc.
But the functionality provided by Proxy Execution is useful, as in
Android we have a number of cases where we are seeing priority inversion
(not unbounded, but longer than we’d like) between “foreground” and
“background” SCHED_NORMAL applications, so having a generalized solution
would be very useful.
New in v4:
----------
* Fixed deadlock that was caused by wait/wound mutexes having circular
blocked_on references by clearing the blocked_on pointer on the task
we are waking to wound/die.
* Tried to resolve an issue Dietmar raised with RT balancing where the
proxy migration and push_rt_task() were fighting ping-ponging tasks
back and forth, caused by push_rt_task() migrating tasks even if they
were in the chain that ends with the current running task. Though this
likely needs more work, as this change resulted in different migration
quirks (see below).
* Fixed a number of null-pointer traversals that the changes were
occasionally tripping on.
* Reworked patch that exposes __mutex_owner() to the scheduler to ensure
it doesn’t expose it any more than necessary, as suggested by Peter.
* To address some of Peter’s complaints, backed out the rq_curr()
wrapper, and reworked rq_selected() to be a macro to avoid needing
multiple accessors for READ_ONCE/rcu_dereference() type accesses.
* Removed verbose legacy comments from previous developers of the series
as Dietmar was finding them distracting when reviewing the diffs
(Though, to ensure I heed the warnings from previous experienced
travelers, I’ve preserved the comments/questions in a separate patch
in my own development tree).
* Dropped patch that added *_task_blocked_on() wrappers to check locking
correctness. Mostly as Peter didn’t seem happy with the wrappers in
other patches, but I still think it's useful for testing (and the
blocked_on locking still needs some work), so I’m carrying it in my
personal development tree.
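To illustrate the wait/wound deadlock fix from the first item in the list above, here is a toy model of a circular blocked_on chain and of how clearing the pointer on the task being woken breaks it. This is a sketch only: the ww_toy_* names are hypothetical, and the real code operates under the appropriate locks rather than on bare pointers.

```c
#include <stddef.h>
#include <stdbool.h>

struct ww_toy_mutex;

struct ww_toy_task {
	struct ww_toy_mutex *blocked_on;	/* mutex waited on, or NULL */
};

struct ww_toy_mutex {
	struct ww_toy_task *owner;		/* current holder, or NULL */
};

/* One hop along the chain: task -> blocked_on mutex -> owner task. */
static const struct ww_toy_task *ww_toy_next(const struct ww_toy_task *p)
{
	if (!p || !p->blocked_on)
		return NULL;
	return p->blocked_on->owner;
}

/*
 * Detect a circular chain with the classic two-pointer (tortoise and
 * hare) walk, so that the check itself cannot loop forever.
 */
bool ww_toy_chain_has_cycle(const struct ww_toy_task *start)
{
	const struct ww_toy_task *slow = start, *fast = start;

	for (;;) {
		fast = ww_toy_next(ww_toy_next(fast));
		slow = ww_toy_next(slow);
		if (!fast)
			return false;
		if (fast == slow)
			return true;
	}
}

/* Model of the fix: the task woken to wound/die stops being blocked. */
void ww_toy_wake_to_die(struct ww_toy_task *task)
{
	task->blocked_on = NULL;
}
```

In this model, two tasks that each wait on a mutex held by the other form a cycle, and a chain walk that expects to terminate at a runnable owner never would. Once one of them has its blocked_on pointer cleared, the walk terminates again.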
Issues still to address:
------------------------
* Occasionally getting null scheduler entities from pick_next_entity() in
CFS. I’m a little stumped as to where this is going awry just yet, and
delayed sending this out, but figured it was worth getting it out for
review on the other issues while I chase this down.
* Better deadlock handling in proxy(): With the ww_mutex issues
resolved, we shouldn’t see circular blocked_on references, but a
number of the bugs I’ve been chasing recently come from getting stuck
with proxy() returning null, forcing a reselection over and over. These
are still bugs to address, but my current thinking is that if we get
stuck like this, we can start to remove the selected mutex blocked
tasks from the rq, and let them be woken from the mutex waiters list
as is done currently? Thoughts here would be appreciated.
* More work on migration correctness (RT/DL load balancing, etc.). I’m
still seeing occasional trouble as cpu counts go up which seems to be
due to a bunch of tasks being proxy migrated to a cpu, then having to
migrate them all away at once (seeing lots of pick again iterations).
This may actually be correct, due to chain migration, but it ends up
looking similar to a deadlock.
* “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought of
a better name. Open to suggestions.
* As discussed at OSPM, I want to split pick_next_task() up into two
phases, selecting and setting the next task, as currently
pick_next_task() assumes the returned task will be run which results
in various side-effects in sched class logic when it’s run. This
causes trouble should proxy() require us to re-select a task due to
migration or other edge cases.
* CFS load balancing. Blocked tasks may carry forward load (PELT) to the
lock owner's CPU, so that CPU may look like it is overloaded.
* I still want to push down the split scheduler and execution context
awareness further through the scheduling code, as lots of logic still
assumes there’s only a single “rq->curr” task.
* Optimization to avoid migrating blocked tasks (allowing for optimistic
spinning) if the runnable lock-owner at the end of the blocked_on chain
is already running.
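The pick_next_task() split mentioned in the list above could look roughly like the following toy model. Everything here is an assumption made for illustration: the tp_* names are hypothetical, the priority comparison is a stand-in for real class logic, and the must_reselect flag merely simulates proxy() asking for another pick (for example after a migration).

```c
#include <stddef.h>
#include <stdbool.h>

struct tp_task {
	int prio;		/* higher value wins in this toy model */
	bool must_reselect;	/* toy flag: first use of this pick fails */
};

struct tp_rq {
	struct tp_task *tasks;
	size_t nr;
	struct tp_task *selected;	/* written only by the commit phase */
	unsigned int nr_commits;	/* counts side-effecting commits */
};

/* Phase 1: choose a candidate without modifying any runqueue state. */
struct tp_task *tp_select(const struct tp_rq *rq)
{
	struct tp_task *best = NULL;

	for (size_t i = 0; i < rq->nr; i++)
		if (!best || rq->tasks[i].prio > best->prio)
			best = &rq->tasks[i];
	return best;
}

/* Phase 2: commit to the choice; all side effects live here. */
void tp_set_next(struct tp_rq *rq, struct tp_task *t)
{
	rq->selected = t;
	rq->nr_commits++;
}

/* Re-select as often as needed, but commit exactly once. */
struct tp_task *tp_schedule(struct tp_rq *rq)
{
	struct tp_task *t;

	for (;;) {
		t = tp_select(rq);
		if (!t)
			return NULL;
		if (t->must_reselect) {
			/* Simulates proxy() having dealt with the obstacle. */
			t->must_reselect = false;
			continue;
		}
		break;
	}
	tp_set_next(rq, t);
	return t;
}
```

Because selection has no side effects in this model, a forced re-selection costs only the repeated pick and leaves no stale class state behind.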
Performance:
------------
This patch series switches mutexes to use handoff mode rather than
optimistic spinning. This is a potential concern where locks are under
high contention. However, so far in our initial performance analysis (on
both x86 and mobile devices) we’ve not seen major regressions. That
said, Chenyu did report a regression[3], which we’ll need to look
further into. As mentioned above, there may be some optimizations that
can help here, but my focus is on getting the code working well before I
spend time optimizing.
Review and feedback would be greatly appreciated!
If folks find it easier to test/tinker with, this patch series can also
be found here (along with some debug patches):
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v4-6.4-rc3
Thanks so much!
-john
[1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
Cc: Joel Fernandes <joelaf@...gle.com>
Cc: Qais Yousef <qyousef@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Juri Lelli <juri.lelli@...hat.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Valentin Schneider <vschneid@...hat.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Ben Segall <bsegall@...gle.com>
Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
Cc: Youssef Esmat <youssefesmat@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>
Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
Cc: Will Deacon <will@...nel.org>
Cc: Waiman Long <longman@...hat.com>
Cc: Boqun Feng <boqun.feng@...il.com>
Cc: "Paul E . McKenney" <paulmck@...nel.org>
Cc: kernel-team@...roid.com
Connor O'Brien (1):
sched: Attempt to fix rt/dl load balancing via chain level balance
John Stultz (3):
sched: Unnest ttwu_runnable in prep for proxy-execution
sched: Fix runtime accounting w/ proxy-execution
sched: Fixups to find_exec_ctx
Juri Lelli (2):
locking/mutex: make mutex::wait_lock irq safe
locking/mutex: Expose __mutex_owner()
Peter Zijlstra (6):
sched: Unify runtime accounting across classes
locking/ww_mutex: Remove wakeups from under mutex::wait_lock
locking/mutex: Rework task_struct::blocked_on
locking/mutex: Add task_struct::blocked_lock to serialize changes to
the blocked_on state
sched: Split scheduler execution context
sched: Add proxy execution
Valentin Schneider (1):
sched/rt: Fix proxy/current (push,pull)ability
include/linux/sched.h | 10 +-
include/linux/ww_mutex.h | 3 +
init/Kconfig | 7 +
init/init_task.c | 1 +
kernel/Kconfig.locks | 2 +-
kernel/fork.c | 6 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 113 ++++--
kernel/locking/mutex.h | 25 ++
kernel/locking/ww_mutex.h | 54 ++-
kernel/sched/core.c | 719 +++++++++++++++++++++++++++++++++--
kernel/sched/cpudeadline.c | 12 +-
kernel/sched/cpudeadline.h | 3 +-
kernel/sched/cpupri.c | 28 +-
kernel/sched/cpupri.h | 6 +-
kernel/sched/deadline.c | 187 +++++----
kernel/sched/fair.c | 99 +++--
kernel/sched/rt.c | 242 +++++++-----
kernel/sched/sched.h | 75 +++-
kernel/sched/stop_task.c | 13 +-
20 files changed, 1284 insertions(+), 330 deletions(-)
--
2.41.0.rc0.172.g3f132b7071-goog