Message-ID: <c12889cf-d57e-90be-54a5-0329183817fb@amd.com>
Date:   Mon, 12 Jun 2023 22:51:17 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     John Stultz <jstultz@...gle.com>,
        LKML <linux-kernel@...r.kernel.org>
Cc:     Joel Fernandes <joelaf@...gle.com>,
        Qais Yousef <qyousef@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Zimuzo Ezeozue <zezeozue@...gle.com>,
        Youssef Esmat <youssefesmat@...gle.com>,
        Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>,
        Boqun Feng <boqun.feng@...il.com>,
        "Paul E . McKenney" <paulmck@...nel.org>, kernel-team@...roid.com
Subject: Re: [PATCH v4 00/13] Generalized Priority Inheritance via Proxy
 Execution v3

Hello John,

On 6/1/2023 11:28 AM, John Stultz wrote:
> After having to catch up on other work after OSPM[1], I've finally
> gotten back to focusing on Proxy Execution and wanted to send out this
> next iteration of the patch series for review, testing, and feedback.
> (Many thanks to folks who provided feedback on the last revision!)
> 
> As mentioned previously, this Proxy Execution series has a long history:
> First described in a paper[2] by Watkins, Straub, Niehaus, then in
> patches from Peter Zijlstra, extended with lots of work by Juri Lelli,
> Valentin Schneider, and Connor O'Brien. (And thank you to Steven Rostedt
> for providing additional details here!)
> 
> So again, many thanks to those above, as all the credit for this series
> really is due to them - while the mistakes are likely mine.
> 
> Overview:
> ----------
> Proxy Execution is a generalized form of priority inheritance. Classic
> priority inheritance works well for real-time tasks where there is a
> straightforward priority order to how things are run. But it breaks
> down when used between CFS or DEADLINE tasks, as there are lots
> of parameters involved outside of just the task’s nice value when
> selecting the next task to run (via pick_next_task()).  So ideally we
> want to imbue the mutex holder with all the scheduler attributes of 
> the blocked waiting task.
> 
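(As a side note for anyone new to the series: the tiny sketch below is only
meant to illustrate why a single boosted priority value is not enough outside
the RT classes. The helpers are made up for illustration, though prio and
vruntime are the real fields the respective classes compare.)

    /*
     * Illustrative sketch only, not patch code: SCHED_FIFO/RR picks are
     * decided by a single priority value, so boosting the lock owner's
     * prio is enough there. A CFS pick depends on per-entity vruntime
     * (which folds in nice/weight history), so there is no single value
     * to copy onto the owner -- hence inheriting the blocked waiter's
     * whole scheduler context instead.
     */
    static bool rt_runs_first(struct task_struct *a, struct task_struct *b)
    {
            return a->prio < b->prio;       /* lower value == higher priority */
    }

    static bool cfs_runs_first(struct sched_entity *a, struct sched_entity *b)
    {
            return (s64)(a->vruntime - b->vruntime) < 0; /* earliest vruntime wins */
    }
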
> Proxy Execution does this via a few changes:
> * Keeping tasks that are blocked on a mutex *on* the runqueue
> * Keeping additional tracking of which mutex a task is blocked on, and
>   which task holds a specific mutex.
> * Special handling for when we select a blocked task to run, so that we
>   instead run the mutex holder. 
> 
> The first of these is the most difficult to grasp (I do get the mental
> friction here: blocked tasks on the *run*queue sounds like nonsense!
> Personally I like to think of the runqueue in this model more like a
> “task-selection queue”).
> 
> By leaving blocked tasks on the runqueue, we allow pick_next_task() to
> choose the task that should run next (even if it’s blocked waiting on a
> mutex). If we do select a blocked task, we look at the task’s blocked_on
> mutex and from there look at the mutex’s owner task. And in the simple
> case, the task which owns the mutex is what we then choose to run,
> allowing it to release the mutex.
> 
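(For my own understanding, here is a very rough sketch of the simple,
same-CPU case described above. find_runnable_owner() and its exact shape are
my own simplification, not the series' actual proxy() code, which also has to
deal with owners that are sleeping, on other CPUs, or mid-migration:)

    /*
     * Simplified illustration only -- not the series' proxy() code.
     * Assumes everything stays on one CPU and the relevant locks are
     * already held.
     */
    static struct task_struct *find_runnable_owner(struct task_struct *selected)
    {
            struct task_struct *p = selected;

            /* Follow the chain: picked task -> blocked_on mutex -> owner. */
            while (p->blocked_on) {
                    struct mutex *m = p->blocked_on;

                    p = __mutex_owner(m);
                    if (!p)
                            return NULL;    /* owner released it; pick again */
            }

            /*
             * 'p' (the lock owner) is run on behalf of 'selected', using
             * selected's scheduler context for accounting and preemption
             * decisions.
             */
            return p;
    }
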
> This means that instead of just tracking “curr”, the scheduler needs to
> track both the scheduler context (what was picked and all the state used
> for scheduling decisions), and the execution context (what we’re
> running).
> 
> In this way, the mutex owner is run “on behalf” of the blocked task
> that was picked to run, essentially inheriting the scheduler context of
> the blocked task.
> 
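(And a conceptual picture of the split, again just to check my understanding.
The series uses an rq_selected() macro for the scheduler context, but the
field name below is a placeholder, not the actual struct rq layout:)

    /*
     * Conceptual only -- field names are placeholders, not the actual
     * struct rq layout in the series.
     */
    struct rq {
            struct task_struct *curr;       /* execution context: what runs on the CPU      */
            struct task_struct *curr_sel;   /* scheduler context: what pick_next_task() chose */
            /* ... */
    };

    /*
     * Scheduler-side decisions (accounting, preemption checks, etc.)
     * consult the selected task; the CPU itself executes curr.
     */
    #define rq_selected(rq) ((rq)->curr_sel)
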
> As Connor outlined in a previous submission of this patch series,  this
> raises a number of complicated situations:  The mutex owner might itself
> be blocked on another mutex, or it could be sleeping, running on a
> different CPU, in the process of migrating between CPUs, etc.
> 
> But the functionality provided by Proxy Execution is useful, as in
> Android we have a number of cases where we are seeing priority inversion
> (not unbounded, but longer than we’d like) between “foreground” and
> “background” SCHED_NORMAL applications, so having a generalized solution
> would be very useful.
> 
> New in v4:
> ----------
> * Fixed deadlock that was caused by wait/wound mutexes having circular
>   blocked_on references by clearing the blocked_on pointer on the task
>   we are waking to wound/die. 
> 
> * Tried to resolve an issue Dietmar raised with RT balancing where the
>   proxy migration and push_rt_task() were fighting, ping-ponging tasks
>   back and forth, caused by push_rt_task() migrating tasks even if they
>   were in the chain that ends with the current running task. Though this
>   likely needs more work, as this change resulted in different migration
>   quirks (see below).
> 
> * Fixed a number of null-pointer traversals that the changes were
>   occasionally tripping on.
> 
> * Reworked patch that exposes __mutex_owner() to the scheduler to ensure
>   it doesn’t expose it any more than necessary, as suggested by Peter. 
> 
> * To address some of Peter’s complaints, backed out the rq_curr()
>   wrapper, and reworked rq_selected() to be a macro to avoid needing
>   multiple accessors for READ_ONCE/rcu_dereference() type accesses. 
> 
> * Removed verbose legacy comments from previous developers of the series
>   as Dietmar was finding them distracting when reviewing the diffs
>   (Though, to ensure I heed the warnings from previous experienced
>   travelers, I’ve preserved the comments/questions in a separate patch
>   in my own development tree).
> 
> * Dropped patch that added *_task_blocked_on() wrappers to check locking
>   correctness. Mostly as Peter didn’t seem happy with the wrappers in
>   other patches, but I still think it's useful for testing (and the
>   blocked_on locking still needs some work), so I’m carrying it in my
>   personal development tree.
> 
> Issues still to address:
> ----------
> * Occasionally getting null scheduler entities from pick_next_entity() in
>   CFS. I’m a little stumped as to where this is going awry just yet, and
>   delayed sending this out, but figured it was worth getting it out for
>   review on the other issues while I chase this down.

I'm consistently hitting the above issue early during boot when trying to
test the series on a dual socket 3rd Generation EPYC platform
(2 x 64C/128T). Sharing the trace below:

    [   20.821808] ------------[ cut here ]------------
    [   20.826432] kernel BUG at kernel/sched/core.c:7462!
    [   20.831322] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    [   20.836545] CPU: 250 PID: 0 Comm: swapper/250 Not tainted 6.4.0-rc4-proxy-execution-v4+ #474
    [   20.844976] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [   20.852544] RIP: 0010:__schedule+0x18b6/0x20a0
    [   20.856998] Code: 0f 85 51 04 00 00 83 ad 50 ff ff ff 01 0f 85 05 e9 ff ff f3 0f 1e fa 48 8b 35 0e 0c fe 00 48 c7 c7 33 a1 c1 85 e8 ca 37 23 ff <0f> 0b 4c 89 ff 4c 8b 6d 98 e8 1c 82 00 00 4c 89 f7 e8 14 82 00 00
    [   20.875744] RSP: 0018:ffffbd1e4d1d7dd0 EFLAGS: 00010082
    [   20.880970] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000005
    [   20.888104] RDX: ffff9d4d0006b000 RSI: 0000000000000200 RDI: ffff9d4d0004d400
    [   20.895235] RBP: ffffbd1e4d1d7e98 R08: 0000000000000024 R09: ffffffffff7edbd0
    [   20.902369] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d4d12e25a20
    [   20.909501] R13: ffff9dcbffab3840 R14: ffffbd1e4d1d7e50 R15: ffff9dcbff2b3840
    [   20.916632] FS:  0000000000000000(0000) GS:ffff9dcbffa80000(0000) knlGS:0000000000000000
    [   20.924709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   20.930449] CR2: 00007f92120c4800 CR3: 000000011477a002 CR4: 0000000000770ee0
    [   20.937581] PKRU: 55555554
    [   20.940292] Call Trace:
    [   20.942741]  <TASK>
    [   20.944845]  ? show_regs+0x6e/0x80
    [   20.948259]  ? die+0x3c/0xa0
    [   20.951146]  ? do_trap+0xd4/0xf0
    [   20.954377]  ? do_error_trap+0x75/0xa0
    [   20.958129]  ? __schedule+0x18b6/0x20a0
    [   20.961971]  ? exc_invalid_op+0x57/0x80
    [   20.965808]  ? __schedule+0x18b6/0x20a0
    [   20.969648]  ? asm_exc_invalid_op+0x1f/0x30
    [   20.973835]  ? __schedule+0x18b6/0x20a0
    [   20.977672]  ? cpuidle_enter_state+0xde/0x710
    [   20.982033]  schedule_idle+0x2e/0x50
    [   20.985614]  do_idle+0x15d/0x240
    [   20.988845]  cpu_startup_entry+0x24/0x30
    [   20.992772]  start_secondary+0x126/0x160
    [   20.996695]  secondary_startup_64_no_verify+0x10b/0x10b
    [   21.001924]  </TASK>
    [   21.004117] Modules linked in: sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore 
    ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 
    i2c_algo_bit drm_shmem_helper drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect aesni_intel sysimgblt crypto_simd crc32_pclmul cryptd crct10dif_pclmul sha512_ssse3 xhci_pci tg3 drm 
    xhci_pci_renesas megaraid_sas wmi
    [   21.055707] Dumping ftrace buffer:
    [   21.059291] ---------------------------------
    [   21.063697]   <idle>-0       250dn.2. 21175635us : __schedule: JDB: BUG!!! pick next retry_count > 50
    [   21.072915] ---------------------------------
    [   21.077282] ---[ end trace 0000000000000000 ]---

    $ sed -n 7460,7462p kernel/sched/core.c
      if (retry_count++ > 50) {
              trace_printk("JDB: BUG!!! pick next retry_count > 50\n");
              BUG();

Hope this helps with debugging. If you have a fix in mind that you
would like me to test, please do let me know.

> 
> * Better deadlock handling in proxy(): With the ww_mutex issues
>   resolved, we shouldn’t see circular blocked_on references, but a
>   number of the bugs I’ve been chasing recently come from getting stuck
>   with proxy() returning null forcing a reselection over and over. These
>   are still bugs to address, but my current thinking is that if we get
>   stuck like this, we can start to remove the selected mutex-blocked
>   tasks from the rq, and let them be woken from the mutex waiters list
>   as is done currently? Thoughts here would be appreciated.
> 
> * More work on migration correctness (RT/DL load balancing, etc.). I’m
>   still seeing occasional trouble as cpu counts go up which seems to be
>   due to a bunch of tasks being proxy migrated to a cpu, then having to
>   migrate them all away at once (seeing lots of pick again iterations).
>   This may actually be correct, due to chain migration, but it ends up
>   looking similar to a deadlock.
> 
> * “rq_selected()” naming. Peter doesn’t like it, but I’ve not thought of
>   a better name. Open to suggestions.
> 
> * As discussed at OSPM, I want to split pick_next_task() up into two
>   phases: selecting and setting the next task, as currently
>   pick_next_task() assumes the returned task will be run which results
>   in various side-effects in sched class logic when it’s run. This
>   causes trouble should proxy() require us to re-select a task due to
>   migration or other edge cases.
> 
> * CFS load balancing. Blocked tasks may carry forward load (PELT) to the
>   lock owner's CPU, so the CPU may look like it is overloaded.
> 
> * I still want to push down the split scheduler and execution context
>   awareness further through the scheduling code, as lots of logic still
>   assumes there’s only a single “rq->curr” task.
> 
> * Optimization to avoid migrating blocked tasks (allowing for optimistic
>   spinning) if the runnable lock-owner at the end of the blocked_on chain
>   is already running.  
> 
> 
> Performance:
> ----------
> This patch series switches mutexes to use handoff mode rather than
> optimistic spinning. This is a potential concern where locks are under
> high contention. However, so far in our initial performance analysis (on
> both x86 and mobile devices) we’ve not seen major regressions. That
> said, Chenyu did report a regression[3], which we’ll need to look
> further into. As mentioned above, there may be some optimizations that
> can help here, but my focus is on getting the code working well before I
> spend time optimizing.
> 
> Review and feedback would be greatly appreciated!
> 
> If folks find it easier to test/tinker with, this patch series can also
> be found here (along with some debug patches):
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v4-6.4-rc3

I'm using the same tree with the HEAD at commit 821c8a48233f ("HACK:
sched: Add BUG_ONs for proxy related loops")

> 
> Thanks so much!
> -john
> 
> [1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
> [2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
> [3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/
> 
> Cc: Joel Fernandes <joelaf@...gle.com>
> Cc: Qais Yousef <qyousef@...gle.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
> Cc: Youssef Esmat <youssefesmat@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Will Deacon <will@...nel.org>
> Cc: Waiman Long <longman@...hat.com>
> Cc: Boqun Feng <boqun.feng@...il.com>
> Cc: "Paul E . McKenney" <paulmck@...nel.org>
> Cc: kernel-team@...roid.com
> 
> Connor O'Brien (1):
>   sched: Attempt to fix rt/dl load balancing via chain level balance
> 
> John Stultz (3):
>   sched: Unnest ttwu_runnable in prep for proxy-execution
>   sched: Fix runtime accounting w/ proxy-execution
>   sched: Fixups to find_exec_ctx
> 
> Juri Lelli (2):
>   locking/mutex: make mutex::wait_lock irq safe
>   locking/mutex: Expose __mutex_owner()
> 
> Peter Zijlstra (6):
>   sched: Unify runtime accounting across classes
>   locking/ww_mutex: Remove wakeups from under mutex::wait_lock
>   locking/mutex: Rework task_struct::blocked_on
>   locking/mutex: Add task_struct::blocked_lock to serialize changes to
>     the blocked_on state
>   sched: Split scheduler execution context
>   sched: Add proxy execution
> 
> Valentin Schneider (1):
>   sched/rt: Fix proxy/current (push,pull)ability
> 
>  include/linux/sched.h        |  10 +-
>  include/linux/ww_mutex.h     |   3 +
>  init/Kconfig                 |   7 +
>  init/init_task.c             |   1 +
>  kernel/Kconfig.locks         |   2 +-
>  kernel/fork.c                |   6 +-
>  kernel/locking/mutex-debug.c |   9 +-
>  kernel/locking/mutex.c       | 113 ++++--
>  kernel/locking/mutex.h       |  25 ++
>  kernel/locking/ww_mutex.h    |  54 ++-
>  kernel/sched/core.c          | 719 +++++++++++++++++++++++++++++++++--
>  kernel/sched/cpudeadline.c   |  12 +-
>  kernel/sched/cpudeadline.h   |   3 +-
>  kernel/sched/cpupri.c        |  28 +-
>  kernel/sched/cpupri.h        |   6 +-
>  kernel/sched/deadline.c      | 187 +++++----
>  kernel/sched/fair.c          |  99 +++--
>  kernel/sched/rt.c            | 242 +++++++-----
>  kernel/sched/sched.h         |  75 +++-
>  kernel/sched/stop_task.c     |  13 +-
>  20 files changed, 1284 insertions(+), 330 deletions(-)
> 

--
Thanks and Regards,
Prateek
