Message-ID: <a8de5df6-665d-4c97-aff9-854ccc49adfc@amd.com>
Date: Fri, 27 Jun 2025 08:34:49 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>
Cc: Joel Fernandes <joelagnelf@...dia.com>, Qais Yousef
 <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Valentin Schneider <vschneid@...hat.com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>,
 Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>,
 Boqun Feng <boqun.feng@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>,
 Metin Kaya <Metin.Kaya@....com>, Xuewen Yan <xuewen.yan94@...il.com>,
 Thomas Gleixner <tglx@...utronix.de>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Suleiman Souhlal <suleiman@...gle.com>, kuyo chang
 <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, kernel-team@...roid.com
Subject: Re: [PATCH v18 0/8] Single RunQueue Proxy Execution (v18)

Hello John,

On 6/26/2025 2:00 AM, John Stultz wrote:
> Hey All,
> 
> After not getting much response from the v17 series (and
> resending it), I was going to continue to just iterate resending
> the v17 single runqueue focused series. However, Suleiman had a
> very good suggestion for improving the larger patch series and a
> few of the tweaks for those changes trickled back into the set
> I’m submitting here.
> 
> Unfortunately those later changes also uncovered some stability
> problems with the full proxy-exec patch series, which took a
> painfully long time (stress testing taking 30-60 hours to trip
> the problem) to resolve. However, after finally sorting those
> issues out it has been running well, so I can now send out the
> next revision (v18) of the set.
> 
> So here is v18 of the Proxy Execution series, a generalized form
> of priority inheritance.

Sorry for the lack of response on the previous version, but here
are the test results for v18.

tl;dr I don't see anything major. The few regressions I see are
for data points with a lot of deviation, so I think they can be
safely ignored.

Full results are below:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:	    tip:sched/urgent at commit 914873bc7df9 ("Merge tag
             'x86-build-2025-05-25' of
             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

proxy_exec: tip + this series as is with CONFIG_SCHED_PROXY_EXEC=y
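In case anyone wants to reproduce, the proxy_exec kernel only differs
in the one new config option; assuming the usual in-tree helper, the
setup is something like (paths and make targets here are assumptions,
not the exact commands used):

```shell
# Hypothetical sketch: enable the new option on top of the tip
# config and regenerate the .config before building.
./scripts/config --enable CONFIG_SCHED_PROXY_EXEC
make olddefconfig
```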

o Benchmark results

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)      proxy_exec[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     1.03 [ -3.20]( 8.80)
      2-groups     1.00 [ -0.00]( 9.58)     1.04 [ -4.45]( 6.58)
      4-groups     1.00 [ -0.00]( 2.10)     1.02 [ -2.17]( 1.85)
      8-groups     1.00 [ -0.00]( 1.51)     0.99 [  1.42]( 1.47)
     16-groups     1.00 [ -0.00]( 1.10)     1.00 [  0.42]( 1.23)
     
     
     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:    tip[pct imp](CV)      proxy_exec[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     1.02 [  1.78]( 1.06)
         2     1.00 [  0.00]( 1.13)     1.03 [  3.30]( 1.05)
         4     1.00 [  0.00]( 1.12)     1.02 [  1.86]( 1.05)
         8     1.00 [  0.00]( 0.93)     1.02 [  1.74]( 0.72)
        16     1.00 [  0.00]( 0.38)     1.02 [  2.28]( 1.35)
        32     1.00 [  0.00]( 0.66)     1.01 [  1.44]( 0.85)
        64     1.00 [  0.00]( 1.18)     1.02 [  1.98]( 1.28)
       128     1.00 [  0.00]( 1.12)     1.00 [  0.31]( 0.89)
       256     1.00 [  0.00]( 0.42)     1.00 [ -0.49]( 0.91)
       512     1.00 [  0.00]( 0.14)     1.01 [  0.94]( 0.33)
      1024     1.00 [  0.00]( 0.26)     1.01 [  0.95]( 0.24)
     
     
     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)      proxy_exec[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     0.98 [ -2.35]( 8.36)
     Scale     1.00 [  0.00]( 2.85)     0.93 [ -7.21]( 7.24)
       Add     1.00 [  0.00]( 3.39)     0.93 [ -7.50]( 6.56)
     Triad     1.00 [  0.00]( 6.39)     1.04 [  4.18]( 7.77)
     
     
     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)      proxy_exec[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     1.02 [  2.00]( 2.92)
     Scale     1.00 [  0.00]( 4.34)     0.99 [ -0.58]( 3.88)
       Add     1.00 [  0.00]( 4.14)     1.02 [  1.96]( 1.71)
     Triad     1.00 [  0.00]( 1.00)     0.99 [ -0.50]( 2.43)
     
     
     ==================================================================
     Test          : netperf
     Units         : Normalized Throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:         tip[pct imp](CV)      proxy_exec[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     1.02 [  2.40]( 0.32)
      2-clients     1.00 [  0.00]( 0.58)     1.02 [  2.21]( 0.30)
      4-clients     1.00 [  0.00]( 0.35)     1.02 [  2.20]( 0.63)
      8-clients     1.00 [  0.00]( 0.48)     1.02 [  1.98]( 0.50)
     16-clients     1.00 [  0.00]( 0.66)     1.02 [  2.19]( 0.49)
     32-clients     1.00 [  0.00]( 1.15)     1.02 [  2.17]( 0.75)
     64-clients     1.00 [  0.00]( 1.38)     1.01 [  1.43]( 1.39)
     128-clients    1.00 [  0.00]( 0.87)     1.01 [  0.60]( 1.09)
     256-clients    1.00 [  0.00]( 5.36)     1.01 [  0.54]( 4.29)
     512-clients    1.00 [  0.00](54.39)     0.99 [ -0.61](52.23)
     
     
     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     0.76 [ 23.91](23.47)
       2     1.00 [ -0.00]( 1.15)     0.90 [ 10.00]( 8.11)
       4     1.00 [ -0.00](13.46)     1.10 [-10.42](10.94)
       8     1.00 [ -0.00]( 7.14)     0.89 [ 10.53]( 3.92)
      16     1.00 [ -0.00]( 3.49)     1.00 [ -0.00]( 8.93)
      32     1.00 [ -0.00]( 1.06)     0.96 [  4.26](10.99)
      64     1.00 [ -0.00]( 5.48)     1.08 [ -8.14]( 4.03)
     128     1.00 [ -0.00](10.45)     1.09 [ -8.64](13.37)
     256     1.00 [ -0.00](31.14)     1.12 [-11.66](16.77)
     512     1.00 [ -0.00]( 1.52)     0.98 [  2.02]( 1.50)
     
     
     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     1.00 [ -0.29]( 0.53)
       2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.15)
       4     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.30)
       8     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.00)
      16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
      32     1.00 [  0.00]( 3.41)     1.03 [  3.50]( 0.27)
      64     1.00 [  0.00]( 1.05)     1.00 [ -0.38]( 4.45)
     128     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.19)
     256     1.00 [  0.00]( 0.72)     0.99 [ -0.61]( 0.63)
     512     1.00 [  0.00]( 0.57)     1.00 [ -0.24]( 0.33)
     
     
     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.81 [ 18.75](10.25)
       2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.08)
       4     1.00 [ -0.00]( 3.78)     1.29 [-28.57](17.25)
       8     1.00 [ -0.00]( 0.00)     1.17 [-16.67]( 3.60)
      16     1.00 [ -0.00]( 7.56)     1.00 [ -0.00]( 6.88)
      32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 0.00)
      64     1.00 [ -0.00]( 9.63)     0.95 [  5.00]( 7.32)
     128     1.00 [ -0.00]( 4.86)     0.96 [  3.52]( 8.69)
     256     1.00 [ -0.00]( 2.34)     0.95 [  4.70]( 2.78)
     512     1.00 [ -0.00]( 0.40)     0.99 [  0.77]( 0.20)
     
     
     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)      proxy_exec[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     1.02 [ -1.82]( 3.15)
       2     1.00 [ -0.00]( 0.87)     1.02 [ -2.16]( 1.90)
       4     1.00 [ -0.00]( 1.21)     1.04 [ -3.77]( 2.76)
       8     1.00 [ -0.00]( 0.27)     1.01 [ -1.31]( 2.01)
      16     1.00 [ -0.00]( 4.04)     1.00 [  0.27]( 0.77)
      32     1.00 [ -0.00]( 7.35)     0.89 [ 11.07]( 1.68)
      64     1.00 [ -0.00]( 3.54)     1.02 [ -1.55]( 1.47)
     128     1.00 [ -0.00]( 0.37)     1.00 [  0.41]( 0.11)
     256     1.00 [ -0.00]( 9.57)     0.91 [  8.84]( 3.64)
     512     1.00 [ -0.00]( 1.82)     1.02 [ -1.93]( 1.21)


     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                  %diff
     ycsb-cassandra               0.82%
     ycsb-mongodb                -0.45%
     deathstarbench-1x            2.44%
     deathstarbench-2x            1.88%
     deathstarbench-3x            0.09%
     deathstarbench-6x            1.94%
     hammerdb+mysql 16VU          3.65%
     hammerdb+mysql 64VU         -0.59%


> 
> As I’m trying to submit this work in smallish digestible pieces,
> in this series, I’m only submitting for review the logic that
> allows us to do the proxying if the lock owner is on the same
> runqueue as the blocked waiter: Introducing the
> CONFIG_SCHED_PROXY_EXEC option and boot-argument, reworking the
> task_struct::blocked_on pointer and wrapper functions, the
> initial sketch of the find_proxy_task() logic, some fixes for
> using split contexts, and finally same-runqueue proxying.
> 
> As I mentioned above, for the series I’m submitting here, it has
> only barely changed from v17. With the main difference being
> slightly different order of checks for cases where we don’t
> actually do anything yet (more on why below), and use of
> READ_ONCE for the on_rq reads to avoid the compiler fusing
> loads, which I was bitten by with the full series.

For this series (Single RunQueue Proxy), feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@....com>

I'll go and test the full series next and reply with the
results on this same thread sometime next week. Meanwhile I'll
try to queue a longer locktorture run over the weekend. I'll
let you know if I see anything out of the ordinary on my setup.

-- 
Thanks and Regards,
Prateek

