linux-kernel - Re: [PATCH rcu 0/11] Add light-weight readers for SRCU

Message-ID: <d9366504-d3d7-43e3-96c5-8053129c0794@paulmck-laptop>
Date: Tue, 3 Sep 2024 15:20:49 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: rcu@...r.kernel.org, linux-kernel@...r.kernel.org, kernel-team@...a.com,
	rostedt@...dmis.org, Alexei Starovoitov <ast@...nel.org>,
	Andrii Nakryiko <andrii@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Kent Overstreet <kent.overstreet@...ux.dev>, bpf@...r.kernel.org
Subject: Re: [PATCH rcu 0/11] Add light-weight readers for SRCU

On Tue, Sep 03, 2024 at 03:08:21PM -0700, Andrii Nakryiko wrote:
> On Tue, Sep 3, 2024 at 9:32 AM Paul E. McKenney <paulmck@...nel.org> wrote:
> >
> > Hello!
> >
> > This series provides light-weight readers for SRCU.  This lightness
> > is selected by the caller by using the new srcu_read_lock_lite() and
> > srcu_read_unlock_lite() flavors instead of the usual srcu_read_lock() and
> > srcu_read_unlock() flavors.  Although this passes significant rcutorture
> > testing, this should still be considered to be experimental.
> >
> > There are a few restrictions:  (1) If srcu_read_lock_lite() is called
> > on a given srcu_struct structure, then no other flavor may be used on
> > that srcu_struct structure, before, during, or after.  (2) The _lite()
> > readers may only be invoked from regions of code where RCU is watching
> > (as in those regions in which rcu_is_watching() returns true).  (3)
> > There is no auto-expediting for srcu_struct structures that have
> > been passed to _lite() readers.  (4) SRCU grace periods for _lite()
> > srcu_struct structures invoke synchronize_rcu() at least twice, thus
> > having longer latencies than their non-_lite() counterparts.  (5) Even
> > with synchronize_srcu_expedited(), the resulting SRCU grace period
> > will invoke synchronize_rcu() at least twice, as opposed to invoking
> > the IPI-happy synchronize_rcu_expedited() function.  (6)  Just as with
> > srcu_read_lock() and srcu_read_unlock(), the srcu_read_lock_lite() and
> > srcu_read_unlock_lite() functions may not (repeat, *not*) be invoked
> > from NMI handlers (that is what the _nmisafe() interfaces are for).
> > Although one could imagine readers that were both _lite() and _nmisafe(),
> > one might also imagine that the read-modify-write atomic operations that
> > are needed by any NMI-safe SRCU read marker would make this unhelpful
> > from a performance perspective.
> >
> > All that said, the patches in this series are as follows:
> >
> > 1.      Rename srcu_might_be_idle() to srcu_should_expedite().
> >
> > 2.      Introduce srcu_gp_is_expedited() helper function.
> >
> > 3.      Renaming in preparation for additional reader flavor.
> >
> > 4.      Bit manipulation changes for additional reader flavor.
> >
> > 5.      Standardize srcu_data pointers to "sdp" and similar.
> >
> > 6.      Convert srcu_data ->srcu_reader_flavor to bit field.
> >
> > 7.      Add srcu_read_lock_lite() and srcu_read_unlock_lite().
> >
> > 8.      rcutorture: Expand RCUTORTURE_RDR_MASK_[12] to eight bits.
> >
> > 9.      rcutorture: Add reader_flavor parameter for SRCU readers.
> >
> > 10.     rcutorture: Add srcu_read_lock_lite() support to
> >         rcutorture.reader_flavor.
> >
> > 11.     refscale: Add srcu_read_lock_lite() support using "srcu-lite".
> >
> >                                                 Thanx, Paul
> >
> 
> Thanks Paul for working on this!
> 
> I applied your patches on top of all my uprobe changes (including the
> RFC patches that remove locks, optimize VMA to inode resolution, etc,
> etc; basically the fastest uprobe/uretprobe state I can get to). And
> then tested a few changes:
> 
>   - A) baseline (no SRCU-lite, RCU Tasks Trace for uprobe, normal SRCU
> for uretprobes)
>   - B) A + SRCU-lite for uretprobes (i.e., SRCU to SRCU-lite conversion)
>   - C) B + RCU Tasks Trace converted to SRCU-lite
>   - D) I also pessimized baseline by reverting RCU Tasks Trace, so
> both uprobes and uretprobes are SRCU protected. This allowed me to see
> a pure gain of SRCU-lite over SRCU for uprobes, taking RCU Tasks Trace
> performance out of the equation.
> 
> For uprobes I used basically two benchmarks. One, uprobe-nop,
> benchmarks entry uprobes (which are the fastest, most optimized case,
> using RCU Tasks Trace in A and SRCU in D). The other, uretprobe-nop,
> benchmarks return uprobes (uretprobes), which use normal SRCU in both
> A) and D). The uretprobe-nop benchmark basically combines entry and
> return probe overheads, because that's how uretprobes work.
> 
> So, below are the most meaningful comparisons. First, SRCU vs
> SRCU-lite for uretprobes:
> 
> BASELINE (A)
> ============
> uretprobe-nop         ( 1 cpus):    1.941 ± 0.002M/s  (  1.941M/s/cpu)
> uretprobe-nop         ( 2 cpus):    3.731 ± 0.001M/s  (  1.866M/s/cpu)
> uretprobe-nop         ( 3 cpus):    5.492 ± 0.002M/s  (  1.831M/s/cpu)
> uretprobe-nop         ( 4 cpus):    7.234 ± 0.003M/s  (  1.808M/s/cpu)
> uretprobe-nop         ( 8 cpus):   13.448 ± 0.098M/s  (  1.681M/s/cpu)
> uretprobe-nop         (16 cpus):   22.905 ± 0.009M/s  (  1.432M/s/cpu)
> uretprobe-nop         (32 cpus):   44.760 ± 0.069M/s  (  1.399M/s/cpu)
> uretprobe-nop         (40 cpus):   52.986 ± 0.104M/s  (  1.325M/s/cpu)
> uretprobe-nop         (64 cpus):   43.650 ± 0.435M/s  (  0.682M/s/cpu)
> uretprobe-nop         (80 cpus):   46.831 ± 0.938M/s  (  0.585M/s/cpu)
> 
> SRCU-lite for uretprobe (B)
> ===========================
> uretprobe-nop         ( 1 cpus):    2.014 ± 0.014M/s  (  2.014M/s/cpu)
> uretprobe-nop         ( 2 cpus):    3.820 ± 0.002M/s  (  1.910M/s/cpu)
> uretprobe-nop         ( 3 cpus):    5.640 ± 0.003M/s  (  1.880M/s/cpu)
> uretprobe-nop         ( 4 cpus):    7.410 ± 0.003M/s  (  1.852M/s/cpu)
> uretprobe-nop         ( 8 cpus):   13.877 ± 0.009M/s  (  1.735M/s/cpu)
> uretprobe-nop         (16 cpus):   23.372 ± 0.022M/s  (  1.461M/s/cpu)
> uretprobe-nop         (32 cpus):   45.748 ± 0.048M/s  (  1.430M/s/cpu)
> uretprobe-nop         (40 cpus):   54.327 ± 0.093M/s  (  1.358M/s/cpu)
> uretprobe-nop         (64 cpus):   43.672 ± 0.371M/s  (  0.682M/s/cpu)
> uretprobe-nop         (80 cpus):   47.470 ± 0.753M/s  (  0.593M/s/cpu)
> 
> You can see that across the board (except for the noisy 64-CPU case)
> SRCU-lite is faster.
> 
> 
> Now, comparing A) vs C) on uprobe-nop, so we can see RCU Tasks Trace
> vs SRCU-lite for uprobes.
> 
> BASELINE (A)
> ============
> uprobe-nop            ( 1 cpus):    3.574 ± 0.004M/s  (  3.574M/s/cpu)
> uprobe-nop            ( 2 cpus):    6.735 ± 0.006M/s  (  3.368M/s/cpu)
> uprobe-nop            ( 3 cpus):   10.102 ± 0.005M/s  (  3.367M/s/cpu)
> uprobe-nop            ( 4 cpus):   13.087 ± 0.008M/s  (  3.272M/s/cpu)
> uprobe-nop            ( 8 cpus):   24.622 ± 0.031M/s  (  3.078M/s/cpu)
> uprobe-nop            (16 cpus):   41.752 ± 0.020M/s  (  2.610M/s/cpu)
> uprobe-nop            (32 cpus):   84.973 ± 0.115M/s  (  2.655M/s/cpu)
> uprobe-nop            (40 cpus):  102.229 ± 0.030M/s  (  2.556M/s/cpu)
> uprobe-nop            (64 cpus):  125.537 ± 0.045M/s  (  1.962M/s/cpu)
> uprobe-nop            (80 cpus):  143.091 ± 0.044M/s  (  1.789M/s/cpu)
> 
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop            ( 1 cpus):    3.446 ± 0.010M/s  (  3.446M/s/cpu)
> uprobe-nop            ( 2 cpus):    6.411 ± 0.003M/s  (  3.206M/s/cpu)
> uprobe-nop            ( 3 cpus):    9.563 ± 0.039M/s  (  3.188M/s/cpu)
> uprobe-nop            ( 4 cpus):   12.454 ± 0.016M/s  (  3.113M/s/cpu)
> uprobe-nop            ( 8 cpus):   23.172 ± 0.013M/s  (  2.897M/s/cpu)
> uprobe-nop            (16 cpus):   39.793 ± 0.005M/s  (  2.487M/s/cpu)
> uprobe-nop            (32 cpus):   79.616 ± 0.207M/s  (  2.488M/s/cpu)
> uprobe-nop            (40 cpus):   96.851 ± 0.128M/s  (  2.421M/s/cpu)
> uprobe-nop            (64 cpus):  119.432 ± 0.146M/s  (  1.866M/s/cpu)
> uprobe-nop            (80 cpus):  135.162 ± 0.207M/s  (  1.690M/s/cpu)
> 
> 
> Overall, RCU Tasks Trace beats SRCU-lite, which I think is expected,
> so consider this just a confirmation. I'm not sure I'd like to switch
> from RCU Tasks Trace to SRCU-lite for uprobes part, but at least we
> have numbers to make that decision.
> 
> Finally, here is SRCU vs SRCU-lite for entry uprobes (i.e., as if we
> never had RCU Tasks Trace). I've included a more extensive set of CPU
> counts for completeness.
> 
> BASELINE w/ SRCU for uprobes (D)
> ================================
> uprobe-nop            ( 1 cpus):    3.413 ± 0.003M/s  (  3.413M/s/cpu)
> uprobe-nop            ( 2 cpus):    6.305 ± 0.003M/s  (  3.153M/s/cpu)
> uprobe-nop            ( 3 cpus):    9.442 ± 0.018M/s  (  3.147M/s/cpu)
> uprobe-nop            ( 4 cpus):   12.253 ± 0.006M/s  (  3.063M/s/cpu)
> uprobe-nop            ( 5 cpus):   15.316 ± 0.007M/s  (  3.063M/s/cpu)
> uprobe-nop            ( 6 cpus):   18.287 ± 0.030M/s  (  3.048M/s/cpu)
> uprobe-nop            ( 7 cpus):   21.378 ± 0.025M/s  (  3.054M/s/cpu)
> uprobe-nop            ( 8 cpus):   23.044 ± 0.010M/s  (  2.881M/s/cpu)
> uprobe-nop            (10 cpus):   28.778 ± 0.012M/s  (  2.878M/s/cpu)
> uprobe-nop            (12 cpus):   31.300 ± 0.016M/s  (  2.608M/s/cpu)
> uprobe-nop            (14 cpus):   36.580 ± 0.007M/s  (  2.613M/s/cpu)
> uprobe-nop            (16 cpus):   38.848 ± 0.017M/s  (  2.428M/s/cpu)
> uprobe-nop            (24 cpus):   60.298 ± 0.080M/s  (  2.512M/s/cpu)
> uprobe-nop            (32 cpus):   77.137 ± 1.957M/s  (  2.411M/s/cpu)
> uprobe-nop            (40 cpus):   89.205 ± 1.278M/s  (  2.230M/s/cpu)
> uprobe-nop            (48 cpus):   99.207 ± 0.444M/s  (  2.067M/s/cpu)
> uprobe-nop            (56 cpus):  102.399 ± 0.484M/s  (  1.829M/s/cpu)
> uprobe-nop            (64 cpus):  115.390 ± 0.972M/s  (  1.803M/s/cpu)
> uprobe-nop            (72 cpus):  127.476 ± 0.050M/s  (  1.770M/s/cpu)
> uprobe-nop            (80 cpus):  137.304 ± 0.068M/s  (  1.716M/s/cpu)
> 
> SRCU-lite for uprobes (C)
> =========================
> uprobe-nop            ( 1 cpus):    3.446 ± 0.010M/s  (  3.446M/s/cpu)
> uprobe-nop            ( 2 cpus):    6.411 ± 0.003M/s  (  3.206M/s/cpu)
> uprobe-nop            ( 3 cpus):    9.563 ± 0.039M/s  (  3.188M/s/cpu)
> uprobe-nop            ( 4 cpus):   12.454 ± 0.016M/s  (  3.113M/s/cpu)
> uprobe-nop            ( 5 cpus):   15.634 ± 0.008M/s  (  3.127M/s/cpu)
> uprobe-nop            ( 6 cpus):   18.443 ± 0.018M/s  (  3.074M/s/cpu)
> uprobe-nop            ( 7 cpus):   21.793 ± 0.057M/s  (  3.113M/s/cpu)
> uprobe-nop            ( 8 cpus):   23.172 ± 0.013M/s  (  2.897M/s/cpu)
> uprobe-nop            (10 cpus):   29.430 ± 0.021M/s  (  2.943M/s/cpu)
> uprobe-nop            (12 cpus):   32.035 ± 0.008M/s  (  2.670M/s/cpu)
> uprobe-nop            (14 cpus):   37.174 ± 0.046M/s  (  2.655M/s/cpu)
> uprobe-nop            (16 cpus):   39.793 ± 0.005M/s  (  2.487M/s/cpu)
> uprobe-nop            (24 cpus):   61.656 ± 0.187M/s  (  2.569M/s/cpu)
> uprobe-nop            (32 cpus):   79.616 ± 0.207M/s  (  2.488M/s/cpu)
> uprobe-nop            (40 cpus):   96.851 ± 0.128M/s  (  2.421M/s/cpu)
> uprobe-nop            (48 cpus):  104.178 ± 0.033M/s  (  2.170M/s/cpu)
> uprobe-nop            (56 cpus):  105.689 ± 0.703M/s  (  1.887M/s/cpu)
> uprobe-nop            (64 cpus):  119.432 ± 0.146M/s  (  1.866M/s/cpu)
> uprobe-nop            (72 cpus):  127.574 ± 0.033M/s  (  1.772M/s/cpu)
> uprobe-nop            (80 cpus):  135.162 ± 0.207M/s  (  1.690M/s/cpu)
> 
> So, say, at 32 threads, we get 79.6 vs 77.1, which is about a 3%
> throughput win. That is not negligible!
> 
> Note that as we get to 80 cores, the data is noisier (hyperthreading,
> background system noise, etc). But you can still see an improvement
> across basically the entire range.
> 
> Hopefully the above data is useful.

Thank you very much for running this, Andrii!

And I agree that it is no surprise that Tasks Trace RCU is faster than
SRCU-lite, but I must confess that I never did expect SRCU's array
accesses to be free.  And the ability to create multiple independent
instances of SRCU-lite might compensate in at least some cases.

							Thanx, Paul
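
The "about 3%" figure quoted above for 32 CPUs can be rechecked from the
(D) and (C) uprobe-nop tables. The following is only a minimal sketch:
the throughput values are copied from a few rows of those tables, and the
names srcu, srcu_lite and relative_gain() are made up for illustration
and are not part of any benchmark tooling.

```python
# Throughput in M/s for selected CPU counts, copied from the quoted
# uprobe-nop tables: (D) is plain SRCU, (C) is SRCU-lite.
srcu = {1: 3.413, 8: 23.044, 32: 77.137, 40: 89.205, 80: 137.304}
srcu_lite = {1: 3.446, 8: 23.172, 32: 79.616, 40: 96.851, 80: 135.162}


def relative_gain(new, old):
    """Return the percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Percentage gain of SRCU-lite over SRCU at each selected CPU count.
gains = {cpus: relative_gain(srcu_lite[cpus], srcu[cpus]) for cpus in srcu}

for cpus, gain in gains.items():
    print(f"{cpus:2d} cpus: {gain:+.1f}%")
```

With these rows the 32-CPU gain comes out at a little over 3%, matching
the estimate in the quoted message, while the 80-CPU row comes out
slightly negative, which fits the caveat about noise at high CPU counts.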

> > ------------------------------------------------------------------------
> >
> >  Documentation/admin-guide/kernel-parameters.txt   |    4
> >  b/Documentation/admin-guide/kernel-parameters.txt |    8 +
> >  b/include/linux/srcu.h                            |   21 +-
> >  b/include/linux/srcutree.h                        |    2
> >  b/kernel/rcu/rcutorture.c                         |   28 +--
> >  b/kernel/rcu/refscale.c                           |   54 +++++--
> >  b/kernel/rcu/srcutree.c                           |   16 +-
> >  include/linux/srcu.h                              |   86 +++++++++--
> >  include/linux/srcutree.h                          |    5
> >  kernel/rcu/rcutorture.c                           |   37 +++-
> >  kernel/rcu/srcutree.c                             |  168 +++++++++++++++-------
> >  11 files changed, 308 insertions(+), 121 deletions(-)