netdev - Re: [PATCH] udp: Avoid call to compute

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6611892d95684_172b6329493@willemb.c.googlers.com.notmuch>
Date: Sat, 06 Apr 2024 13:41:01 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, 
 Gabriel Krisman Bertazi <krisman@...e.de>, 
 willemdebruijn.kernel@...il.com, 
 davem@...emloft.net
Cc: netdev@...r.kernel.org, 
 martin.lau@...nel.org, 
 Gabriel Krisman Bertazi <krisman@...e.de>, 
 Lorenz Bauer <lmb@...valent.com>
Subject: Re: [PATCH] udp: Avoid call to compute_score on multiple sites

Willem de Bruijn wrote:
> Gabriel Krisman Bertazi wrote:
> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > sockets are present").  The failing tests were those that would spawn
> > UDP sockets per-cpu on systems that have a high number of cpus.
> > 
> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > socket, but due to the compiler no longer inlining compute_score, once
> > it has the extra call site in upd5_lib_lookup2.  This is augmented by
> 
> udp4_lib_lookup2
> 
> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> > 
> > We could just explicitly inline it, but compute_score() is quite a large
> > function, around 300b.  Inlining in two sites would almost double
> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
> > multiple calls to compute_score.  Since it is a static function used in
> > one spot, the compiler can safely fold it in, as it did before, without
> > increasing the text size.
> > 
> > With this patch applied I ran my original iperf3 testcases.  The failing
> > cases all looked like this (ipv4):
> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2 2>&1
> > 
> > where $R is either 1G/10G/0 (max, unlimited).  I ran 5 times each.
> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> > tree. harmean == harmonic mean; CV == coefficient of variation.
> > 
> > ipv4:
> >                  1G                10G                  MAX
> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > baseline 1726716.59(0.0401) 1751758.50(0.0068) 1425388.83(0.1276)
> > patched  1842337.77(0.0711) 1861574.00(0.0774) 1888601.95(0.0580)
> > 
> > ipv6:
> >                  1G                10G                  MAX
> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > baseline: 1693636.28(0.0132) 1704418.23(0.0094) 1519681.83(0.1299)
> > patched   1909754.24(0.0307) 1782295.80(0.0539) 1632803.48(0.1185)
> > 
> > This restores the performance we had before the change above with this
> > benchmark.  We obviously don't expect any real impact when mitigations
> > are disabled, but just to be sure it also doesn't regresses:
> > 
> > mitigations=off ipv4:
> >                  1G                10G                  MAX
> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> > 
> > Finally, I can see this restores compute_score inlining in my gcc
> > without extra function attributes. Out of caution, I still added
> > __always_inline in compute_score, to prevent future changes from
> > un-inlining it again.  Since it is only in one site, it should be fine.
> > 
> > Cc: Lorenz Bauer <lmb@...valent.com>
> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
> > 
> > ---
> > Another idea would be shrinking compute_score and then inlining it.  I'm
> > not a network developer, but it seems that we can avoid most of the
> > "same network" checks of calculate_score when passing a socket from the
> > reusegroup.  If that is the case, we can fork out a compute_score_fast
> > that can be safely inlined at the second call site of the existing
> > compute_score.  I didn't pursue this any further.
> > ---
> >  net/ipv4/udp.c | 24 ++++++++++++++++++------
> >  net/ipv6/udp.c | 23 ++++++++++++++++++-----
> >  2 files changed, 36 insertions(+), 11 deletions(-)
> > 
> 
> > diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> > index 7c1e6469d091..883e62228432 100644
> > --- a/net/ipv6/udp.c
> > +++ b/net/ipv6/udp.c
> > @@ -114,7 +114,11 @@ void udp_v6_rehash(struct sock *sk)
> >  	udp_lib_rehash(sk, new_hash);
> >  }
> >  
> > -static int compute_score(struct sock *sk, struct net *net,
> > +/* While large, compute_score is in the UDP hot path and only used once
> > + * in udp4_lib_lookup2. Avoiding the function call by inlining it has
> 
> udp6_lib_lookup2
> 
> > + * yield measurable benefits in iperf3-based benchmarks.
> > + */
> > +static __always_inline int compute_score(struct sock *sk, struct net *net,
> >  			 const struct in6_addr *saddr, __be16 sport,
> >  			 const struct in6_addr *daddr, unsigned short hnum,

Forgot to mention: __always_inline is used very sparingly.

I don't think this qualifies. It did not have that attribute before,
nor needs it.