lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <871q7b8ymc.fsf@mailhost.krisman.be>
Date: Thu, 11 Apr 2024 15:53:31 -0400
From: Gabriel Krisman Bertazi <krisman@...e.de>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: <davem@...emloft.net>,  <lmb@...valent.com>,  <martin.lau@...nel.org>,
  <netdev@...r.kernel.org>,  <willemdebruijn.kernel@...il.com>
Subject: Re: [PATCH v2] udp: Avoid call to compute_score on multiple sites

Kuniyuki Iwashima <kuniyu@...zon.com> writes:

> From: Gabriel Krisman Bertazi <krisman@...e.de>
> Date: Wed, 10 Apr 2024 21:54:30 -0400
>> Willem de Bruijn <willemdebruijn.kernel@...il.com> writes:
>> 
>> > Kuniyuki Iwashima wrote:
>> >> From: Gabriel Krisman Bertazi <krisman@...e.de>
>> >> Date: Wed, 10 Apr 2024 17:50:47 -0400
>> >> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
>> >> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
>> >> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
>> >> > sockets are present").  The failing tests were those that would spawn
>> >> > UDP sockets per-cpu on systems that have a high number of cpus.
>> >> > 
>> >> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
>> >> > socket, but due to the compiler no longer inlining compute_score, once
>> >> > it has the extra call site in udp4_lib_lookup2.  This is augmented by
>> >> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>> >> > 
>> >> > We could just explicitly inline it, but compute_score() is quite a large
>> >> > function, around 300b.  Inlining in two sites would almost double
>> >> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
>> >> > mitigation.  Instead, this patch shuffles the code a bit to avoid the
>> >> > multiple calls to compute_score.  Since it is a static function used in
>> >> > one spot, the compiler can safely fold it in, as it did before, without
>> >> > increasing the text size.
>> >> > 
>> >> > With this patch applied I ran my original iperf3 testcases.  The failing
>> >> > cases all looked like this (ipv4):
>> >> > 	iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
>> >> > 
>> >> > where $R is either 1G/10G/0 (max, unlimited).  I ran 3 times each.
>> >> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
>> >> > tree. harmean == harmonic mean; CV == coefficient of variation.
>> >> > 
>> >> > ipv4:
>> >> >                  1G                10G                  MAX
>> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> >> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
>> >> > patched  1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
>> >> > 
>> >> > ipv6:
>> >> >                  1G                10G                  MAX
>> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> >> > baseline  1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
>> >> > patched   1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
>> >> > 
>> >> > This restores the performance we had before the change above with this
>> >> > benchmark.  We obviously don't expect any real impact when mitigations
>> >> > are disabled, but just to be sure it also doesn't regresses:
>> >> > 
>> >> > mitigations=off ipv4:
>> >> >                  1G                10G                  MAX
>> >> > 	    HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
>> >> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
>> >> > patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
>> >> > 
>> >> > Cc: Lorenz Bauer <lmb@...valent.com>
>> >> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
>> >> > Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
>> >> > 
>> >> > ---
>> >> > Changes since v1:
>> >> > (me)
>> >> >   - recollected performance data after changes below only for the
>> >> >   mitigations enabled case.
>> >> > (suggested by Willem de Bruijn)
>> >> >   - Drop __always_inline in compute_score
>> >> >   - Simplify logic by replacing third struct sock pointer with bool
>> >> >   - Fix typo in commit message
>> >> >   - Don't explicitly break out of loop after rescore
>> >> > ---
>> >> >  net/ipv4/udp.c | 18 +++++++++++++-----
>> >> >  net/ipv6/udp.c | 17 +++++++++++++----
>> >> >  2 files changed, 26 insertions(+), 9 deletions(-)
>> >> > 
>> >> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> >> > index 661d0e0d273f..a13ef8e06093 100644
>> >> > --- a/net/ipv4/udp.c
>> >> > +++ b/net/ipv4/udp.c
>> >> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>> >> >  {
>> >> >  	struct sock *sk, *result;
>> >> >  	int score, badness;
>> >> > +	bool rescore = false;
>> >> 
>> >> nit: Keep reverse xmax tree order.
>> >> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>> >> 
>> >> >  
>> >> >  	result = NULL;
>> >> >  	badness = 0;
>> >> >  	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>> >> > -		score = compute_score(sk, net, saddr, sport,
>> >> > -				      daddr, hnum, dif, sdif);
>> >> > +rescore:
>> >> > +		score = compute_score((rescore ? result : sk), net, saddr,
>> >> 
>> >> I guess () is not needed around rescore ?
>> >> 
>> >> Both same for IPv6.
>> >> 
>> >> Otherwise, looks good to me.
>> >> 
>> >> Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
>> >
>> > Can we avoid using the same name for the label and boolean?
>> >
>> > And since if looping result will have state TCP_ESTABLISHED, can it
>> > just be
>> >
>> >     sk = result;
>> >     goto rescore;
>> 
>> This would be much simpler, sure.  I actually didn't want to do it
>> because sk is the iteration cursor, and I couldn't prove to myself it is
>> safe to skip through part of the list (assuming result isn't the
>> immediate next socket in the list).
>
> Good point, this is not safe actually.
>
> Let's say sockets on the same port are placed in these order in the list:
>
>   1. TCP_CLOSE sk w/ SO_INCOMING_CPU _not_ matching the current CPU
>   2. TCP_ESTABLISHED sk matching 4-tuple
>   3. TCP_CLOSE sk w/ SO_INCOMING_CPU matching the current CPU
>
> When we check the first socket, we'll get the 3rd socket as it matches
> the current CPU ID and TCP_ESTABLISHED cannot be selected without BPF,
> and `sk = result;` skips the 2nd socket, which should have been
> selected.

k. Considering this, what I get from the discussion is:

since we need to preserve the sk pointer, we are keeping the
condition (rescore ?  result:sk) in the compute_score call and
maintaining the reset of rescore in the hotpath to prevent scoring the
wrong thing on further loops. It is ugly, but it is a single instruction
over a in-register value, so it hardly matters.

If so, I'll do the style changes (parenthesis, sort of stack variables)
and add the early continue to resume the loop right after the rescore,
similar to what I had in V1, that Willem suggested:

            badness = score;

+            if (rescore)
+                    continue;

Please, let me know if I misunderstood, so I don't send a bogus v3.  I
will take some hours to run the tests, so I should send a v3 by
tomorrow.

-- 
Gabriel Krisman Bertazi

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ