Message-ID: <20240410221301.44576-1-kuniyu@amazon.com>
Date: Wed, 10 Apr 2024 15:13:01 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <krisman@...e.de>
CC: <davem@...emloft.net>, <lmb@...valent.com>, <martin.lau@...nel.org>,
<netdev@...r.kernel.org>, <willemdebruijn.kernel@...il.com>, "Kuniyuki
Iwashima" <kuniyu@...zon.com>
Subject: Re: [PATCH v2] udp: Avoid call to compute_score on multiple sites
From: Gabriel Krisman Bertazi <krisman@...e.de>
Date: Wed, 10 Apr 2024 17:50:47 -0400
> We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> sockets are present"). The failing tests were those that would spawn
> UDP sockets per-cpu on systems that have a high number of cpus.
>
> Unsurprisingly, it is not caused by the extra re-scoring of the reused
> socket, but due to the compiler no longer inlining compute_score, once
> it has the extra call site in udp4_lib_lookup2. This is augmented by
> the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>
> We could just explicitly inline it, but compute_score() is quite a large
> function, around 300b. Inlining in two sites would almost double
> udp4_lib_lookup2, which is a silly thing to do just to workaround a
> mitigation. Instead, this patch shuffles the code a bit to avoid the
> multiple calls to compute_score. Since it is a static function used in
> one spot, the compiler can safely fold it in, as it did before, without
> increasing the text size.
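(For other readers following along: a minimal userspace sketch of the single-call-site pattern described above. All names here are hypothetical; score_of() and select_from_group() merely stand in for compute_score() and the reuseport selection, and the integer "items" stand in for sockets.)

```c
#include <stddef.h>

/* score_of() stands in for compute_score(): a large static function
 * with exactly one call site below, so the compiler is free to fold
 * it into the caller without any __always_inline annotation. */
static int score_of(int item, int target)
{
	int d = item > target ? item - target : target - item;
	return 10 - d;
}

/* select_from_group() stands in for the reuseport selection, which
 * may substitute a different item whose score must then be taken
 * into account before the loop can continue. */
static int select_from_group(int item)
{
	return (item % 2 == 0) ? item + 1 : item;
}

int best_match(const int *items, size_t n, int target)
{
	int result = -1, badness = -1;
	int rescore = 0;
	int score, selected;

	for (size_t i = 0; i < n; i++) {
rescore:
		/* The only call site: to rescore a substituted 'result'
		 * we jump back here instead of adding a second call,
		 * which would defeat inlining of score_of(). */
		score = score_of(rescore ? result : items[i], target);
		rescore = 0;
		if (score > badness) {
			badness = score;
			selected = select_from_group(items[i]);
			if (selected != items[i]) {
				result = selected;
				rescore = 1;
				goto rescore;
			}
			result = items[i];
		}
	}
	return result;
}
```

The re-entry after `goto rescore` terminates because the substituted item's score can beat `badness` at most once per candidate; the shape mirrors the diff below, not the exact kernel semantics.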
>
> With this patch applied I ran my original iperf3 testcases. The failing
> cases all looked like this (ipv4):
> iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
>
> where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each.
> baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> tree. harmean == harmonic mean; CV == coefficient of variation.
>
> ipv4:
> 1G 10G MAX
> HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> patched 1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
>
> ipv6:
> 1G 10G MAX
> HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> baseline 1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> patched 1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
>
> This restores the performance we had before the change above with this
> benchmark. We obviously don't expect any real impact when mitigations
> are disabled, but just to be sure it also doesn't regress:
>
> mitigations=off ipv4:
> 1G 10G MAX
> HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
>
> Cc: Lorenz Bauer <lmb@...valent.com>
> Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
>
> ---
> Changes since v1:
> (me)
> - recollected performance data after changes below only for the
> mitigations enabled case.
> (suggested by Willem de Bruijn)
> - Drop __always_inline in compute_score
> - Simplify logic by replacing third struct sock pointer with bool
> - Fix typo in commit message
> - Don't explicitly break out of loop after rescore
> ---
> net/ipv4/udp.c | 18 +++++++++++++-----
> net/ipv6/udp.c | 17 +++++++++++++----
> 2 files changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 661d0e0d273f..a13ef8e06093 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> {
> struct sock *sk, *result;
> int score, badness;
> + bool rescore = false;
nit: Keep reverse xmas tree order.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>
> result = NULL;
> badness = 0;
> udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> - score = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> +rescore:
> + score = compute_score((rescore ? result : sk), net, saddr,
I guess the () around the ternary is not needed?
Same for the IPv6 side.
Otherwise, looks good to me.
Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> + sport, daddr, hnum, dif, sdif);
> + rescore = false;
> if (score > badness) {
> badness = score;
>
> @@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> if (IS_ERR(result))
> continue;
>
> - badness = compute_score(result, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> -
> + /* compute_score is too long of a function to be
> + * inlined, and calling it again here yields
> + * measurable overhead for some
> + * workloads. Work around it by jumping
> + * backwards to rescore 'result'.
> + */
> + rescore = true;
> + goto rescore;
> }
> }
> return result;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 7c1e6469d091..7a55c050de2b 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -168,12 +168,15 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> {
> struct sock *sk, *result;
> int score, badness;
> + bool rescore = false;
>
> result = NULL;
> badness = -1;
> udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> - score = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> +rescore:
> + score = compute_score((rescore ? result : sk), net, saddr,
> + sport, daddr, hnum, dif, sdif);
> + rescore = false;
> if (score > badness) {
> badness = score;
>
> @@ -197,8 +200,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> if (IS_ERR(result))
> continue;
>
> - badness = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> + /* compute_score is too long of a function to be
> + * inlined, and calling it again here yields
> + * measurable overhead for some
> + * workloads. Work around it by jumping
> + * backwards to rescore 'result'.
> + */
> + rescore = true;
> + goto rescore;
> }
> }
> return result;
> --
> 2.44.0