[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240413012539.16180-1-kuniyu@amazon.com>
Date: Fri, 12 Apr 2024 18:25:39 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <krisman@...e.de>
CC: <davem@...emloft.net>, <kuniyu@...zon.com>, <lmb@...valent.com>,
<martin.lau@...nel.org>, <netdev@...r.kernel.org>,
<willemdebruijn.kernel@...il.com>
Subject: Re: [PATCH v3] udp: Avoid call to compute_score on multiple sites
From: Gabriel Krisman Bertazi <krisman@...e.de>
Date: Fri, 12 Apr 2024 17:20:04 -0400
> We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> sockets are present"). The failing tests were those that would spawn
> UDP sockets per-cpu on systems that have a high number of cpus.
>
> Unsurprisingly, it is not caused by the extra re-scoring of the reused
> socket, but due to the compiler no longer inlining compute_score, once
> it has the extra call site in udp4_lib_lookup2. This is augmented by
> the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>
> We could just explicitly inline it, but compute_score() is quite a large
> function, around 300b. Inlining in two sites would almost double
> udp4_lib_lookup2, which is a silly thing to do just to workaround a
> mitigation. Instead, this patch shuffles the code a bit to avoid the
> multiple calls to compute_score. Since it is a static function used in
> one spot, the compiler can safely fold it in, as it did before, without
> increasing the text size.
>
> With this patch applied I ran my original iperf3 testcases. The failing
> cases all looked like this (ipv4):
> iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
>
> where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each.
> baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of
> variation.
>
> ipv4:
> 1G 10G MAX
> HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386)
> patched 1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256)
>
> ipv6:
> 1G 10G MAX
> HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083)
> patched 1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519)
>
> This restores the performance we had before the change above with this
> benchmark. We obviously don't expect any real impact when mitigations
> are disabled, but just to be sure it also doesn't regresses:
>
> mitigations=off ipv4:
> 1G 10G MAX
> HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
>
> Cc: Lorenz Bauer <lmb@...valent.com>
> Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
Thanks!
>
> ---
> Changes since v2:
> (me)
> - recollected performance data after changes below only for the
> mitigations=auto case.
> (suggested by Willem de Bruijn)
> - Explicitly continue the loop after a rescore
> - rename rescore variable to not clash with jump label
> - disable rescore for new loop iteration
> (suggested by Kuniyuki Iwashima)
> - sort stack variables
> - drop unneeded ()
>
> Changes since v1:
> (me)
> - recollected performance data after changes below only for the
> mitigations enabled case.
> (suggested by Willem de Bruijn)
> - Drop __always_inline in compute_score
> - Simplify logic by replacing third struct sock pointer with bool
> - Fix typo in commit message
> - Don't explicitly break out of loop after rescore
> ---
> net/ipv4/udp.c | 21 ++++++++++++++++-----
> net/ipv6/udp.c | 20 ++++++++++++++++----
> 2 files changed, 32 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c02bf011d4a6..4eff1e145c63 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -427,15 +427,21 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> {
> struct sock *sk, *result;
> int score, badness;
> + bool need_rescore;
>
> result = NULL;
> badness = 0;
> udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> - score = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> + need_rescore = false;
> +rescore:
> + score = compute_score(need_rescore ? result : sk, net, saddr,
> + sport, daddr, hnum, dif, sdif);
> if (score > badness) {
> badness = score;
>
> + if (need_rescore)
> + continue;
> +
> if (sk->sk_state == TCP_ESTABLISHED) {
> result = sk;
> continue;
> @@ -456,9 +462,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> if (IS_ERR(result))
> continue;
>
> - badness = compute_score(result, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> -
> + /* compute_score is too long of a function to be
> + * inlined, and calling it again here yields
> + * measureable overhead for some
> + * workloads. Work around it by jumping
> + * backwards to rescore 'result'.
> + */
> + need_rescore = true;
> + goto rescore;
> }
> }
> return result;
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 8b1dd7f51249..e80e8b1d2000 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -168,15 +168,21 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> {
> struct sock *sk, *result;
> int score, badness;
> + bool need_rescore;
>
> result = NULL;
> badness = -1;
> udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> - score = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> + need_rescore = false;
> +rescore:
> + score = compute_score(need_rescore ? result : sk, net, saddr,
> + sport, daddr, hnum, dif, sdif);
> if (score > badness) {
> badness = score;
>
> + if (need_rescore)
> + continue;
> +
> if (sk->sk_state == TCP_ESTABLISHED) {
> result = sk;
> continue;
> @@ -197,8 +203,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> if (IS_ERR(result))
> continue;
>
> - badness = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> + /* compute_score is too long of a function to be
> + * inlined, and calling it again here yields
> + * measureable overhead for some
> + * workloads. Work around it by jumping
> + * backwards to rescore 'result'.
> + */
> + need_rescore = true;
> + goto rescore;
> }
> }
> return result;
> --
> 2.44.0
Powered by blists - more mailing lists