Message-ID: <66115fea712ed_16bd4c294c6@willemb.c.googlers.com.notmuch>
Date: Sat, 06 Apr 2024 10:44:58 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Gabriel Krisman Bertazi <krisman@...e.de>,
willemdebruijn.kernel@...il.com,
davem@...emloft.net
Cc: netdev@...r.kernel.org,
martin.lau@...nel.org,
Gabriel Krisman Bertazi <krisman@...e.de>,
Lorenz Bauer <lmb@...valent.com>
Subject: Re: [PATCH] udp: Avoid call to compute_score on multiple sites
Gabriel Krisman Bertazi wrote:
> We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> sockets are present"). The failing tests were those that would spawn
> UDP sockets per-cpu on systems that have a high number of cpus.
>
> Unsurprisingly, it is not caused by the extra re-scoring of the reused
> socket, but due to the compiler no longer inlining compute_score, once
> it has the extra call site in upd5_lib_lookup2. This is augmented by
udp4_lib_lookup2
> the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>
> We could just explicitly inline it, but compute_score() is quite a large
> function, around 300 bytes. Inlining it in two sites would almost double
> udp4_lib_lookup2, which is a silly thing to do just to work around a
> mitigation. Instead, this patch shuffles the code a bit to avoid the
> multiple calls to compute_score. Since it is a static function used in
> one spot, the compiler can safely fold it in, as it did before, without
> increasing the text size.
>
> With this patch applied I ran my original iperf3 testcases. The failing
> cases all looked like this (ipv4):
> iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2 2>&1
>
> where $R is either 1G/10G/0 (max, unlimited). I ran each case 5 times.
> The baseline is 6.9.0-rc1-g962490525cff, a recent checkout of Linus'
> tree. harmean == harmonic mean; CV == coefficient of variation.
>
> ipv4:
>                  1G                  10G                  MAX
>             HARMEAN (CV)         HARMEAN (CV)         HARMEAN (CV)
> baseline 1726716.59(0.0401)   1751758.50(0.0068)   1425388.83(0.1276)
> patched  1842337.77(0.0711)   1861574.00(0.0774)   1888601.95(0.0580)
>
> ipv6:
>                  1G                  10G                  MAX
>             HARMEAN (CV)         HARMEAN (CV)         HARMEAN (CV)
> baseline 1693636.28(0.0132)   1704418.23(0.0094)   1519681.83(0.1299)
> patched  1909754.24(0.0307)   1782295.80(0.0539)   1632803.48(0.1185)
>
> This restores the performance we had before the change above with this
> benchmark. We obviously don't expect any real impact when mitigations
> are disabled, but just to be sure it also doesn't regress:
>
> mitigations=off ipv4:
>                  1G                  10G                  MAX
>             HARMEAN (CV)         HARMEAN (CV)         HARMEAN (CV)
> baseline 3230279.97(0.0066)   3229320.91(0.0060)   2605693.19(0.0697)
> patched  3242802.36(0.0073)   3239310.71(0.0035)   2502427.19(0.0882)
>
> Finally, I can see this restores compute_score inlining with my gcc,
> without extra function attributes. Out of caution, I still added
> __always_inline to compute_score, to prevent future changes from
> un-inlining it again. Since it is only used in one site, it should be fine.
>
> Cc: Lorenz Bauer <lmb@...valent.com>
> Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
>
> ---
> Another idea would be shrinking compute_score and then inlining it. I'm
> not a network developer, but it seems that we could avoid most of the
> "same network" checks of compute_score when passing a socket from the
> reuseport group. If that is the case, we could fork out a
> compute_score_fast that can be safely inlined at the second call site of
> the existing compute_score. I didn't pursue this any further.
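For what it's worth, if that route were explored, the fast path might look
roughly like the sketch below (untested; compute_score_fast is a made-up
name, and whether the reuseport group really guarantees that the
net/local-port/local-address checks can be skipped would need to be
verified first):

static __always_inline int compute_score_fast(struct sock *sk, struct net *net,
					      const struct in6_addr *saddr,
					      __be16 sport, int dif, int sdif)
{
	int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
	struct inet_sock *inet = inet_sk(sk);
	int score = 0;

	/* Connected 4-tuple checks still apply to the rescored socket. */
	if (inet->inet_dport) {
		if (inet->inet_dport != sport)
			return -1;
		score++;
	}
	if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
		if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
			return -1;
		score++;
	}

	/* Device binding and incoming-CPU scoring are also kept. */
	if (!udp_sk_bound_dev_eq(net, bound_dev_if, dif, sdif))
		return -1;
	if (bound_dev_if)
		score++;
	if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id())
		score++;

	return score;
}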
> ---
> net/ipv4/udp.c | 24 ++++++++++++++++++------
> net/ipv6/udp.c | 23 ++++++++++++++++++-----
> 2 files changed, 36 insertions(+), 11 deletions(-)
>
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 7c1e6469d091..883e62228432 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -114,7 +114,11 @@ void udp_v6_rehash(struct sock *sk)
> udp_lib_rehash(sk, new_hash);
> }
>
> -static int compute_score(struct sock *sk, struct net *net,
> +/* While large, compute_score is in the UDP hot path and only used once
> + * in udp4_lib_lookup2. Avoiding the function call by inlining it has
udp6_lib_lookup2
> + * yielded measurable benefits in iperf3-based benchmarks.
> + */
> +static __always_inline int compute_score(struct sock *sk, struct net *net,
> const struct in6_addr *saddr, __be16 sport,
> const struct in6_addr *daddr, unsigned short hnum,
> int dif, int sdif)
> @@ -166,16 +170,20 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> int dif, int sdif, struct udp_hslot *hslot2,
> struct sk_buff *skb)
> {
> - struct sock *sk, *result;
> + struct sock *sk, *result, *this;
> int score, badness;
>
> result = NULL;
> badness = -1;
> udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> - score = compute_score(sk, net, saddr, sport,
> + this = sk;
> +rescore:
> + score = compute_score(this, net, saddr, sport,
> daddr, hnum, dif, sdif);
> if (score > badness) {
> badness = score;
> + if (this != sk)
> + continue;
Can we just rely on the score not increasing indefinitely on retry
to break out of the loop?

Or, if an explicit "this is a rescore" boolean is needed, a boolean
makes the control flow easier to follow than a third struct sock *.
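For illustration only, the boolean variant could be shaped like this
(untested sketch; the existing TCP_ESTABLISHED/reuseport branch from the
hunk above is elided):

	struct sock *sk, *result;
	int score, badness;
	bool rescored;

	result = NULL;
	badness = -1;
	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
		rescored = false;
rescore:
		score = compute_score(rescored ? result : sk, net, saddr,
				      sport, daddr, hnum, dif, sdif);
		if (score > badness) {
			badness = score;
			if (rescored)
				continue;

			/* ... TCP_ESTABLISHED and reuseport handling as in
			 * the hunk above; when reuseport hands back a
			 * different socket, rescore it instead of calling
			 * compute_score() from a second call site:
			 */
			rescored = true;
			goto rescore;
		}
	}
	return result;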
>
> if (sk->sk_state == TCP_ESTABLISHED) {
> result = sk;
> @@ -197,8 +205,13 @@ static struct sock *udp6_lib_lookup2(struct net *net,
> if (IS_ERR(result))
> continue;
>
> - badness = compute_score(sk, net, saddr, sport,
> - daddr, hnum, dif, sdif);
> + /* compute_score is too long of a function to be
> + * inlined, and calling it again yields
> + * measurable overhead. Work around it by
> + * jumping backwards to score 'result'.
> + */
> + this = result;
> + goto rescore;
> }
> }
> return result;
> --
> 2.44.0
>