[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87a5m08y09.fsf@mailhost.krisman.be>
Date: Wed, 10 Apr 2024 21:54:30 -0400
From: Gabriel Krisman Bertazi <krisman@...e.de>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, davem@...emloft.net,
lmb@...valent.com, martin.lau@...nel.org, netdev@...r.kernel.org
Subject: Re: [PATCH v2] udp: Avoid call to compute_score on multiple sites
Willem de Bruijn <willemdebruijn.kernel@...il.com> writes:
> Kuniyuki Iwashima wrote:
>> From: Gabriel Krisman Bertazi <krisman@...e.de>
>> Date: Wed, 10 Apr 2024 17:50:47 -0400
>> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
>> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
>> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
>> > sockets are present"). The failing tests were those that would spawn
>> > UDP sockets per-cpu on systems that have a high number of cpus.
>> >
>> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
>> > socket, but due to the compiler no longer inlining compute_score, once
>> > it has the extra call site in udp4_lib_lookup2. This is augmented by
>> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
>> >
>> > We could just explicitly inline it, but compute_score() is quite a large
>> > function, around 300b. Inlining in two sites would almost double
>> > udp4_lib_lookup2, which is a silly thing to do just to workaround a
>> > mitigation. Instead, this patch shuffles the code a bit to avoid the
>> > multiple calls to compute_score. Since it is a static function used in
>> > one spot, the compiler can safely fold it in, as it did before, without
>> > increasing the text size.
>> >
>> > With this patch applied I ran my original iperf3 testcases. The failing
>> > cases all looked like this (ipv4):
>> > iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
>> >
>> > where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each.
>> > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
>> > tree. harmean == harmonic mean; CV == coefficient of variation.
>> >
>> > ipv4:
>> > 1G 10G MAX
>> > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
>> > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
>> > patched 1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
>> >
>> > ipv6:
>> > 1G 10G MAX
>> > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
>> > baseline 1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
>> > patched 1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
>> >
>> > This restores the performance we had before the change above with this
>> > benchmark. We obviously don't expect any real impact when mitigations
>> > are disabled, but just to be sure it also doesn't regresses:
>> >
>> > mitigations=off ipv4:
>> > 1G 10G MAX
>> > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
>> > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
>> > patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
>> >
>> > Cc: Lorenz Bauer <lmb@...valent.com>
>> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
>> > Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
>> >
>> > ---
>> > Changes since v1:
>> > (me)
>> > - recollected performance data after changes below only for the
>> > mitigations enabled case.
>> > (suggested by Willem de Bruijn)
>> > - Drop __always_inline in compute_score
>> > - Simplify logic by replacing third struct sock pointer with bool
>> > - Fix typo in commit message
>> > - Don't explicitly break out of loop after rescore
>> > ---
>> > net/ipv4/udp.c | 18 +++++++++++++-----
>> > net/ipv6/udp.c | 17 +++++++++++++----
>> > 2 files changed, 26 insertions(+), 9 deletions(-)
>> >
>> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> > index 661d0e0d273f..a13ef8e06093 100644
>> > --- a/net/ipv4/udp.c
>> > +++ b/net/ipv4/udp.c
>> > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>> > {
>> > struct sock *sk, *result;
>> > int score, badness;
>> > + bool rescore = false;
>>
>> nit: Keep reverse xmax tree order.
>> https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>>
>> >
>> > result = NULL;
>> > badness = 0;
>> > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
>> > - score = compute_score(sk, net, saddr, sport,
>> > - daddr, hnum, dif, sdif);
>> > +rescore:
>> > + score = compute_score((rescore ? result : sk), net, saddr,
>>
>> I guess () is not needed around rescore ?
>>
>> Both same for IPv6.
>>
>> Otherwise, looks good to me.
>>
>> Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
>
> Can we avoid using the same name for the label and boolean?
>
> And since if looping result will have state TCP_ESTABLISHED, can it
> just be
>
> sk = result;
> goto rescore;
This would be much simpler, sure. I actually didn't want to do it
because sk is the iteration cursor, and I couldn't prove to myself it is
safe to skip through part of the list (assuming result isn't the
immediate next socket in the list). I'll take a better look, as I'm not
familiar with this code, but if you considered this already, I will
follow up with the change you suggested.
>
>>
>> > + sport, daddr, hnum, dif, sdif);
>> > + rescore = false;
>> > if (score > badness) {
>> > badness = score;
>> >
>> > @@ -456,9 +459,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
>> > if (IS_ERR(result))
>> > continue;
>> >
>> > - badness = compute_score(result, net, saddr, sport,
>> > - daddr, hnum, dif, sdif);
>> > -
>> > + /* compute_score is too long of a function to be
>> > + * inlined, and calling it again here yields
>> > + * measureable overhead for some
>> > + * workloads. Work around it by jumping
>> > + * backwards to rescore 'result'.
>> > + */
>> > + rescore = true;
>> > + goto rescore;
>> > }
>> > }
--
Gabriel Krisman Bertazi
Powered by blists - more mailing lists