Message-ID: <661ac03b65993_3be9a729488@willemb.c.googlers.com.notmuch>
Date: Sat, 13 Apr 2024 13:26:19 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>,
krisman@...e.de
Cc: davem@...emloft.net,
kuniyu@...zon.com,
lmb@...valent.com,
martin.lau@...nel.org,
netdev@...r.kernel.org,
willemdebruijn.kernel@...il.com
Subject: Re: [PATCH v3] udp: Avoid call to compute_score on multiple sites
Kuniyuki Iwashima wrote:
> From: Gabriel Krisman Bertazi <krisman@...e.de>
> Date: Fri, 12 Apr 2024 17:20:04 -0400
> > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > sockets are present"). The failing tests were those that would spawn
> > UDP sockets per-cpu on systems that have a high number of cpus.
> >
> > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > socket, but due to the compiler no longer inlining compute_score, once
> > it has the extra call site in udp4_lib_lookup2. This is augmented by
> > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> >
> > We could just explicitly inline it, but compute_score() is quite a large
> > function, around 300b. Inlining in two sites would almost double
> > udp4_lib_lookup2, which is a silly thing to do just to work around a
> > mitigation. Instead, this patch shuffles the code a bit to avoid the
> > multiple calls to compute_score. Since it is a static function used in
> > one spot, the compiler can safely fold it in, as it did before, without
> > increasing the text size.
> >
> > With this patch applied I ran my original iperf3 testcases. The failing
> > cases all looked like this (ipv4):
> > iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> >
> > where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each.
> > baseline is v6.9-rc3. harmean == harmonic mean; CV == coefficient of
> > variation.
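
For readers unfamiliar with the two statistics named above, here is a minimal sketch of how they can be computed. The sample values are hypothetical, and defining CV as the sample standard deviation divided by the arithmetic mean is an assumption; the message does not say which variant was used.

```python
import statistics


def harmean(samples):
    """Harmonic mean: n / sum(1/x). Suited to averaging rates such as throughput."""
    return statistics.harmonic_mean(samples)


def coeff_of_variation(samples):
    """Coefficient of variation.

    Assumption: sample standard deviation over arithmetic mean.
    """
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical throughput readings from three runs (not taken from the tables).
runs = [100.0, 200.0, 400.0]

# Same "HARMEAN(CV)" cell layout as the tables in the quoted message.
print(f"{harmean(runs):.2f}({coeff_of_variation(runs):.4f})")
```

A low CV means the three runs agree closely, so a difference between baseline and patched that is much larger than the CV is unlikely to be noise.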
> >
> > ipv4:
> >                          1G                 10G                 MAX
> >                HARMEAN (CV)        HARMEAN (CV)        HARMEAN (CV)
> > baseline 1743852.66(0.0208)  1725933.02(0.0167)  1705203.78(0.0386)
> > patched  1968727.61(0.0035)  1962283.22(0.0195)  1923853.50(0.0256)
> >
> > ipv6:
> >                          1G                 10G                 MAX
> >                HARMEAN (CV)        HARMEAN (CV)        HARMEAN (CV)
> > baseline 1729020.03(0.0028)  1691704.49(0.0243)  1692251.34(0.0083)
> > patched  1900422.19(0.0067)  1900968.01(0.0067)  1568532.72(0.1519)
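
As a reading aid, the relative change between the baseline and patched rows can be worked out directly from the harmonic means quoted above. The sketch below only restates that arithmetic; the dictionary layout and the helper name `gain_percent` are hypothetical.

```python
# Harmonic means copied from the ipv4 and ipv6 tables in the quoted message.
RESULTS = {
    "ipv4": {
        "baseline": {"1G": 1743852.66, "10G": 1725933.02, "MAX": 1705203.78},
        "patched": {"1G": 1968727.61, "10G": 1962283.22, "MAX": 1923853.50},
    },
    "ipv6": {
        "baseline": {"1G": 1729020.03, "10G": 1691704.49, "MAX": 1692251.34},
        "patched": {"1G": 1900422.19, "10G": 1900968.01, "MAX": 1568532.72},
    },
}


def gain_percent(family, rate):
    """Relative change of patched over baseline, in percent."""
    base = RESULTS[family]["baseline"][rate]
    new = RESULTS[family]["patched"][rate]
    return (new - base) / base * 100.0


for family in RESULTS:
    for rate in ("1G", "10G", "MAX"):
        print(f"{family} {rate:>3}: {gain_percent(family, rate):+.1f}%")
```

On these figures every ipv4 column improves by roughly 13%, and ipv6 improves by about 10% at 1G and 12% at 10G. The ipv6 MAX column comes out about 7% lower, but its patched CV (0.1519) is far larger than any other cell, so that one result is noisy.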
> >
> > This restores the performance we had before the change above with this
> > benchmark. We obviously don't expect any real impact when mitigations
> > are disabled, but just to be sure it also doesn't regress:
> >
> > mitigations=off ipv4:
> >                          1G                 10G                 MAX
> >                HARMEAN (CV)        HARMEAN (CV)        HARMEAN (CV)
> > baseline 3230279.97(0.0066)  3229320.91(0.0060)  2605693.19(0.0697)
> > patched  3242802.36(0.0073)  3239310.71(0.0035)  2502427.19(0.0882)
> >
> > Cc: Lorenz Bauer <lmb@...valent.com>
> > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > Signed-off-by: Gabriel Krisman Bertazi <krisman@...e.de>
>
> Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
Reviewed-by: Willem de Bruijn <willemb@...gle.com>