[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMuHMdWyMU9x0rgoWxhu4-qH=QmgjnE1DfrVVYAcKOaN3Cc1wg@mail.gmail.com>
Date: Tue, 4 Jul 2023 10:40:38 +0200
From: Geert Uytterhoeven <geert@...ux-m68k.org>
To: linux-kernel@...r.kernel.org
Cc: linux-tip-commits@...r.kernel.org,
Noah Goldstein <goldstein.w.n@...il.com>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"open list:KERNEL SELFTEST FRAMEWORK"
<linux-kselftest@...r.kernel.org>
Subject: Re: [tip: x86/misc] x86/csum: Improve performance of `csum_partial`
CC linux-m68k
On Sun, Jul 2, 2023 at 5:19 PM Geert Uytterhoeven <geert@...ux-m68k.org> wrote:
> On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein
> <tip-bot2@...utronix.de> wrote:
> > The following commit has been merged into the x86/misc branch of tip:
> >
> > Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad
> > Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad
> > Author: Noah Goldstein <goldstein.w.n@...il.com>
> > AuthorDate: Wed, 10 May 2023 20:10:02 -05:00
> > Committer: Dave Hansen <dave.hansen@...ux.intel.com>
> > CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
> >
> > x86/csum: Improve performance of `csum_partial`
> >
> > 1) Add special case for len == 40 as that is the hottest value. The
> > nets a ~8-9% latency improvement and a ~30% throughput improvement
> > in the len == 40 case.
> >
> > 2) Use multiple accumulators in the 64-byte loop. This dramatically
> > improves ILP and results in up to a 40% latency/throughput
> > improvement (better for more iterations).
> >
> > Results from benchmarking on Icelake. Times measured with rdtsc()
> > len lat_new lat_old r tput_new tput_old r
> > 8 3.58 3.47 1.032 3.58 3.51 1.021
> > 16 4.14 4.02 1.028 3.96 3.78 1.046
> > 24 4.99 5.03 0.992 4.23 4.03 1.050
> > 32 5.09 5.08 1.001 4.68 4.47 1.048
> > 40 5.57 6.08 0.916 3.05 4.43 0.690
> > 48 6.65 6.63 1.003 4.97 4.69 1.059
> > 56 7.74 7.72 1.003 5.22 4.95 1.055
> > 64 6.65 7.22 0.921 6.38 6.42 0.994
> > 96 9.43 9.96 0.946 7.46 7.54 0.990
> > 128 9.39 12.15 0.773 8.90 8.79 1.012
> > 200 12.65 18.08 0.699 11.63 11.60 1.002
> > 272 15.82 23.37 0.677 14.43 14.35 1.005
> > 440 24.12 36.43 0.662 21.57 22.69 0.951
> > 952 46.20 74.01 0.624 42.98 53.12 0.809
> > 1024 47.12 78.24 0.602 46.36 58.83 0.788
> > 1552 72.01 117.30 0.614 71.92 96.78 0.743
> > 2048 93.07 153.25 0.607 93.28 137.20 0.680
> > 2600 114.73 194.30 0.590 114.28 179.32 0.637
> > 3608 156.34 268.41 0.582 154.97 254.02 0.610
> > 4096 175.01 304.03 0.576 175.89 292.08 0.602
> >
> > There is no such thing as a free lunch, however, and the special case
> > for len == 40 does add overhead to the len != 40 cases. This seems to
> > amount to be ~5% throughput and slightly less in terms of latency.
> >
> > Testing:
> > Part of this change is a new kunit test. The tests check all
> > alignment X length pairs in [0, 64) X [0, 512).
> > There are three cases.
> > 1) Precomputed random inputs/seed. The expected results where
> > generated use the generic implementation (which is assumed to be
> > non-buggy).
> > 2) An input of all 1s. The goal of this test is to catch any case
> > a carry is missing.
> > 3) An input that never carries. The goal of this test si to catch
> > any case of incorrectly carrying.
> >
> > More exhaustive tests that test all alignment X length pairs in
> > [0, 8192) X [0, 8192] on random data are also available here:
> > https://github.com/goldsteinn/csum-reproduction
> >
> > The reposity also has the code for reproducing the above benchmark
> > numbers.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@...il.com>
> > Signed-off-by: Dave Hansen <dave.hansen@...ux.intel.com>
>
> Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum:
> Improve performance of `csum_partial`") in linus/master stable/master
>
> > Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com
>
> This does not seem to be a message sent to a public mailing list
> archived at lore (yet).
>
> On m68k (ARAnyM):
>
> KTAP version 1
> # Subtest: checksum
> 1..3
> # test_csum_fixed_random_inputs: ASSERTION FAILED at
> lib/checksum_kunit.c:243
> Expected result == expec, but
> result == 54991 (0xd6cf)
> expec == 33316 (0x8224)
> not ok 1 test_csum_fixed_random_inputs
> # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267
> Expected result == expec, but
> result == 255 (0xff)
> expec == 65280 (0xff00)
>
> Endianness issue in the test?
>
> not ok 2 test_csum_all_carry_inputs
> # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306
> Expected result == expec, but
> result == 64515 (0xfc03)
> expec == 0 (0x0)
> not ok 3 test_csum_no_carry_inputs
> # checksum: pass:0 fail:3 skip:0 total:3
> # Totals: pass:0 fail:3 skip:0 total:3
> not ok 1 checksum
As this test already exposed bugs on x86 and UML:
https://lore.kernel.org/all/20230704083206.693155-2-davidgow@google.com
https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@google.com
there's probably a real bug in the m68k csumming code.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@...ux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Powered by blists - more mailing lists