Message-ID: <CAFnufp2M_9_TRxoXbRK0bggPXyTgffYnA4moez=uWDNNb=aT8w@mail.gmail.com>
Date: Sun, 19 Sep 2021 21:13:24 +0200
From: Matteo Croce <mcroce@...ux.microsoft.com>
To: David Laight <David.Laight@...lab.com>
Cc: Guo Ren <guoren@...nel.org>, Palmer Dabbelt <palmer@...belt.com>,
linux-riscv <linux-riscv@...ts.infradead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-arch <linux-arch@...r.kernel.org>,
Paul Walmsley <paul.walmsley@...ive.com>,
Albert Ou <aou@...s.berkeley.edu>,
Atish Patra <Atish.Patra@....com>,
Emil Renner Berthing <kernel@...il.dk>,
Akira Tsukamoto <akira.tsukamoto@...il.com>,
Drew Fustini <drew@...gleboard.org>,
Bin Meng <bmeng.cn@...il.com>,
Christoph Hellwig <hch@...radead.org>
Subject: Re: [PATCH] riscv: use the generic string routines
On Mon, Sep 13, 2021 at 1:35 PM David Laight <David.Laight@...lab.com> wrote:
>
> > > These ended up getting rejected by Linus, so I'm going to hold off on
> > > this for now. If they're really out of lib/ then I'll take the C
> > > routines in arch/riscv, but either way it's an issue for the next
> > > release.
> > Agreed, we should take the C routines in arch/riscv as the common
> > implementation. If any vendor wants a custom implementation, they could
> > use the alternative framework in errata for string operations.
>
> I thought the asm ones were significantly faster because
> they were less affected by read latency.
>
> (But they were horribly broken for misaligned transfers.)
>
I can get exactly the same performance (and very similar machine code)
in C with this on top of the C memset implementation:
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -112,9 +112,12 @@ EXPORT_SYMBOL(__memmove);
 void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
 EXPORT_SYMBOL(memmove);
+#define BATCH 4
+
 void *__memset(void *s, int c, size_t count)
 {
 	union types dest = { .as_u8 = s };
+	int i;
 	if (count >= MIN_THRESHOLD) {
 		unsigned long cu = (unsigned long)c;
@@ -138,8 +141,12 @@ void *__memset(void *s, int c, size_t count)
 		}
 		/* Copy using the largest size allowed */
-		for (; count >= BYTES_LONG; count -= BYTES_LONG)
-			*dest.as_ulong++ = cu;
+		for (; count >= BYTES_LONG * BATCH; count -= BYTES_LONG * BATCH) {
+#pragma GCC unroll 4
+			for (i = 0; i < BATCH; i++)
+				dest.as_ulong[i] = cu;
+			dest.as_ulong += BATCH;
+		}
 	}
On the BeagleV, the memset speeds with the different batch sizes are:
1 (stock): 267 Mb/s
2: 272 Mb/s
4: 276 Mb/s
8: 276 Mb/s
The problem with the bigger batch sizes is that it will fall back to a
single-byte copy if the buffers are too small.
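One possible way around that fallback, as an untested-on-hardware userspace
sketch rather than a kernel patch: keep a plain word-sized loop after the
batched one, so only the last few bytes are written one at a time. The name
batched_memset is hypothetical, BATCH and BYTES_LONG are redefined locally,
and the word stores go through memcpy() to stay within standard C aliasing
rules (I'd expect the compiler to turn each one into a single store).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BATCH 4
#define BYTES_LONG sizeof(unsigned long)

/* Hypothetical userspace helper; not the kernel's __memset(). */
static void *batched_memset(void *s, int c, size_t count)
{
	unsigned char *p = s;
	unsigned long cu = (unsigned char)c;
	size_t i;

	assert(s != NULL || count == 0);

	/* Replicate the fill byte into every byte of the word. */
	cu *= ~0UL / 0xff;

	/* Byte stores until the pointer is word aligned. */
	while (count && ((uintptr_t)p & (BYTES_LONG - 1))) {
		*p++ = (unsigned char)c;
		count--;
	}

	/* Batched word stores, BATCH words per iteration. */
	for (; count >= BYTES_LONG * BATCH; count -= BYTES_LONG * BATCH) {
		for (i = 0; i < BATCH; i++)
			memcpy(p + i * BYTES_LONG, &cu, BYTES_LONG);
		p += BYTES_LONG * BATCH;
	}

	/* Word-sized tail: at most BATCH - 1 iterations. */
	for (; count >= BYTES_LONG; count -= BYTES_LONG) {
		memcpy(p, &cu, BYTES_LONG);
		p += BYTES_LONG;
	}

	/* Byte tail: fewer than BYTES_LONG bytes remain. */
	while (count--)
		*p++ = (unsigned char)c;

	return s;
}
```

With this shape, a buffer smaller than BYTES_LONG * BATCH still gets word
stores, at the cost of one extra loop. I have not measured whether that
changes the numbers above.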
Regards,
--
per aspera ad upstream