Message-ID: <CAEEQ3wkoNFmL1h2CXu9jfmVZfBmAC9ZiTo1vLQ8Xcohc9g9J2A@mail.gmail.com>
Date: Tue, 19 Aug 2025 20:45:08 +0800
From: yunhui cui <cuiyunhui@...edance.com>
To: Alexandre Ghiti <alex@...ti.fr>
Cc: yury.norov@...il.com, linux@...musvillemoes.dk, paul.walmsley@...ive.com,
palmer@...belt.com, aou@...s.berkeley.edu, linux-riscv@...ts.infradead.org,
linux-kernel@...r.kernel.org, dennis@...nel.org, tj@...nel.org, cl@...two.org,
linux-mm@...ck.org
Subject: Re: [External] Re: [PATCH RFC 2/2] riscv: introduce percpu.h into include/asm
Hi Alex,
On Thu, Aug 7, 2025 at 10:55 PM Alexandre Ghiti <alex@...ti.fr> wrote:
>
> Hi Yunhui,
>
> On 7/18/25 16:33, yunhui cui wrote:
> > Hi Alex,
> >
> > On Fri, Jul 18, 2025 at 10:23 PM Alexandre Ghiti <alex@...ti.fr> wrote:
> >> Hi Yunhui,
> >>
> >> On 7/18/25 08:40, yunhui cui wrote:
> >>> Hi Alex,
> >>>
> >>> On Thu, Jul 17, 2025 at 9:06 PM Alexandre Ghiti <alex@...ti.fr> wrote:
> >>>> On 7/17/25 15:04, Alexandre Ghiti wrote:
> >>>>> Hi Yunhui,
> >>>>>
> >>>>> On 6/18/25 05:43, Yunhui Cui wrote:
> >>>>>> Current percpu operations rely on generic implementations, where
> >>>>>> raw_local_irq_save() introduces substantial overhead. Optimization
> >>>>>> is achieved through atomic operations and preemption disabling.
> >>>>>>
> >>>>>> Signed-off-by: Yunhui Cui <cuiyunhui@...edance.com>
> >>>>>> ---
> >>>>>> arch/riscv/include/asm/percpu.h | 138 ++++++++++++++++++++++++++++++++
> >>>>>> 1 file changed, 138 insertions(+)
> >>>>>> create mode 100644 arch/riscv/include/asm/percpu.h
> >>>>>>
> >>>>>> diff --git a/arch/riscv/include/asm/percpu.h b/arch/riscv/include/asm/percpu.h
> >>>>>> new file mode 100644
> >>>>>> index 0000000000000..423c0d01f874c
> >>>>>> --- /dev/null
> >>>>>> +++ b/arch/riscv/include/asm/percpu.h
> >>>>>> @@ -0,0 +1,138 @@
> >>>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
> >>>>>> +
> >>>>>> +#ifndef __ASM_PERCPU_H
> >>>>>> +#define __ASM_PERCPU_H
> >>>>>> +
> >>>>>> +#include <linux/preempt.h>
> >>>>>> +
> >>>>>> +#define PERCPU_RW_OPS(sz) \
> >>>>>> +static inline unsigned long __percpu_read_##sz(void *ptr) \
> >>>>>> +{ \
> >>>>>> + return READ_ONCE(*(u##sz *)ptr); \
> >>>>>> +} \
> >>>>>> + \
> >>>>>> +static inline void __percpu_write_##sz(void *ptr, unsigned long val) \
> >>>>>> +{ \
> >>>>>> + WRITE_ONCE(*(u##sz *)ptr, (u##sz)val); \
> >>>>>> +}
> >>>>>> +
> >>>>>> +#define __PERCPU_AMO_OP_CASE(sfx, name, sz, amo_insn) \
> >>>>>> +static inline void \
> >>>>>> +__percpu_##name##_amo_case_##sz(void *ptr, unsigned long val) \
> >>>>>> +{ \
> >>>>>> + asm volatile ( \
> >>>>>> + "amo" #amo_insn #sfx " zero, %[val], %[ptr]" \
> >>>>>> + : [ptr] "+A" (*(u##sz *)ptr) \
> >>>>>> + : [val] "r" ((u##sz)(val)) \
> >>>>>> + : "memory"); \
> >>>>>> +}
> >>>>>> +
> >>>>>> +#define __PERCPU_AMO_RET_OP_CASE(sfx, name, sz, amo_insn) \
> >>>>>> +static inline u##sz \
> >>>>>> +__percpu_##name##_return_amo_case_##sz(void *ptr, unsigned long val) \
> >>>>>> +{ \
> >>>>>> + register u##sz ret; \
> >>>>>> + \
> >>>>>> + asm volatile ( \
> >>>>>> + "amo" #amo_insn #sfx " %[ret], %[val], %[ptr]" \
> >>>>>> + : [ptr] "+A" (*(u##sz *)ptr), [ret] "=r" (ret) \
> >>>>>> + : [val] "r" ((u##sz)(val)) \
> >>>>>> + : "memory"); \
> >>>>>> + \
> >>>>>> + return ret + val; \
> >>>>>> +}
> >>>>>> +
> >>>>>> +#define PERCPU_OP(name, amo_insn) \
> >>>>>> + __PERCPU_AMO_OP_CASE(.b, name, 8, amo_insn) \
> >>>>>> + __PERCPU_AMO_OP_CASE(.h, name, 16, amo_insn) \
> >>>>>> + __PERCPU_AMO_OP_CASE(.w, name, 32, amo_insn) \
> >>>>>> + __PERCPU_AMO_OP_CASE(.d, name, 64, amo_insn) \
> >>>>>> +
> >>>>>> +#define PERCPU_RET_OP(name, amo_insn) \
> >>>>>> + __PERCPU_AMO_RET_OP_CASE(.b, name, 8, amo_insn) \
> >>>>>> + __PERCPU_AMO_RET_OP_CASE(.h, name, 16, amo_insn) \
> >>>>>> + __PERCPU_AMO_RET_OP_CASE(.w, name, 32, amo_insn) \
> >>>>>> + __PERCPU_AMO_RET_OP_CASE(.d, name, 64, amo_insn)
> >>>>>> +
> >>>>>> +PERCPU_RW_OPS(8)
> >>>>>> +PERCPU_RW_OPS(16)
> >>>>>> +PERCPU_RW_OPS(32)
> >>>>>> +PERCPU_RW_OPS(64)
> >>>>>> +
> >>>>>> +PERCPU_OP(add, add)
> >>>>>> +PERCPU_OP(andnot, and)
> >>>>>> +PERCPU_OP(or, or)
> >>>>>> +PERCPU_RET_OP(add, add)
> >>>>>> +
> >>>>>> +#undef PERCPU_RW_OPS
> >>>>>> +#undef __PERCPU_AMO_OP_CASE
> >>>>>> +#undef __PERCPU_AMO_RET_OP_CASE
> >>>>>> +#undef PERCPU_OP
> >>>>>> +#undef PERCPU_RET_OP
> >>>>>> +
> >>>>>> +#define _pcp_protect(op, pcp, ...) \
> >>>>>> +({ \
> >>>>>> + preempt_disable_notrace(); \
> >>>>>> + op(raw_cpu_ptr(&(pcp)), __VA_ARGS__); \
> >>>>>> + preempt_enable_notrace(); \
> >>>>>> +})
> >>>>>> +
> >>>>>> +#define _pcp_protect_return(op, pcp, args...) \
> >>>>>> +({ \
> >>>>>> + typeof(pcp) __retval; \
> >>>>>> + preempt_disable_notrace(); \
> >>>>>> + __retval = (typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args); \
> >>>>>> + preempt_enable_notrace(); \
> >>>>>> + __retval; \
> >>>>>> +})
> >>>>>> +
> >>>>>> +#define this_cpu_read_1(pcp) _pcp_protect_return(__percpu_read_8, pcp)
> >>>>>> +#define this_cpu_read_2(pcp) _pcp_protect_return(__percpu_read_16, pcp)
> >>>>>> +#define this_cpu_read_4(pcp) _pcp_protect_return(__percpu_read_32, pcp)
> >>>>>> +#define this_cpu_read_8(pcp) _pcp_protect_return(__percpu_read_64, pcp)
> >>>>>> +
> >>>>>> +#define this_cpu_write_1(pcp, val) _pcp_protect(__percpu_write_8, pcp, (unsigned long)val)
> >>>>>> +#define this_cpu_write_2(pcp, val) _pcp_protect(__percpu_write_16, pcp, (unsigned long)val)
> >>>>>> +#define this_cpu_write_4(pcp, val) _pcp_protect(__percpu_write_32, pcp, (unsigned long)val)
> >>>>>> +#define this_cpu_write_8(pcp, val) _pcp_protect(__percpu_write_64, pcp, (unsigned long)val)
> >>>>>> +
> >>>>>> +#define this_cpu_add_1(pcp, val) _pcp_protect(__percpu_add_amo_case_8, pcp, val)
> >>>>>> +#define this_cpu_add_2(pcp, val) _pcp_protect(__percpu_add_amo_case_16, pcp, val)
> >>>>>> +#define this_cpu_add_4(pcp, val) _pcp_protect(__percpu_add_amo_case_32, pcp, val)
> >>>>>> +#define this_cpu_add_8(pcp, val) _pcp_protect(__percpu_add_amo_case_64, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_add_return_1(pcp, val) \
> >>>>>> +_pcp_protect_return(__percpu_add_return_amo_case_8, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_add_return_2(pcp, val) \
> >>>>>> +_pcp_protect_return(__percpu_add_return_amo_case_16, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_add_return_4(pcp, val) \
> >>>>>> +_pcp_protect_return(__percpu_add_return_amo_case_32, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_add_return_8(pcp, val) \
> >>>>>> +_pcp_protect_return(__percpu_add_return_amo_case_64, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_and_1(pcp, val) _pcp_protect(__percpu_andnot_amo_case_8, pcp, ~val)
> >>>>>> +#define this_cpu_and_2(pcp, val) _pcp_protect(__percpu_andnot_amo_case_16, pcp, ~val)
> >>>>>> +#define this_cpu_and_4(pcp, val) _pcp_protect(__percpu_andnot_amo_case_32, pcp, ~val)
> >>>>>> +#define this_cpu_and_8(pcp, val) _pcp_protect(__percpu_andnot_amo_case_64, pcp, ~val)
> >>>>> Why do we define __percpu_andnot based on amoand, and use
> >>>>> __percpu_andnot with ~val here? Can't we just define __percpu_and?
> >>
> >> What about that ^?
> >>
> >>
> >>>>>> +
> >>>>>> +#define this_cpu_or_1(pcp, val) _pcp_protect(__percpu_or_amo_case_8, pcp, val)
> >>>>>> +#define this_cpu_or_2(pcp, val) _pcp_protect(__percpu_or_amo_case_16, pcp, val)
> >>>>>> +#define this_cpu_or_4(pcp, val) _pcp_protect(__percpu_or_amo_case_32, pcp, val)
> >>>>>> +#define this_cpu_or_8(pcp, val) _pcp_protect(__percpu_or_amo_case_64, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_xchg_1(pcp, val) _pcp_protect_return(xchg_relaxed, pcp, val)
> >>>>>> +#define this_cpu_xchg_2(pcp, val) _pcp_protect_return(xchg_relaxed, pcp, val)
> >>>>>> +#define this_cpu_xchg_4(pcp, val) _pcp_protect_return(xchg_relaxed, pcp, val)
> >>>>>> +#define this_cpu_xchg_8(pcp, val) _pcp_protect_return(xchg_relaxed, pcp, val)
> >>>>>> +
> >>>>>> +#define this_cpu_cmpxchg_1(pcp, o, n) _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> >>>>>> +#define this_cpu_cmpxchg_2(pcp, o, n) _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> >>>>>> +#define this_cpu_cmpxchg_4(pcp, o, n) _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> >>>>>> +#define this_cpu_cmpxchg_8(pcp, o, n) _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> >>>>>> +
> >>>>>> +#include <asm-generic/percpu.h>
> >>>>>> +
> >>>>>> +#endif /* __ASM_PERCPU_H */
> >>>>> It all looks good to me, just one thing, can you also implement
> >>>>> this_cpu_cmpxchg64/128()?
> >>>>>
> >>>> One last thing sorry, can you add a cover letter too?
> >>> Okay.
> >>>
> >>>> Thanks!
> >>>>
> >>>> Alex
> >>>>
> >>>>
> >>>>> And since this is almost a copy/paste from arm64, either mention it at
> >>>>> the top of the file or (better) merge both implementations somewhere
> >>>>> to avoid redefining existing code :) But up to you.
> >>> Actually, there's a concern here. We should account for scenarios
> >>> where ZABHA isn't supported. Given that xxx_8() and xxx_16() are
> >>> rarely used in practice, could we initially support only xxx_32() and
> >>> xxx_64()? For xxx_8() and xxx_16(), we could default to the generic
> >>> implementation.
> >>
> >> Why isn't lr/sc enough?
> > If I'm not mistaken, the current RISC-V does not support lr.bh/sc.bh,
> > is that right?
>
>
> Yes, that's right, but we have an implementation of cmpxchg[8|16]() that
> uses lr.w/sc.w which works (unless I missed something, I have just
> checked again), so I think that's alright no?
Since RISC-V does not provide lr.b/lr.h or sc.b/sc.h, when ZABHA is
not supported we would have to emulate 8/16-bit AMOs with lr.w/sc.w
plus extra masking and shifting. In practice, 8/16-bit percpu
operations are very rare; the counts observed during system startup
are as follows:
Adds:        8-bit: 3, 16-bit: 3, 32-bit: 31858, 64-bit: 7656
Add-Returns: 8-bit: 0, 16-bit: 0, 32-bit: 0,     64-bit: 2
ANDs:        8-bit: 0, 16-bit: 0, 32-bit: 0,     64-bit: 0
ANDNOTs:     8-bit: 0, 16-bit: 0, 32-bit: 0,     64-bit: 0
ORs:         8-bit: 0, 16-bit: 0, 32-bit: 70,    64-bit: 0
So, when ZABHA is not supported, could the 8-bit/16-bit operations
simply fall back to the generic implementation?
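Roughly what I have in mind is the untested sketch below
(CONFIG_RISCV_ISA_ZABHA is only a placeholder for whatever gate we
settle on; the sizes left undefined would then pick up the irq-save
fallback from <asm-generic/percpu.h>):

/*
 * Sketch: only wire up the byte/halfword AMO helpers when Zabha is
 * available; otherwise leave this_cpu_add_1/_2 undefined so that
 * <asm-generic/percpu.h> supplies the generic implementation.
 */
#ifdef CONFIG_RISCV_ISA_ZABHA
#define this_cpu_add_1(pcp, val) _pcp_protect(__percpu_add_amo_case_8, pcp, val)
#define this_cpu_add_2(pcp, val) _pcp_protect(__percpu_add_amo_case_16, pcp, val)
#endif
#define this_cpu_add_4(pcp, val) _pcp_protect(__percpu_add_amo_case_32, pcp, val)
#define this_cpu_add_8(pcp, val) _pcp_protect(__percpu_add_amo_case_64, pcp, val)

The same pattern would apply to the write/and/or/xchg/cmpxchg variants.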
>
> Thanks,
>
> Alex
>
>
> >
> >>
> >>>
> >>>>> Reviewed-by: Alexandre Ghiti <alexghiti@...osinc.com>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>>
> >>>>>
> >>> Thanks,
> >>> Yunhui
> >>>
> > Thanks,
> > Yunhui
> >
Thanks,
Yunhui