Message-ID: <87selmok1y.fsf@oracle.com>
Date: Fri, 02 May 2025 13:05:13 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: "Christoph Lameter (Ampere)" <cl@...two.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-arch@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
bpf@...r.kernel.org, arnd@...db.de, catalin.marinas@....com,
will@...nel.org, peterz@...radead.org, akpm@...ux-foundation.org,
mark.rutland@....com, harisokn@...zon.com, ast@...nel.org,
memxor@...il.com, zhenglifeng1@...wei.com, xueshuai@...ux.alibaba.com,
joao.m.martins@...cle.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v2 0/7] barrier: introduce smp_cond_load_*_timewait()

Christoph Lameter (Ampere) <cl@...two.org> writes:

> On Fri, 2 May 2025, Ankur Arora wrote:
>
>> smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
>> took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
>> things and time_expr_ns and time_limit_ns being used to decide the
>> terminating condition.
>>
>> There were some problems in the timekeeping:
>>
>> 1. How often do we do the (relatively expensive) time-check?
>
> Is this really important? We have instructions that wait on an event and
> terminate at cycle counter values, like WFET on arm64.
>
> The case where we need to perform time checks only arises if the
> processor does not support WFET but must use an event stream, or does
> not even have that available.
AFAICT the vast majority of arm64 processors in the wild don't yet
support WFET. For instance I haven't been able to find a single one
to test my WFET changes with ;).
The other part is that this needs to be in common code and x86 primarily
uses PAUSE.
So, supporting both configurations, WFE + spin on arm64 and PAUSE on x86,
needs a way of rate-limiting the time-check. Otherwise you run into
issues like this one:
  commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
  Author: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
  Date:   Tue Mar 27 23:58:45 2018 +0200

      cpuidle: poll_state: Avoid invoking local_clock() too often

      Rik reports that he sees an increase in CPU use in one benchmark
      due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
      poll_idle()" that caused poll_idle() to call local_clock() in every
      iteration of the loop.  Utilization increase generally means more
      non-idle time with respect to total CPU time (on the average) which
      implies reduced CPU frequency.

      Doug reports that limiting the rate of local_clock() invocations
      in there causes much less power to be drawn during a CPU-intensive
      parallel workload (with idle states 1 and 2 disabled to enforce more
      state 0 residency).

      These two reports together suggest that executing local_clock() on
      multiple CPUs in parallel at a high rate may cause chips to get hot
      and trigger thermal/power limits on them to kick in, so reduce the
      rate of local_clock() invocations in poll_idle() to avoid that issue.
> So the best approach is to have a simple interface where we specify the
> cycle count at which the wait is to be terminated and where we can cover
> that with one WFET instruction.
>
> The other cases then are degenerate forms of that. If only WFE is
> available then only use that if the timeout is larger than the event
> stream granularity. Or if both are not available then do the relax /
> loop thing.
That's what I had proposed for v1. But as Catalin pointed out, that's
not very useful when the caller wants to limit the overshoot.
This version tries to optimistically use WFE where possible while
minimizing the spin time.
> So the interface could be much simpler:
>
> __smp_cond_load_relaxed_wait(ptr, timeout_cycle_count)
>
> with a wrapper
>
> smp_cond_relaxed_wait_expr(ptr, expr, timeout cycle count)
Oh, I would have absolutely liked to keep the interface simple, but
couldn't see a way to do that while managing the other constraints.
For instance, different users want different clocks: poll_idle() can do
with an imprecise clock but rqspinlock needs ktime_get_mono().
I also think using standard clock types is better than using
arm64-specific cycles or TSC or whatever.
> where we check the expression too and retry if the expression is not true.
>
> The fallbacks with the spins and relax logic would be architecture
> specific.
Even if they were all architecture specific, I suspect there's quite a
lot of variation in cpu_relax() across microarchitectures. For
instance, YIELD is a nop on non-SMT arm64, but probably heavier on
SMT arm64.
Thanks for the quick review!
Ankur