Message-ID: <42a6d218-206b-4f87-a8fa-ef42d107fb23@kernel.org>
Date: Fri, 3 May 2024 16:00:20 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Waiman Long <longman@...hat.com>, tj@...nel.org, hannes@...xchg.org,
lizefan.x@...edance.com, cgroups@...r.kernel.org, yosryahmed@...gle.com
Cc: netdev@...r.kernel.org, linux-mm@...ck.org, shakeel.butt@...ux.dev,
kernel-team@...udflare.com, Arnaldo Carvalho de Melo <acme@...nel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and
tracepoints
On 02/05/2024 20.19, Waiman Long wrote:
> On 5/2/24 07:23, Jesper Dangaard Brouer wrote:
>>
>>
>> On 01/05/2024 20.41, Waiman Long wrote:
>>> On 5/1/24 13:22, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 01/05/2024 16.24, Waiman Long wrote:
>>>>> On 5/1/24 10:04, Jesper Dangaard Brouer wrote:
>>>>>> This closely resembles helpers added for the global
>>>>>> cgroup_rstat_lock in
>>>>>> commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and
>>>>>> tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.
>>>>>>
>>>>>> Based on production workloads, we observe the fast-path "update"
>>>>>> function cgroup_rstat_updated() is invoked around 3 million times
>>>>>> per sec, while the "flush" function cgroup_rstat_flush_locked(),
>>>>>> walking each possible CPU, can see periodic spikes of 700
>>>>>> invocations/sec.
>>>>>>
>>>>>> For this reason, the tracepoints are split into normal and fastpath
>>>>>> versions for this per-CPU lock. This makes it feasible for
>>>>>> production to continuously monitor the non-fastpath tracepoint to
>>>>>> detect lock contention issues. The reason for monitoring is that
>>>>>> this lock disables IRQs, which can disturb e.g. softirq processing
>>>>>> on the local CPUs involved. When the global cgroup_rstat_lock stops
>>>>>> disabling IRQs (e.g. converted to a mutex), this per-CPU lock
>>>>>> becomes the next bottleneck that can introduce latency variations.
>>>>>>
>>>>>> A practical bpftrace script for monitoring contention latency:
>>>>>>
>>>>>> bpftrace -e '
>>>>>> tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>>>>>> @start[tid]=nsecs; @cnt[probe]=count()}
>>>>>> tracepoint:cgroup:cgroup_rstat_cpu_locked {
>>>>>> if (args->contended) {
>>>>>> @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
>>>>>> @cnt[probe]=count()}
>>>>>> interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt);
>>>>>> clear(@cnt);}'
>>>>>
>>>>> This is a per-cpu lock. So the only possible contention involves
>>>>> just 2 CPUs: a local CPU invoking cgroup_rstat_updated(), and a
>>>>> flusher CPU doing cgroup_rstat_flush_locked() calling into
>>>>> cgroup_rstat_updated_list(). With recent commits to reduce the
>>>>> percpu lock hold time, I doubt lock contention on the percpu lock
>>>>> will have a great impact on latency.
>>>>
>>>> I do appreciate your recent changes to reduce the percpu lock hold
>>>> time. These tracepoints allow me to measure and differentiate the
>>>> percpu lock hold time vs. the flush time.
>>>>
>>>> In production (using [1]) I'm seeing "Long lock-hold time" [L100],
>>>> e.g. up to 29 ms, which is time spent after obtaining the lock
>>>> (runtime under lock). I was expecting to see "High Lock-contention
>>>> wait" [L82], which is the time spent waiting to obtain the lock.
>>>>
>>>> This is why I'm adding these tracepoints, as they allow me to dig
>>>> deeper, to understand where these high runtime variations originate
>>>> from.
>>>>
>>>>
>>>> Data:
>>>>
>>>> 16:52:09 Long lock-hold time: 14950 usec (14 ms) on CPU:34 comm:kswapd4
>>>> 16:52:09 Long lock-hold time: 14821 usec (14 ms) on CPU:34 comm:kswapd4
>>>> 16:52:09 Long lock-hold time: 11299 usec (11 ms) on CPU:98 comm:kswapd4
>>>> 16:52:09 Long lock-hold time: 17237 usec (17 ms) on CPU:113 comm:kswapd6
>>>> 16:52:09 Long lock-hold time: 29000 usec (29 ms) on CPU:36 comm:kworker/u261:12
>>> That lock hold time is much higher than I would have expected.
>>>> 16:52:09 time elapsed: 80 sec (interval = 1 sec)
>>>> Flushes(5033) 294/interval (avg 62/sec)
>>>> Locks(53374) 1748/interval (avg 667/sec)
>>>> Yields(48341) 1454/interval (avg 604/sec)
>>>> Contended(48104) 1450/interval (avg 601/sec)
>>>>
>>>>
>>>>> So do we really need such an elaborate scheme to monitor this? BTW,
>>>>> the additional code will also add to the worst case latency.
>>>>
>>>> Hmm, I designed this code to have minimal impact, as tracepoints are
>>>> no-ops until activated. I really doubt this code will change the
>>>> latency.
>>>>
>>>>
>>>> [1]
>>>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
>>>>
>>>> [L100]
>>>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L100
>>>>
>>>> [L82]
>>>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L82
>>>>
>>>>>>
>>>>>> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
>>>>
>>>> More data: the histogram of time spent under the lock shows some
>>>> strange variation, with a group in the 4 ms to 65 ms area.
>>>> Investigating what can be causing this... the next step depends on
>>>> these tracepoints.
>>>>
>>>> @lock_cnt: 759146
>>>>
>>>> @locked_ns:
>>>> [1K, 2K) 499 | |
>>>> [2K, 4K) 206928 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [4K, 8K) 147904 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>>> [8K, 16K) 64453 |@@@@@@@@@@@@@@@@ |
>>>> [16K, 32K) 135467 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>>> [32K, 64K) 75943 |@@@@@@@@@@@@@@@@@@@ |
>>>> [64K, 128K) 38359 |@@@@@@@@@ |
>>>> [128K, 256K) 46597 |@@@@@@@@@@@ |
>>>> [256K, 512K) 32466 |@@@@@@@@ |
>>>> [512K, 1M) 3945 | |
>>>> [1M, 2M) 642 | |
>>>> [2M, 4M) 750 | |
>>>> [4M, 8M) 1932 | |
>>>> [8M, 16M) 2114 | |
>>>> [16M, 32M) 1039 | |
>>>> [32M, 64M) 108 | |
>>>>
>>>>
>>>>
>>>>
>>>>>> ---
>>>>>>   include/trace/events/cgroup.h |  56 +++++++++++++++++++++++++++++----
>>>>>>   kernel/cgroup/rstat.c         |  70 ++++++++++++++++++++++++++++++++++-------
>>>>>> 2 files changed, 108 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
>>>>>> index 13f375800135..0b95865a90f3 100644
>>>>>> --- a/include/trace/events/cgroup.h
>>>>>> +++ b/include/trace/events/cgroup.h
>> [...]
>>>>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
>>>>>> +
>>>>>> + TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>>>>> +
>>>>>> + TP_ARGS(cgrp, cpu, contended)
>>>>>> +);
>>>>>> +
>>>>>> #endif /* _TRACE_CGROUP_H */
>>>>>> /* This part must be outside protection */
>>>>>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>>>>>> index 52e3b0ed1cee..fb8b49437573 100644
>>>>>> --- a/kernel/cgroup/rstat.c
>>>>>> +++ b/kernel/cgroup/rstat.c
>>>>>> @@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>>>>>> return per_cpu_ptr(cgrp->rstat_cpu, cpu);
>>>>>> }
>>>>>> +/*
>>>>>> + * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>>>>>> + *
>>>>>> + * This makes it easier to diagnose locking issues and contention in
>>>>>> + * production environments. The parameter @fast_path determines the
>>>>>> + * tracepoints being added, allowing us to diagnose "flush" related
>>>>>> + * operations without handling high-frequency fast-path "update" events.
>>>>>> + */
>>>>>> +static __always_inline
>>>>>> +unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>>>>>> +                                     struct cgroup *cgrp, const bool fast_path)
>>>>>> +{
>>>>>> + unsigned long flags;
>>>>>> + bool contended;
>>>>>> +
>>>>>> + /*
>>>>>> + * The _irqsave() is needed because cgroup_rstat_lock is
>>>>>> + * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>>>>>> + * this lock with the _irq() suffix only disables interrupts on
>>>>>> + * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>>>>>> + * interrupts on both configurations. The _irqsave() ensures
>>>>>> + * that interrupts are always disabled and later restored.
>>>>>> + */
>>>>>> + contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>>>>>> + if (contended) {
>>>>>> + if (fast_path)
>>>>>> + trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
>>>>>> + else
>>>>>> + trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
>>>>>> +
>>>>>> + raw_spin_lock_irqsave(cpu_lock, flags);
>>>
>>> Could you do a local_irq_save() before calling trace_cgroup*() and
>>> raw_spin_lock()? Would that help in eliminating this high lock hold
>>> time?
>>>
>>
>> Nope, it will not eliminate the high lock *hold* time, because the
>> hold-start timestamp is first taken *AFTER* obtaining the lock.
>>
>> It could help the contended "wait-time" measurement, but my prod
>> measurements show this isn't an issue.
>
> Right.
>
>
>>
>>> You can also do a local_irq_save() first before the trylock. That
>>> will eliminate the duplicated irq_restore() and irq_save() when there
>>> is contention.
>>
>> I wrote the code like this on purpose ;-)
>> My issue with this code/lock is that it causes latency issues for
>> softirq NET_RX. So, when I detect a "contended" lock event, I do want
>> an irq_restore(), as that will allow networking/do_softirq() to run
>> before I start waiting for the lock (with IRQs disabled).
>>
> Assuming the time taken by the tracing code is negligible, we are
> talking about disabling IRQs almost immediately after enabling them.
> The trylock time should be relatively short, so the additional delay
> due to IRQs being disabled for the whole period is insignificant.
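
For reference, a minimal sketch of the two acquisition orderings we are
discussing, using the names from the patch above (the fastpath variant
of the tracepoint is omitted here for brevity):

   /* Ordering in this patch: on trylock failure the irqsave is dropped
    * again, so pending softirqs (e.g. NET_RX) get a chance to run
    * before we block on the lock with IRQs disabled.
    */
   contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
   if (contended) {
           trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
           raw_spin_lock_irqsave(cpu_lock, flags);
   }

   /* Suggested alternative: disable IRQs once, up front, which avoids
    * the restore+save pair on contention, but keeps IRQs off while we
    * wait for the lock.
    */
   local_irq_save(flags);
   if (!raw_spin_trylock(cpu_lock)) {
           trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, true);
           raw_spin_lock(cpu_lock);
   }
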
>>
>>> If not, there may be NMIs mixed in.
>>>
>>
>> NMIs are definitely on my list of things to investigate.
>> These AMD CPUs also have other types of interrupts that need a close
>> look.
>>
>> The easier explanation is that the lock isn't "yielded" on every cycle
>> through the for-each-CPU loop.
>>
>> Let's look at the data I provided above:
>>
>> >> Flushes(5033) 294/interval (avg 62/sec)
>> >> Locks(53374) 1748/interval (avg 667/sec)
>> >> Yields(48341) 1454/interval (avg 604/sec)
>> >> Contended(48104) 1450/interval (avg 601/sec)
>>
>> In this 1 second sample, we have 294 flushes and more yields (1454),
>> which is great, but the factor is not 128 (num-of-CPUs) but closer to
>> 5. Thus, on average we hold the lock for a walk of (128/5) 25.6 CPUs.
>>
>> We have spoken before about releasing the lock for each CPU in the
>> flush loop... it will likely solve this long hold time, but IMHO a
>> mutex is still the better solution.
>
> I may have mistakenly been thinking that the lock hold time refers to
> just the cpu_lock. Your reported times here are about the
> cgroup_rstat_lock, right? If so, the numbers make sense to me.
>
True, my reported numbers here are about the cgroup_rstat_lock.
Glad to hear we are more aligned then :-)

Given I just got some prod machines online with this patch's
cgroup_rstat_cpu_lock tracepoints, I can give you some early results
about hold-time for the cgroup_rstat_cpu_lock, from this one-liner
bpftrace command:
sudo bpftrace -e '
  tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
    @start[tid]=nsecs; @cnt[probe]=count()}
  tracepoint:cgroup:cgroup_rstat_cpu_locked {
    $now=nsecs;
    if (args->contended) {
      @wait_per_cpu_ns=hist($now-@start[tid]); delete(@start[tid]);}
    @cnt[probe]=count(); @locked[tid]=$now}
  tracepoint:cgroup:cgroup_rstat_cpu_unlock {
    $now=nsecs;
    @locked_per_cpu_ns=hist($now-@locked[tid]);
    delete(@locked[tid]);
    @cnt[probe]=count()}
  interval:s:1 {time("%H:%M:%S "); print(@wait_per_cpu_ns);
    print(@locked_per_cpu_ns); print(@cnt); clear(@cnt);}'
Results from one 1-sec period:
13:39:55 @wait_per_cpu_ns:
[512, 1K)       3 |                                                    |
[1K, 2K)       12 |@                                                   |
[2K, 4K)      390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)       70 |@@@@@@@@@                                           |
[8K, 16K)      24 |@@@                                                 |
[16K, 32K)    183 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[32K, 64K)     11 |@                                                   |
@locked_per_cpu_ns:
[256, 512)     75592 |@                                                   |
[512, 1K)    2537357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)      528615 |@@@@@@@@@@                                          |
[2K, 4K)      168519 |@@@                                                 |
[4K, 8K)      162039 |@@@                                                 |
[8K, 16K)     100730 |@@                                                  |
[16K, 32K)     42276 |                                                    |
[32K, 64K)      1423 |                                                    |
[64K, 128K)       89 |                                                    |
@cnt[tracepoint:cgroup:cgroup_rstat_cpu_lock_contended]: 3 /sec
@cnt[tracepoint:cgroup:cgroup_rstat_cpu_unlock]: 3200 /sec
@cnt[tracepoint:cgroup:cgroup_rstat_cpu_locked]: 3200 /sec
So, we see the "flush-code-path" per-CPU hold time (@locked_per_cpu_ns)
doesn't exceed 128 usec.

My latency requirement, to avoid RX-queue overflow with 1024 slots
running at 25 Gbit/s, is 27.6 usec with small packets and 500 usec
(0.5 ms) with MTU-size packets. The observed hold times come very close
to these latency requirements.
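
As a sanity check on those budget numbers, a quick back-of-the-envelope
calculation (a standalone userspace C sketch, not kernel code; it
assumes 20 bytes of preamble/SFD + inter-frame gap per frame, 64-byte
minimum frames for "small packets" and 1518-byte frames for MTU-size
packets):

  #include <stdio.h>

  /* Time to fill an RX ring of 'slots' descriptors at 'gbps' line rate,
   * for a given Ethernet frame size. Each frame occupies frame_bytes
   * plus 20 bytes of preamble/SFD + inter-frame gap on the wire.
   */
  static double fill_time_usec(int slots, double gbps, int frame_bytes)
  {
          double wire_bits = (frame_bytes + 20) * 8.0;
          double pkt_time_sec = wire_bits / (gbps * 1e9);

          return slots * pkt_time_sec * 1e6;
  }

  int main(void)
  {
          printf("small pkts: %.1f usec\n", fill_time_usec(1024, 25.0, 64));
          printf("MTU pkts:   %.1f usec\n", fill_time_usec(1024, 25.0, 1518));
          return 0;
  }

This prints roughly 27.5 usec and 504 usec, in line with the 27.6 usec
and 0.5 ms figures above (the small difference depends on exactly which
framing overhead is counted).
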
--Jesper