[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4fd9106c-40a6-415a-9409-c346d7ab91ce@redhat.com>
Date: Tue, 9 Apr 2024 12:59:11 -0400
From: Waiman Long <longman@...hat.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Tejun Heo <tj@...nel.org>,
Jesper Dangaard Brouer <jesper@...udflare.com>,
"David S. Miller" <davem@...emloft.net>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Shakeel Butt <shakeelb@...gle.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
kernel-team <kernel-team@...udflare.com>, cgroups@...r.kernel.org,
Linux-MM <linux-mm@...ck.org>, Netdev <netdev@...r.kernel.org>,
bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
Ivan Babrou <ivan@...udflare.com>
Subject: Re: Advice on cgroup rstat lock
On 4/9/24 12:45, Yosry Ahmed wrote:
> On Tue, Apr 9, 2024 at 8:37 AM Waiman Long <longman@...hat.com> wrote:
>> On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
>>> Let move this discussion upstream.
>>>
>>> On 22/03/2024 19.32, Yosry Ahmed wrote:
>>>> [..]
>>>>>> There was a couple of series that made all calls to
>>>>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
>>>>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
>>>>>> problem that we used to face (except in our case, we saw hard lockups
>>>>>> in extreme scenarios):
>>>>>> https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/
>>>>>>
>>>>>> https://lore.kernel.org/lkml/20230421174020.2994750-1-yosryahmed@google.com/
>>>>>>
>>>>> I've only done the 6.6 backport, and these were in 6.5/6.6.
>>> Given I have these in my 6.6 kernel. You are basically saying I should
>>> be able to avoid IRQ-disable for the lock, right?
>>>
>>> My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
>>> and (thereby also) BH/softirq (spin_lock_irq). This cause production
>>> issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
>>> latency issues (debug via softirq_net_latency.bt [5]).
>>>
>>> [3]
>>> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
>>> [5]
>>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>>>
>>>
>>>>> And between 6.1 to 6.6 we did observe an improvement in this area.
>>>>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan
>>>>> progress)
>>>>>
>>>>> I've had a chance to get running in prod for 6.6 backport.
>>>>> As you can see in attached grafana heatmap pictures, we do observe an
>>>>> improved/reduced softirq wait time.
>>>>> These softirq "not-able-to-run" outliers is *one* of the prod issues we
>>>>> observed. As you can see, I still have other areas to improve/fix.
>>>> I am not very familiar with such heatmaps, but I am glad there is an
>>>> improvement with 6.6 and the backports. Let me know if there is
>>>> anything I could do to help with your effort.
>>> The heatmaps give me an overview, but I needed a debugging tool, so I
>>> developed some bpftrace scripts [1][2] I'm running on production.
>>> To measure how long time we hold the cgroup rstat lock (results below).
>>> Adding ACME and Daniel as I hope there is an easier way to measure lock
>>> hold time and congestion. Notice tricky release/yield in
>>> cgroup_rstat_flush_locked[4].
>>>
>>> My production results on 6.6 with backported patches (below signature)
>>> vs a our normal 6.6 kernel, with script [2]. The `@...k_time_hist_ns`
>>> shows how long time the lock+IRQs were disabled (taking into account it
>>> can be released in the loop [4]).
>>>
>>> Patched kernel:
>>>
>>> 21:49:02 time elapsed: 43200 sec
>>> @lock_time_hist_ns:
>>> [2K, 4K) 61 | |
>>> [4K, 8K) 734 | |
>>> [8K, 16K) 121500 |@@@@@@@@@@@@@@@@ |
>>> [16K, 32K) 385714
>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K) 145600 |@@@@@@@@@@@@@@@@@@@ |
>>> [64K, 128K) 156873 |@@@@@@@@@@@@@@@@@@@@@ |
>>> [128K, 256K) 261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [256K, 512K) 291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [512K, 1M) 101859 |@@@@@@@@@@@@@ |
>>> [1M, 2M) 19866 |@@ |
>>> [2M, 4M) 10146 |@ |
>>> [4M, 8M) 30633 |@@@@ |
>>> [8M, 16M) 40365 |@@@@@ |
>>> [16M, 32M) 21650 |@@ |
>>> [32M, 64M) 5842 | |
>>> [64M, 128M) 8 | |
>>>
>>> And normal 6.6 kernel:
>>>
>>> 21:48:32 time elapsed: 43200 sec
>>> @lock_time_hist_ns:
>>> [1K, 2K) 25 | |
>>> [2K, 4K) 1146 | |
>>> [4K, 8K) 59397 |@@@@ |
>>> [8K, 16K) 571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [16K, 32K) 542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [32K, 64K) 202810 |@@@@@@@@@@@@@ |
>>> [64K, 128K) 134564 |@@@@@@@@@ |
>>> [128K, 256K) 72870 |@@@@@ |
>>> [256K, 512K) 56914 |@@@ |
>>> [512K, 1M) 83140 |@@@@@ |
>>> [1M, 2M) 170514 |@@@@@@@@@@@ |
>>> [2M, 4M) 396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [4M, 8M) 755537
>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [8M, 16M) 231222 |@@@@@@@@@@@@@@@ |
>>> [16M, 32M) 76370 |@@@@@ |
>>> [32M, 64M) 1043 | |
>>> [64M, 128M) 12 | |
>>>
>>>
>>> For the unpatched kernel we see more events in 4ms to 8ms bucket than
>>> any other bucket.
>>> For patched kernel, we clearly see a significant reduction of events in
>>> the 4 ms to 64 ms area, but we still have some events in this area. I'm
>>> very happy to see these patches improves the situation. But for network
>>> processing I'm not happy to see events in area 16ms to 128ms area. If
>>> we can just avoid disabling IRQs/softirq for the lock, I would be happy.
>>>
>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
>> The cgroup_rstat_lock was originally a mutex. It was converted to a
>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
>> a spinlock"). Irq was disabled to enable calling from atomic context.
>> Since commit 0a2dc6ac3329 ("cgroup: remove
>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
>> atomic context anymore. Theoretically, we could change it back to a
>> mutex or not disabling interrupt. That will require that the API cannot
>> be called from atomic context going forward.
> I think we should avoid flushing from atomic contexts going forward
> anyway tbh. It's just too much work to do with IRQs disabled, and we
> observed hard lockups before in worst case scenarios.
>
> I think one problem that was discussed before is that flushing is
> exercised from multiple contexts and could have very high concurrency
> (e.g. from reclaim when the system is under memory pressure). With a
> mutex, the flusher could sleep with the mutex held and block other
> threads for a while.
>
> I vaguely recall experimenting locally with changing that lock into a
> mutex and not liking the results, but I can't remember much more. I
> could be misremembering though.
>
> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
> CPU iterations if rescheduling is needed or the lock is being
> contended (i.e. spin_needbreak() returns true). I had always wondered
> if it's possible to introduce a similar primitive for IRQs? We could
> also drop the lock (and re-enable IRQs) if IRQs are pending then.
I am not sure if there is a way to check if a hardirq is pending, but we
do have a local_softirq_pending() helper.
Regards,
Longman
Powered by blists - more mailing lists