Message-ID: <36c737e1-7e1c-7098-8bd5-1767869489d9@bytedance.com>
Date: Tue, 28 Feb 2023 18:53:17 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Mike Rapoport <rppt@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: tkhai@...ru, hannes@...xchg.org, shakeelb@...gle.com,
mhocko@...nel.org, roman.gushchin@...ux.dev, muchun.song@...ux.dev,
david@...hat.com, shy828301@...il.com, sultan@...neltoast.com,
dave@...olabs.net, penguin-kernel@...ove.sakura.ne.jp,
paulmck@...nel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/8] make slab shrink lockless
On 2023/2/28 18:04, Qi Zheng wrote:
>
>
> On 2023/2/27 23:08, Mike Rapoport wrote:
>> Hi,
>>
>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng
>>>> <zhengqi.arch@...edance.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch series aims to make slab shrink lockless.
>>>>
>>>> What an awesome changelog.
>>>>
>>>>> 2. Survey
>>>>> =========
>>>>
>>>> Especially this part.
>>>>
>>>> Looking through all the prior efforts and at this patchset I am not
>>>> immediately seeing any statements about the overall effect upon
>>>> real-world workloads. For a good example, does this patchset
>>>> measurably improve throughput or energy consumption on your servers?
>>>
>>> Hi Andrew,
>>>
>>> I re-tested with the following physical machines:
>>>
>>> Architecture: x86_64
>>> CPU(s): 96
>>> On-line CPU(s) list: 0-95
>>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>
>>> I found that the explanation for the hotspot I gave in the cover letter
>>> is wrong. The down_read_trylock() hotspot is not caused by trylock
>>> failures, but simply by the atomic operation (cmpxchg) itself, which
>>> leads to a significant reduction in IPC (instructions per cycle).
>>
>> ...
>>> Then we can use the following perf command to view hotspots:
>>>
>>> perf top -U -F 999
>>>
>>> 1) Before applying this patchset:
>>>
>>> 32.31% [kernel] [k] down_read_trylock
>>> 19.40% [kernel] [k] pv_native_safe_halt
>>> 16.24% [kernel] [k] up_read
>>> 15.70% [kernel] [k] shrink_slab
>>> 4.69% [kernel] [k] _find_next_bit
>>> 2.62% [kernel] [k] shrink_node
>>> 1.78% [kernel] [k] shrink_lruvec
>>> 0.76% [kernel] [k] do_shrink_slab
>>>
>>> 2) After applying this patchset:
>>>
>>> 27.83% [kernel] [k] _find_next_bit
>>> 16.97% [kernel] [k] shrink_slab
>>> 15.82% [kernel] [k] pv_native_safe_halt
>>> 9.58% [kernel] [k] shrink_node
>>> 8.31% [kernel] [k] shrink_lruvec
>>> 5.64% [kernel] [k] do_shrink_slab
>>> 3.88% [kernel] [k] mem_cgroup_iter
>>>
>>> 2. At the same time, we use the following perf command to capture IPC
>>> information:
>>>
>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>
>>> 1) Before applying this patchset:
>>>
>>> Performance counter stats for 'system wide' (5 runs):
>>>
>>>     454187219766      cycles          test                                  ( +-  1.84% )
>>>      78896433101      instructions    test    #  0.17  insn per cycle       ( +-  0.44% )
>>>
>>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>>
>>> 2) After applying this patchset:
>>>
>>> Performance counter stats for 'system wide' (5 runs):
>>>
>>>     841954709443      cycles          test                                  ( +- 15.80% )  (98.69%)
>>>     527258677936      instructions    test    #  0.63  insn per cycle       ( +- 15.11% )  (98.68%)
>>>
>>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>>
>>> We can see that IPC drops significantly when down_read_trylock()
>>> is called at high frequency. After switching to SRCU, the IPC
>>> returns to a normal level.
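For context on what these numbers measure, the reader side of the change is
roughly the following. This is only a simplified sketch of the idea, not the
actual patch: the function names are made up for the comparison, and the
shrink_control setup, memcg-aware shrinkers, and all error handling are
omitted.

/* Before: every reclaim path takes shrinker_rwsem via an atomic cmpxchg. */
static unsigned long shrink_slab_rwsem_sketch(gfp_t gfp_mask, int nid,
					      int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	if (!down_read_trylock(&shrinker_rwsem))	/* the cmpxchg hotspot */
		return 0;

	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
		};

		freed += do_shrink_slab(&sc, shrinker, priority);
	}

	up_read(&shrinker_rwsem);
	return freed;
}

/* After: readers only enter an SRCU read-side critical section. */
DEFINE_SRCU(shrinker_srcu);

static unsigned long shrink_slab_srcu_sketch(gfp_t gfp_mask, int nid,
					     int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;
	int idx;

	idx = srcu_read_lock(&shrinker_srcu);	/* per-CPU counter, no shared cmpxchg */

	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu)) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
		};

		freed += do_shrink_slab(&sc, shrinker, priority);
	}

	srcu_read_unlock(&shrinker_srcu, idx);
	return freed;
}

Since srcu_read_lock() touches only per-CPU counters, the cross-CPU atomic
on the shared shrinker_rwsem cacheline disappears from the reclaim hot path,
which is what the IPC numbers above reflect.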
>>
>> The results you present do show improvement in IPC for an artificial test
>> script. But more interesting would be to see how real-world workloads
>> benefit from your changes.
>
> Hi Mike and Andrew,
>
> I did encounter this problem under real workloads on our online
> servers. At the end of this email, I have posted another call stack and
> hotspot profile that I captured earlier.
>
> I scanned the hotspots of all our online servers yesterday and today,
> but unfortunately could not find a live occurrence.
>
> Some of our servers run a large number of containers, and each
> container mounts some file systems. This is likely to trigger
> down_read_trylock() hotspots when the memory pressure of the whole
> machine or of a memcg is high.
And the servers where this hotspot has occurred (we have hotspot alarm
records) basically have 96 cores, 128 cores, or even more.
>
> So yesterday I found a physical server with a configuration similar to
> the online servers and ran a simulation test. The call stack and the
> hotspot in the simulation are almost exactly the same, so in theory,
> when such a hotspot appears on an online server, we should also see
> the IPC improvement. This will improve server performance in memory
> exhaustion scenarios (at the memcg or global level).
>
> The above scenario is only one aspect; the other is the lock contention
> scenario mentioned by Kirill. After applying this patch set, slab shrink
> and register_shrinker() can run completely in parallel, which fixes that
> problem (see the sketch below).
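To illustrate that second point, here is a sketch of the writer side under
the same SRCU scheme. Again this is not the actual patch: shrinker_mutex is
a made-up name, and shrinker_srcu refers to the srcu_struct from the sketch
earlier in this mail.

/* Sketch only: shrinker (un)registration under the SRCU scheme. */
static DEFINE_MUTEX(shrinker_mutex);	/* serializes writers with each other only */

void register_shrinker_sketch(struct shrinker *shrinker)
{
	mutex_lock(&shrinker_mutex);
	list_add_tail_rcu(&shrinker->list, &shrinker_list);
	mutex_unlock(&shrinker_mutex);
	/* Readers never take this mutex, so shrink_slab() keeps running. */
}

void unregister_shrinker_sketch(struct shrinker *shrinker)
{
	mutex_lock(&shrinker_mutex);
	list_del_rcu(&shrinker->list);
	mutex_unlock(&shrinker_mutex);

	/* Wait for SRCU readers that may still be using this shrinker. */
	synchronize_srcu(&shrinker_srcu);
}

Registration and the shrink_slab() readers no longer contend on
shrinker_rwsem, which is the parallelization described above.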
>
> These are the two main benefits for real workloads that I consider.
>
> Thanks,
> Qi
>
> call stack
> ----------
>
> @[
> down_read_trylock+1
> shrink_slab+128
> shrink_node+371
> do_try_to_free_pages+232
> try_to_free_pages+243
> __alloc_pages_slowpath+771
> __alloc_pages_nodemask+702
> pagecache_get_page+255
> filemap_fault+1361
> ext4_filemap_fault+44
> __do_fault+76
> handle_mm_fault+3543
> do_user_addr_fault+442
> do_page_fault+48
> page_fault+62
> ]: 1161690
> @[
> down_read_trylock+1
> shrink_slab+128
> shrink_node+371
> balance_pgdat+690
> kswapd+389
> kthread+246
> ret_from_fork+31
> ]: 8424884
> @[
> down_read_trylock+1
> shrink_slab+128
> shrink_node+371
> do_try_to_free_pages+232
> try_to_free_pages+243
> __alloc_pages_slowpath+771
> __alloc_pages_nodemask+702
> __do_page_cache_readahead+244
> filemap_fault+1674
> ext4_filemap_fault+44
> __do_fault+76
> handle_mm_fault+3543
> do_user_addr_fault+442
> do_page_fault+48
> page_fault+62
> ]: 20917631
>
> hotspot
> -------
>
> 52.22% [kernel] [k] down_read_trylock
> 19.60% [kernel] [k] up_read
> 8.86% [kernel] [k] shrink_slab
> 2.44% [kernel] [k] idr_find
> 1.25% [kernel] [k] count_shadow_nodes
> 1.18% [kernel] [k] shrink_lruvec
> 0.71% [kernel] [k] mem_cgroup_iter
> 0.71% [kernel] [k] shrink_node
> 0.55% [kernel] [k] find_next_bit
>
>
>>> Thanks,
>>> Qi
>>
>
--
Thanks,
Qi