[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f9d0836e-3673-380c-29e2-21b31447207f@bytedance.com>
Date: Fri, 24 Feb 2023 12:08:59 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: paulmck@...nel.org
Cc: akpm@...ux-foundation.org, tkhai@...ru, hannes@...xchg.org,
shakeelb@...gle.com, mhocko@...nel.org, roman.gushchin@...ux.dev,
muchun.song@...ux.dev, david@...hat.com, shy828301@...il.com,
sultan@...neltoast.com, dave@...olabs.net,
penguin-kernel@...ove.SAKURA.ne.jp, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 0/7] make slab shrink lockless
On 2023/2/24 02:19, Paul E. McKenney wrote:
> On Thu, Feb 23, 2023 at 09:27:18PM +0800, Qi Zheng wrote:
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
>>
>> 1. Background
>> =============
>>
>> On our servers, we often find the following system cpu hotspots:
>>
>> 44.16% [kernel] [k] down_read_trylock
>> 14.12% [kernel] [k] up_read
>> 13.43% [kernel] [k] shrink_slab
>> 5.25% [kernel] [k] count_shadow_nodes
>> 3.42% [kernel] [k] idr_find
>>
>> Then we used bpftrace to capture its calltrace as follows:
>>
>> @[
>> down_read_trylock+5
>> shrink_slab+292
>> shrink_node+640
>> do_try_to_free_pages+211
>> try_to_free_mem_cgroup_pages+266
>> try_charge_memcg+386
>> charge_memcg+51
>> __mem_cgroup_charge+44
>> __handle_mm_fault+1416
>> handle_mm_fault+260
>> do_user_addr_fault+459
>> exc_page_fault+104
>> asm_exc_page_fault+38
>> clear_user_rep_good+18
>> read_zero+100
>> vfs_read+176
>> ksys_read+93
>> do_syscall_64+62
>> entry_SYSCALL_64_after_hwframe+114
>> ]: 1868979
>>
>> It is easy to see that this is caused by the frequent failure to obtain the
>> read lock of shrinker_rwsem when reclaiming slab memory.
>>
>> Currently, the shrinker_rwsem is a global lock. And the following cases may
>> cause the above system cpu hotspots:
>>
>> a. the write lock of shrinker_rwsem was held for too long. For example, there
>> are many memcgs in the system, which causes some paths to hold locks and
>> traverse it for too long. (e.g. expand_shrinker_info())
>> b. the read lock of shrinker_rwsem was held for too long, and a writer came at
>> this time. Then this writer will be forced to wait and block all subsequent
>> readers.
>> For example:
>> - be scheduled when the read lock of shrinker_rwsem is held in
>> do_shrink_slab()
>> - some shrinker are blocked for too long. Like the case mentioned in the
>> patchset[1].
>>
>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>
>> And all the down_read_trylock() hotspots caused by the above cases can be
>> solved by replacing the shrinker_rwsem trylocks with SRCU.
Hi Paul,
>
> Glad to see that making SRCU unconditional was helpful! And I do very
> much like the idea of the shrinker running better!
+1 :)
>
> The main thing that enabled unconditional SRCU was the code added in
> v5.19 to dynamically allocate SRCU's srcu_node combining tree. This is
> important for a number of Linux distributions that have NR_CPUS up in the
> thousands, for which this combining tree is quite large. In v5.19 and
> later, srcu_struct structures without frequent call_srcu() invocations
> never allocate that combining tree. Even srcu_struct structures that
> have enough call_srcu() activity to cause the lock contention that in
> turn forces the combining tree to be allocated, that combining tree
> is sized for the actual number of CPUs present, which is usually way
> smaller than NR_CPUS.
Thank you very much for such a detailed background introduction. :)
>
> So if you are going to backport this back past v5.19, you might also
> need those SRCU changes. Or not, depending on how much memory your
> systems are equipped with. ;-)
Got it.
Thanks,
Qi
>
> Thanx, Paul
>
>> 2. Survey
>> =========
>>
>> Before doing the code implementation, I found that there were many similar
>> submissions in the community:
>>
>> a. Davidlohr Bueso submitted a patch in 2015.
>> Subject: [PATCH -next v2] mm: srcu-ify shrinkers
>> Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>> Result: It was finally merged into the linux-next branch, but failed on arm
>> allnoconfig (without CONFIG_SRCU)
>>
>> b. Tetsuo Handa submitted a patchset in 2017.
>> Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
>> Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>> Result: Finally chose to use the current simple way (break when
>> rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
>> but SRCU was not unconditionally enabled at the time.
>>
>> c. Kirill Tkhai submitted a patchset in 2018.
>> Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
>> Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>> Result: At that time, SRCU was not unconditionally enabled, and there were
>> some objections to enabling SRCU. Later, because Kirill's focus was
>> moved to other things, this patchset was not continued to be updated.
>>
>> d. Sultan Alsawaf submitted a patch in 2021.
>> Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
>> Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>> Result: Rejected because SRCU was not unconditionally enabled.
>>
>> We can find that almost all these historical commits were abandoned because SRCU
>> was not unconditionally enabled. But now SRCU has been unconditionally enable
>> by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
>> with SRCU.
>>
>> [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
>>
>> 3. Reproduction and testing
>> ===========================
>>
>> We can reproduce the down_read_trylock() hotspot through the following script:
>>
>> ```
>> #!/bin/bash
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>> mkdir /sys/fs/cgroup/memory/test
>> echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> for i in `seq 0 $1`;
>> do
>> mkdir /sys/fs/cgroup/memory/test/$i;
>> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>> mkdir -p $DIR/$i;
>> done
>> }
>>
>> do_mount()
>> {
>> for i in `seq $1 $2`;
>> do
>> mount -t tmpfs $i $DIR/$i;
>> done
>> }
>>
>> do_touch()
>> {
>> for i in `seq $1 $2`;
>> do
>> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>> dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>> done
>> }
>>
>> do_create 2000
>> do_mount 0 2000
>> do_touch 0 1000
>> ```
>>
>> Save the above script and execute it, we can get the following perf hotspots:
>>
>> 46.60% [kernel] [k] down_read_trylock
>> 18.70% [kernel] [k] up_read
>> 15.44% [kernel] [k] shrink_slab
>> 4.37% [kernel] [k] _find_next_bit
>> 2.75% [kernel] [k] xa_load
>> 2.07% [kernel] [k] idr_find
>> 1.73% [kernel] [k] do_shrink_slab
>> 1.42% [kernel] [k] shrink_lruvec
>> 0.74% [kernel] [k] shrink_node
>> 0.60% [kernel] [k] list_lru_count_one
>>
>> After applying this patchset, the hotspot becomes as follows:
>>
>> 19.53% [kernel] [k] _find_next_bit
>> 14.63% [kernel] [k] do_shrink_slab
>> 14.58% [kernel] [k] shrink_slab
>> 11.83% [kernel] [k] shrink_lruvec
>> 9.33% [kernel] [k] __blk_flush_plug
>> 6.67% [kernel] [k] mem_cgroup_iter
>> 3.73% [kernel] [k] list_lru_count_one
>> 2.43% [kernel] [k] shrink_node
>> 1.96% [kernel] [k] super_cache_count
>> 1.78% [kernel] [k] __rcu_read_unlock
>> 1.38% [kernel] [k] __srcu_read_lock
>> 1.30% [kernel] [k] xas_descend
>>
>> We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
>> which realizes the lockless slab reclaim.
>>
>> This series is based on next-20230217.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
>> Changelog in v1 -> v2:
>> - add a map_nr_max field to shrinker_info (suggested by Kirill)
>> - use shrinker_mutex in reparent_shrinker_deferred() (pointed by Kirill)
>>
>> Qi Zheng (7):
>> mm: vmscan: add a map_nr_max field to shrinker_info
>> mm: vmscan: make global slab shrink lockless
>> mm: vmscan: make memcg slab shrink lockless
>> mm: shrinkers: make count and scan in shrinker debugfs lockless
>> mm: vmscan: hold write lock to reparent shrinker nr_deferred
>> mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
>> mm: shrinkers: convert shrinker_rwsem to mutex
>>
>> drivers/md/dm-cache-metadata.c | 2 +-
>> drivers/md/dm-thin-metadata.c | 2 +-
>> fs/super.c | 2 +-
>> include/linux/memcontrol.h | 1 +
>> mm/shrinker_debug.c | 38 ++++-----
>> mm/vmscan.c | 142 +++++++++++++++++----------------
>> 6 files changed, 92 insertions(+), 95 deletions(-)
>>
>> --
>> 2.20.1
>>
--
Thanks,
Qi
Powered by blists - more mailing lists