[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bbcd23a0-1869-ec95-87a4-4499b50b9683@bytedance.com>
Date: Thu, 9 Mar 2023 14:47:06 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Vlastimil Babka <vbabka@...e.cz>, akpm@...ux-foundation.org,
tkhai@...ru, hannes@...xchg.org, shakeelb@...gle.com,
mhocko@...nel.org, roman.gushchin@...ux.dev, muchun.song@...ux.dev,
david@...hat.com, shy828301@...il.com, rppt@...nel.org
Cc: sultan@...neltoast.com, dave@...olabs.net,
penguin-kernel@...ove.SAKURA.ne.jp, paulmck@...nel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless
Hi Vlastimil,
On 2023/3/9 06:46, Vlastimil Babka wrote:
> On 3/7/23 07:56, Qi Zheng wrote:
>> Like global slab shrink, this commit also uses SRCU to make
>> memcg slab shrink lockless.
>>
>> We can reproduce the down_read_trylock() hotspot through the
>> following script:
>>
>> ```
>>
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>> mkdir -p /sys/fs/cgroup/memory/test
>> mkdir -p /sys/fs/cgroup/perf_event/test
>> echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> for i in `seq 0 $1`;
>> do
>> mkdir -p /sys/fs/cgroup/memory/test/$i;
>> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>> echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>> mkdir -p $DIR/$i;
>> done
>> }
>>
>> do_mount()
>> {
>> for i in `seq $1 $2`;
>> do
>> mount -t tmpfs $i $DIR/$i;
>> done
>> }
>>
>> do_touch()
>> {
>> for i in `seq $1 $2`;
>> do
>> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>> echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>> dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>> done
>> }
>>
>> case "$1" in
>> touch)
>> do_touch $2 $3
>> ;;
>> test)
>> do_create 4000
>> do_mount 0 4000
>> do_touch 0 3000
>> ;;
>> *)
>> exit 1
>> ;;
>> esac
>> ```
>>
>> Save the above script, then run test and touch commands.
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>> 32.31% [kernel] [k] down_read_trylock
>> 19.40% [kernel] [k] pv_native_safe_halt
>> 16.24% [kernel] [k] up_read
>> 15.70% [kernel] [k] shrink_slab
>> 4.69% [kernel] [k] _find_next_bit
>> 2.62% [kernel] [k] shrink_node
>> 1.78% [kernel] [k] shrink_lruvec
>> 0.76% [kernel] [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>> 27.83% [kernel] [k] _find_next_bit
>> 16.97% [kernel] [k] shrink_slab
>> 15.82% [kernel] [k] pv_native_safe_halt
>> 9.58% [kernel] [k] shrink_node
>> 8.31% [kernel] [k] shrink_lruvec
>> 5.64% [kernel] [k] do_shrink_slab
>> 3.88% [kernel] [k] mem_cgroup_iter
>>
>> At the same time, we use the following perf command to capture
>> IPC information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>> 454187219766 cycles test ( +- 1.84% )
>> 78896433101 instructions test # 0.17 insn per cycle ( +- 0.44% )
>>
>> 10.0020430 +- 0.0000366 seconds time elapsed ( +- 0.00% )
>>
>> 2) After applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>> 841954709443 cycles test ( +- 15.80% ) (98.69%)
>> 527258677936 instructions test # 0.63 insn per cycle ( +- 15.11% ) (98.68%)
>>
>> 10.01064 +- 0.00831 seconds time elapsed ( +- 0.08% )
>>
>> We can see that IPC drops very seriously when calling
>> down_read_trylock() at high frequency. After using SRCU,
>> the IPC is at a normal level.
>
> The interpretation looks somewhat weird to me. I'd say the workload is
> stalled a lot as it fails the trylock (there might be some optimistic
> spinning perhaps) and then goes to sleep. See how "pv_native_safe_halt" is
> also more prominent in before. And because of that sleeping, there's less
> instructions executed in the same amount of cycles (as it's a system wide
> collection, otherwise it wouldn't be collecting the sleeping processes).
But in my tests, the trylock basically did not fail, so I think it is
caused by high-frequency atomic operation.
bpftrace -e 'kr:down_read_trylock {@[kstack, retval]=count();}
interval:s:1 {exit();}'
Attaching 2 probes...
<...>
@[
shrink_slab+288
shrink_node+640
do_try_to_free_pages+203
try_to_free_mem_cgroup_pages+266
try_charge_memcg+412
charge_memcg+51
__mem_cgroup_charge+44
__handle_mm_fault+2119
handle_mm_fault+272
do_user_addr_fault+712
exc_page_fault+124
asm_exc_page_fault+38
clear_user_erms+14
read_zero+86
vfs_read+173
ksys_read+93
do_syscall_64+56
entry_SYSCALL_64_after_hwframe+99
, 1]: 617019
@[
shrink_slab+288
shrink_node+640
do_try_to_free_pages+203
try_to_free_mem_cgroup_pages+266
try_charge_memcg+412
charge_memcg+51
__mem_cgroup_charge+44
shmem_add_to_page_cache+545
shmem_get_folio_gfp+621
shmem_write_begin+95
generic_perform_write+257
__generic_file_write_iter+202
generic_file_write_iter+97
vfs_write+704
ksys_write+93
do_syscall_64+56
entry_SYSCALL_64_after_hwframe+99
, 1]: 617065
>
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@...edance.com>
>
> Other than that:
>
> Acked-by: Vlastimil Babka <Vbabka@...e.cz>
Thanks.
>
> A small thing below:
>
>> ---
>> mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
>> 1 file changed, 27 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8515ac40bcaf..1de9bc3e5aa2 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -57,6 +57,7 @@
>> #include <linux/khugepaged.h>
>> #include <linux/rculist_nulls.h>
>> #include <linux/random.h>
>> +#include <linux/srcu.h>
>
> I guess this should have been in patch 2/8 already? It may work accidentaly
> because some other header pulls it transitively...
Yeah, in fact, patch 3/8 also can compile successfully without srcu.h,
but maybe it is better to explicitly include this header file, I will
add it in patch 2/8.
Thanks,
Qi
>
>
Powered by blists - more mailing lists