Message-ID: <6e744d2b-bbb3-4e1f-bd61-e0e971f974db@huawei.com>
Date: Wed, 21 Aug 2024 14:58:23 +0800
From: Yongqiang Liu <liuyongqiang13@...wei.com>
To: Hyeonggon Yoo <42.hyeyoo@...il.com>
CC: <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
<zhangxiaoxu5@...wei.com>, <cl@...ux.com>, <wangkefeng.wang@...wei.com>,
<penberg@...nel.org>, <rientjes@...gle.com>, <iamjoonsoo.kim@....com>,
<akpm@...ux-foundation.org>, <vbabka@...e.cz>, <roman.gushchin@...ux.dev>
Subject: Re: [PATCH] mm, slub: prefetch freelist in ___slab_alloc()
On 2024/8/19 17:33, Hyeonggon Yoo wrote:
> On Mon, Aug 19, 2024 at 4:02 PM Yongqiang Liu <liuyongqiang13@...wei.com> wrote:
>> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
>> slab_alloc()") introduced prefetch_freepointer() for fastpath
>> allocation. Using it at the first freelist load can also give a small
>> improvement in some workloads. Here are hackbench results on an
>> arm64 machine (about 3.8%):
>>
>> Before:
>> average time cost of 'hackbench -g 100 -l 1000': 17.068
>>
>> After:
>> average time cost of 'hackbench -g 100 -l 1000': 16.416
>>
>> There is also about a 5% improvement for hackbench on an x86_64
>> machine.
> I think adding more prefetch might not be a good idea unless we have
> more real-world data supporting it because prefetch might help when slab
> is frequently used, but it will end up unnecessarily using more cache
> lines when slab is not frequently used.
Yes, prefetching unnecessary objects is a bad idea. But I think that when
an allocation enters the slowpath, it is more likely that more objects
will be needed soon. I've tested the cases from commit 0ad9500e16fe
("slub: prefetch next freelist pointer in slab_alloc()"). Here is the
result:
Before:
 Performance counter stats for './hackbench 50 process 4000' (32 runs):

       2545.28 msec task-clock            #    6.938 CPUs utilized          ( +- 1.75% )
          6166      context-switches      #    0.002 M/sec                  ( +- 1.58% )
          1129      cpu-migrations        #    0.444 K/sec                  ( +- 2.16% )
         13298      page-faults           #    0.005 M/sec                  ( +- 0.38% )
    4435113150      cycles                #    1.742 GHz                    ( +- 1.22% )
    2259717630      instructions          #    0.51  insn per cycle         ( +- 0.05% )
     385847392      branches              #  151.593 M/sec                  ( +- 0.06% )
       6205369      branch-misses         #    1.61% of all branches        ( +- 0.56% )

       0.36688 +- 0.00595 seconds time elapsed  ( +- 1.62% )

After:
 Performance counter stats for './hackbench 50 process 4000' (32 runs):

       2277.61 msec task-clock            #    6.855 CPUs utilized          ( +- 0.98% )
          5653      context-switches      #    0.002 M/sec                  ( +- 1.62% )
          1081      cpu-migrations        #    0.475 K/sec                  ( +- 1.89% )
         13217      page-faults           #    0.006 M/sec                  ( +- 0.48% )
    3751509945      cycles                #    1.647 GHz                    ( +- 1.14% )
    2253177626      instructions          #    0.60  insn per cycle         ( +- 0.06% )
     384509166      branches              #  168.821 M/sec                  ( +- 0.07% )
       6045031      branch-misses         #    1.57% of all branches        ( +- 0.58% )

       0.33225 +- 0.00321 seconds time elapsed  ( +- 0.97% )
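
To make the intent concrete, here is a minimal user-space sketch of the
idea (not the kernel code; the struct and function names are made up):
after popping an object off the freelist, prefetch the new head so that
the next allocation finds its embedded free pointer already in cache.

#include <stddef.h>

struct free_obj {
	struct free_obj *next;		/* free pointer stored inside the object */
};

static struct free_obj *freelist;	/* stand-in for the per-CPU freelist */

static void *alloc_from_freelist(void)
{
	struct free_obj *obj = freelist;

	if (!obj)
		return NULL;		/* a real allocator would refill here */

	freelist = obj->next;
	/* Hint only: warms the line the next allocation will dereference. */
	if (freelist)
		__builtin_prefetch(freelist);

	return obj;
}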
>
> Also I don't understand how adding prefetch in slowpath affects the performance
> because most allocs/frees should be done in the fastpath. Could you
> please explain?
I added some debug info to count slowpath entries for
'hackbench -g 100 -l 1000': the total number of slab allocations was
80416886, of which 7184236 went through the slowpath, i.e. about 9% of
all allocations. The perf stats on arm64 are as follows:
Before:
 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

   34766611220      branches                                                ( +- 0.01% )
     382593804      branch-misses         #    1.10% of all branches        ( +- 0.14% )
    1120091414      cache-misses                                            ( +- 0.08% )
   76810485402      L1-dcache-loads                                         ( +- 0.03% )
    1120091414      L1-dcache-load-misses #    1.46% of all L1-dcache hits  ( +- 0.08% )

       23.8854 +- 0.0804 seconds time elapsed  ( +- 0.34% )

After:
 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

   34812735277      branches                                                ( +- 0.01% )
     393449644      branch-misses         #    1.13% of all branches        ( +- 0.15% )
    1095185949      cache-misses                                            ( +- 0.15% )
   76995789602      L1-dcache-loads                                         ( +- 0.03% )
    1095185949      L1-dcache-load-misses #    1.42% of all L1-dcache hits  ( +- 0.15% )

       23.341 +- 0.104 seconds time elapsed  ( +- 0.45% )
The L1-dcache-load-miss rate seems slightly lower with the patch.
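
(For reference, the slowpath counting above was done with ad-hoc debug
counters; a rough sketch with made-up names, not part of the patch, could
look like this:)

#include <linux/atomic.h>

/* Hypothetical debug counters, not in the tree. */
static atomic_long_t dbg_alloc_total = ATOMIC_LONG_INIT(0);
static atomic_long_t dbg_alloc_slowpath = ATOMIC_LONG_INIT(0);

/* bumped once per slab allocation */
static inline void dbg_count_alloc(void)
{
	atomic_long_inc(&dbg_alloc_total);
}

/* bumped when the fastpath fails and ___slab_alloc() is entered */
static inline void dbg_count_slowpath(void)
{
	atomic_long_inc(&dbg_alloc_slowpath);
}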
>
>> Signed-off-by: Yongqiang Liu <liuyongqiang13@...wei.com>
>> ---
>> mm/slub.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index c9d8a2497fd6..f9daaff10c6a 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3630,6 +3630,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> VM_BUG_ON(!c->slab->frozen);
>> c->freelist = get_freepointer(s, freelist);
>> c->tid = next_tid(c->tid);
>> + prefetch_freepointer(s, c->freelist);
>> local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>> return freelist;
>>
>> --
>> 2.25.1
>>