Message-ID: <6e744d2b-bbb3-4e1f-bd61-e0e971f974db@huawei.com>
Date: Wed, 21 Aug 2024 14:58:23 +0800
From: Yongqiang Liu <liuyongqiang13@...wei.com>
To: Hyeonggon Yoo <42.hyeyoo@...il.com>
CC: <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
	<zhangxiaoxu5@...wei.com>, <cl@...ux.com>, <wangkefeng.wang@...wei.com>,
	<penberg@...nel.org>, <rientjes@...gle.com>, <iamjoonsoo.kim@....com>,
	<akpm@...ux-foundation.org>, <vbabka@...e.cz>, <roman.gushchin@...ux.dev>
Subject: Re: [PATCH] mm, slub: prefetch freelist in ___slab_alloc()


On 2024/8/19 17:33, Hyeonggon Yoo wrote:
> On Mon, Aug 19, 2024 at 4:02 PM Yongqiang Liu <liuyongqiang13@...wei.com> wrote:
>> commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
>> slab_alloc()") introduced prefetch_freepointer() for fastpath
>> allocation. Using it at the first load of the freelist can give a
>> small improvement in some workloads. Here are hackbench results on
>> an arm64 machine (about 3.8%):
>>
>> Before:
>>    average time cost of 'hackbench -g 100 -l 1000': 17.068
>>
>> After:
>>    average time cost of 'hackbench -g 100 -l 1000': 16.416
>>
>> There is also about a 5% improvement on an x86_64 machine
>> for hackbench.
> I think adding more prefetches might not be a good idea unless we have
> more real-world data supporting it: prefetch might help when the slab
> is frequently used, but it ends up unnecessarily occupying extra cache
> lines when the slab is not frequently used.

Yes, prefetching unnecessary objects is a bad idea. But when an
allocation enters the slowpath, I think it is more likely that more
objects will be needed soon. I've tested the cases from commit
0ad9500e16fe ("slub: prefetch next freelist pointer in slab_alloc()").
Here is the result:

Before:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

          2545.28 msec task-clock                #    6.938 CPUs utilized            ( +-  1.75% )
             6166      context-switches          #    0.002 M/sec                    ( +-  1.58% )
             1129      cpu-migrations            #    0.444 K/sec                    ( +-  2.16% )
            13298      page-faults               #    0.005 M/sec                    ( +-  0.38% )
       4435113150      cycles                    #    1.742 GHz                      ( +-  1.22% )
       2259717630      instructions              #    0.51  insn per cycle           ( +-  0.05% )
        385847392      branches                  #  151.593 M/sec                    ( +-  0.06% )
          6205369      branch-misses             #    1.61% of all branches          ( +-  0.56% )

          0.36688 +- 0.00595 seconds time elapsed  ( +-  1.62% )

After:

 Performance counter stats for './hackbench 50 process 4000' (32 runs):

          2277.61 msec task-clock                #    6.855 CPUs utilized            ( +-  0.98% )
             5653      context-switches          #    0.002 M/sec                    ( +-  1.62% )
             1081      cpu-migrations            #    0.475 K/sec                    ( +-  1.89% )
            13217      page-faults               #    0.006 M/sec                    ( +-  0.48% )
       3751509945      cycles                    #    1.647 GHz                      ( +-  1.14% )
       2253177626      instructions              #    0.60  insn per cycle           ( +-  0.06% )
        384509166      branches                  #  168.821 M/sec                    ( +-  0.07% )
          6045031      branch-misses             #    1.57% of all branches          ( +-  0.58% )

          0.33225 +- 0.00321 seconds time elapsed  ( +-  0.97% )
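
For reference, stats like the above are presumably produced by a
command along these lines (reconstructed, not taken from the original
mail), where -r 32 repeats the workload 32 times and reports the mean
and stddev:

	perf stat -r 32 -- ./hackbench 50 process 4000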

>
> Also, I don't understand how adding a prefetch in the slowpath affects
> performance, because most allocs/frees should be served by the
> fastpath. Could you please explain?

By adding some debug info to count slowpath entries (a minimal sketch
of such a counter is shown below), I measured for
'hackbench -g 100 -l 1000':

slab alloc total: 80416886, slowpath: 7184236, i.e. about 9% of all
allocations take the slowpath.
The perf stats on arm64 are as follows:

Before:

 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

      34766611220      branches                                            ( +-  0.01% )
        382593804      branch-misses             #    1.10% of all branches          ( +-  0.14% )
       1120091414      cache-misses                                        ( +-  0.08% )
      76810485402      L1-dcache-loads                                     ( +-  0.03% )
       1120091414      L1-dcache-load-misses     #    1.46% of all L1-dcache hits    ( +-  0.08% )

          23.8854 +- 0.0804 seconds time elapsed  ( +-  0.34% )

After:

 Performance counter stats for './hackbench -g 100 -l 1000' (32 runs):

      34812735277      branches                                            ( +-  0.01% )
        393449644      branch-misses             #    1.13% of all branches          ( +-  0.15% )
       1095185949      cache-misses                                        ( +-  0.15% )
      76995789602      L1-dcache-loads                                     ( +-  0.03% )
       1095185949      L1-dcache-load-misses     #    1.42% of all L1-dcache hits    ( +-  0.15% )

           23.341 +- 0.104 seconds time elapsed  ( +-  0.45% )

So the patched kernel seems to have fewer L1-dcache-load-misses.

>
>> Signed-off-by: Yongqiang Liu <liuyongqiang13@...wei.com>
>> ---
>>   mm/slub.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index c9d8a2497fd6..f9daaff10c6a 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3630,6 +3630,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>          VM_BUG_ON(!c->slab->frozen);
>>          c->freelist = get_freepointer(s, freelist);
>>          c->tid = next_tid(c->tid);
>> +       prefetch_freepointer(s, c->freelist);
>>          local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>>          return freelist;
>>
>> --
>> 2.25.1
>>
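
For context, prefetch_freepointer() from commit 0ad9500e16fe is a thin
wrapper that prefetches the next object's freelist pointer; in current
trees it looks roughly like this (a sketch, check the actual kernel
source):

	static void prefetch_freepointer(const struct kmem_cache *s, void *object)
	{
		/* The freelist pointer lives at object + s->offset. */
		prefetchw(object + s->offset);
	}

It originally used prefetch(); it was later switched to prefetchw(),
since the fetched cache line is expected to be written shortly.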
