Date:   Mon, 11 Oct 2021 10:33:02 +0000
From:   Hyeonggon Yoo <42.hyeyoo@...il.com>
To:     Vlastimil Babka <vbabka@...e.cz>
Cc:     David Rientjes <rientjes@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Christoph Lameter <cl@...ux.com>,
        Pekka Enberg <penberg@...nel.org>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Perf and Hackbench results on my machine

Hello Vlastimil.

On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> On 10/11/21 00:49, David Rientjes wrote:
> > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> > 
> >> It's certain that an object will be not only read, but also
> >> written after allocation.
> >> 
> > 
> > Why is it certain?  I think perhaps what you meant to say is that if we 
> > are doing any prefetching here, then access will benefit from prefetchw 
> > instead of prefetch.  But it's not "certain" that allocated memory will be 
> > accessed at all.
> 
> I think the primary reason there's a prefetch is freelist traversal. The
> cacheline we prefetch will be read during the next allocation, so if we
> expect there to be one soon, prefetch might help.

I agree with that.

> That the freepointer is
> part of object itself and thus the cache line will be probably accessed also
> after the allocation, is secondary.

Right. It depends on the cache line size and on whether the first cache
line of an object is frequently accessed.

> Yeah this might help some workloads, but
> perhaps hurt others - these things might look obvious in theory but be
> rather unpredictable in practice. At least some hackbench results would help...
>

Below are my measurements. It seems prefetch(w) does not make things
worse, at least on hackbench.

Measured on 16 CPUs (ARM64) / 16G RAM
Without prefetch:

Time: 91.989
 Performance counter stats for 'hackbench -g 100 -l 10000':
        1467926.03 msec cpu-clock                 #   15.907 CPUs utilized          
          17782076      context-switches          #   12.114 K/sec                  
            957523      cpu-migrations            #  652.296 /sec                   
            104561      page-faults               #   71.230 /sec                   
     1622117569931      cycles                    #    1.105 GHz                      (54.54%)
     2002981132267      instructions              #    1.23  insn per cycle           (54.32%)
        5600876429      branch-misses                                                 (54.28%)
      642657442307      cache-references          #  437.800 M/sec                    (54.27%)
       19404890844      cache-misses              #    3.019 % of all cache refs      (54.28%)
      640413686039      L1-dcache-loads           #  436.271 M/sec                    (46.85%)
       19110650580      L1-dcache-load-misses     #    2.98% of all L1-dcache accesses  (46.83%)
      651556334841      dTLB-loads                #  443.862 M/sec                    (46.63%)
        3193647402      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.84%)
      538927659684      iTLB-loads                #  367.135 M/sec                    (54.31%)
         118503839      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
      625750168840      L1-icache-loads           #  426.282 M/sec                    (46.80%)
       24348083282      L1-icache-load-misses     #    3.89% of all L1-icache accesses  (46.78%)

      92.284351157 seconds time elapsed

      44.524693000 seconds user
    1426.214006000 seconds sys

With prefetch:

Time: 91.677

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1462938.07 msec cpu-clock                 #   15.908 CPUs utilized          
          18072550      context-switches          #   12.354 K/sec                  
           1018814      cpu-migrations            #  696.416 /sec                   
            104558      page-faults               #   71.471 /sec                   
     2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
        5702204863      branch-misses                                                 (54.28%)
      643368500985      cache-references          #  439.778 M/sec                    (54.26%)
       18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
      642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
       18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
      653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
        3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
      537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
         114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
      630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
       22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)
 
      91.964452802 seconds time elapsed

      43.416742000 seconds user
    1422.441123000 seconds sys
	
With prefetchw:

Time: 90.220

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1437418.48 msec cpu-clock                 #   15.880 CPUs utilized          
          17694068      context-switches          #   12.310 K/sec                  
            958257      cpu-migrations            #  666.651 /sec                   
            100604      page-faults               #   69.989 /sec                   
     1583259429428      cycles                    #    1.101 GHz                      (54.57%)
     2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
        5594202389      branch-misses                                                 (54.36%)
      643113574524      cache-references          #  447.409 M/sec                    (54.39%)
       18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
      640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
       17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
      651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
        3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
      535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
         113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
      628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
       22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)
 
      90.514819303 seconds time elapsed
 
      43.877656000 seconds user
    1397.176001000 seconds sys

Thanks,
Hyeonggon
