Message-ID: <CAKgT0Uf7V+wMa7zz+9j9gwHC+hia3OwL_bo_O-yhn4=Xh0WadA@mail.gmail.com>
Date: Tue, 10 Dec 2024 07:58:48 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Shuah Khan <skhan@...uxfoundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
On Tue, Dec 10, 2024 at 4:27 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2024/12/10 0:03, Alexander Duyck wrote:
>
> ...
>
> >
> > Other than code size, have you tried using perf to profile the
> > benchmark before and after? I suspect that would be telling about
> > which code changes are most likely causing the issues. Overall I
> > don't think the size has increased all that much. I suspect most of
> > this is the fact that you are inlining more of the functionality.
>
> The test results seem to be very sensitive to code changes and
> reorganization. Using the patch at the end, which works around the
> problem of 'perf stat' not including the data from the kernel thread,
> seems to give more reasonable performance data.
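>
> As a side note, counting system-wide with something like the below
> should also pick up the events from the kernel thread, since '-a'
> makes perf count per CPU rather than per task (I have not compared it
> against the approach in the patch at the end, though):
>
> perf stat -a -d -d -d -- taskset -c 0 insmod ./page_frag_test.ko \
>         test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000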
>
> The most obvious difference seems to be 'insn per cycle', but I am
> not sure yet how to interpret the data below with respect to the
> performance degradation.
>
> With patch 1:
> Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
> 5473.815250 task-clock (msec) # 0.984 CPUs utilized
> 18 context-switches # 0.003 K/sec
> 1 cpu-migrations # 0.000 K/sec
> 122 page-faults # 0.022 K/sec
> 14210894727 cycles # 2.596 GHz (92.78%)
> 18903171767 instructions # 1.33 insn per cycle (92.82%)
> 2997494420 branches # 547.606 M/sec (92.84%)
> 7539978 branch-misses # 0.25% of all branches (92.84%)
> 6291190031 L1-dcache-loads # 1149.325 M/sec (92.78%)
> 29874701 L1-dcache-load-misses # 0.47% of all L1-dcache hits (92.82%)
> 57979668 LLC-loads # 10.592 M/sec (92.79%)
> 347822 LLC-load-misses # 0.01% of all LL-cache hits (92.90%)
> 5946042629 L1-icache-loads # 1086.270 M/sec (92.91%)
> 193877 L1-icache-load-misses (92.91%)
> 6820220221 dTLB-loads # 1245.972 M/sec (92.91%)
> 137999 dTLB-load-misses # 0.00% of all dTLB cache hits (92.91%)
> 5947607438 iTLB-loads # 1086.556 M/sec (92.91%)
> 210 iTLB-load-misses # 0.00% of all iTLB cache hits (85.66%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 5.563068950 seconds time elapsed
>
> Without patch 1:
> root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
> Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
> 5306.644600 task-clock (msec) # 0.984 CPUs utilized
> 15 context-switches # 0.003 K/sec
> 1 cpu-migrations # 0.000 K/sec
> 122 page-faults # 0.023 K/sec
> 13776872322 cycles # 2.596 GHz (92.84%)
> 13257649773 instructions # 0.96 insn per cycle (92.82%)
> 2446901087 branches # 461.101 M/sec (92.91%)
> 7172751 branch-misses # 0.29% of all branches (92.84%)
> 5041456343 L1-dcache-loads # 950.027 M/sec (92.84%)
> 38418414 L1-dcache-load-misses # 0.76% of all L1-dcache hits (92.76%)
> 65486400 LLC-loads # 12.340 M/sec (92.82%)
> 191497 LLC-load-misses # 0.01% of all LL-cache hits (92.79%)
> 4906456833 L1-icache-loads # 924.587 M/sec (92.90%)
> 175208 L1-icache-load-misses (92.91%)
> 5539879607 dTLB-loads # 1043.952 M/sec (92.91%)
> 140166 dTLB-load-misses # 0.00% of all dTLB cache hits (92.91%)
> 4906685698 iTLB-loads # 924.631 M/sec (92.91%)
> 170 iTLB-load-misses # 0.00% of all iTLB cache hits (85.66%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 5.395104330 seconds time elapsed
>
>
> Below is the perf data for the aligned API without patch 1. Since the
> non-aligned API above also uses test_alloc_len=12, the aligned API
> should in theory not do better, as it rounds fragsz up to
> SMP_CACHE_BYTES, but the testing seems to show otherwise and I am not
> sure how to interpret that either (see the quick arithmetic after the
> perf output below):
> perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
> Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':
>
> 2447.553100 task-clock (msec) # 0.965 CPUs utilized
> 9 context-switches # 0.004 K/sec
> 1 cpu-migrations # 0.000 K/sec
> 122 page-faults # 0.050 K/sec
> 6354149177 cycles # 2.596 GHz (92.81%)
> 6467793726 instructions # 1.02 insn per cycle (92.76%)
> 1120749183 branches # 457.906 M/sec (92.81%)
> 7370402 branch-misses # 0.66% of all branches (92.81%)
> 2847963759 L1-dcache-loads # 1163.596 M/sec (92.76%)
> 39439592 L1-dcache-load-misses # 1.38% of all L1-dcache hits (92.77%)
> 42553468 LLC-loads # 17.386 M/sec (92.71%)
> 95960 LLC-load-misses # 0.01% of all LL-cache hits (92.94%)
> 2554887203 L1-icache-loads # 1043.854 M/sec (92.97%)
> 118902 L1-icache-load-misses (92.97%)
> 3365755289 dTLB-loads # 1375.151 M/sec (92.97%)
> 81401 dTLB-load-misses # 0.00% of all dTLB cache hits (92.97%)
> 2554882937 iTLB-loads # 1043.852 M/sec (92.97%)
> 159 iTLB-load-misses # 0.00% of all iTLB cache hits (85.58%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 2.535085780 seconds time elapsed
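>
> (To quantify the aligning mentioned above: assuming SMP_CACHE_BYTES
> is 64 here, every 12-byte request in the aligned case is rounded up
> to a whole cache line, e.g.:
>
> $ echo $(( (12 + 64 - 1) & ~(64 - 1) ))
> 64
>
> so the page should be exhausted and refilled quite a bit more often,
> which is why I expected the aligned API to be slower rather than
> faster.)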
I'm not sure perf stat will tell us much here, as it is too high level
to give us the details we need. I would be more interested in the
output from perf record -g followed by a perf report, or maybe even
just a snapshot of perf top while the test is running. That should show
us where the CPU is spending most of its time and which areas are hot
before and after the change.
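
Something along these lines (the exact options aren't critical) would
be a good starting point:

perf record -a -g -- taskset -c 0 insmod ./page_frag_test.ko \
        test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
perf report --no-children

or just run "perf top -g" on the system while the test is going.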