Message-ID: <15723762-7800-4498-845e-7383a88f147b@huawei.com>
Date: Tue, 10 Dec 2024 20:27:45 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <pabeni@...hat.com>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Shuah Khan
<skhan@...uxfoundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache
(Part-2)
On 2024/12/10 0:03, Alexander Duyck wrote:
...
>
> Other than code size, have you tried using perf to profile the
> benchmark before and after? I suspect that would be telling about
> which code changes are the most likely to be causing the issues.
> Overall I don't think the size has increased all that much. I suspect
> most of this is the fact that you are inlining more of the
> functionality.
The testing result seems very sensitive to code changes and
reorganization: using the patch at the end of this mail, which works
around the problem of 'perf stat' not counting the work done in the
kernel thread, seems to provide more reasonable performance data.

The most obvious difference seems to be in 'insn per cycle', and I am
not yet sure how to relate the difference in the data below to the
performance degradation.
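As a side note, counting system-wide on just the CPUs involved should
be another way to let 'perf stat' see the kthread's work; the command
below is only an illustration of that approach (not what produced the
numbers in this mail), and it also counts any unrelated work on those
CPUs:

perf stat -C 0,1 -d -d -d -- insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000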
With patch 1:
 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

       5473.815250      task-clock (msec)         #    0.984 CPUs utilized
                18      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.022 K/sec
       14210894727      cycles                    #    2.596 GHz                      (92.78%)
       18903171767      instructions              #    1.33  insn per cycle           (92.82%)
        2997494420      branches                  #  547.606 M/sec                    (92.84%)
           7539978      branch-misses             #    0.25% of all branches          (92.84%)
        6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
          29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
          57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
            347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
        5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
            193877      L1-icache-load-misses                                         (92.91%)
        6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
            137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
        5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
               210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.563068950 seconds time elapsed
Without patch 1:
root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

       5306.644600      task-clock (msec)         #    0.984 CPUs utilized
                15      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.023 K/sec
       13776872322      cycles                    #    2.596 GHz                      (92.84%)
       13257649773      instructions              #    0.96  insn per cycle           (92.82%)
        2446901087      branches                  #  461.101 M/sec                    (92.91%)
           7172751      branch-misses             #    0.29% of all branches          (92.84%)
        5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
          38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
          65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
            191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
        4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
            175208      L1-icache-load-misses                                         (92.91%)
        5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
            140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
        4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
               170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.395104330 seconds time elapsed
Below is the perf data for the aligned API without patch 1. As the
non-aligned API above also uses test_alloc_len = 12, the aligned API
theoretically should not perform better than the non-aligned one, since
it additionally aligns fragsz based on SMP_CACHE_BYTES (see the sketch
after the perf data below). But the testing seems to show otherwise,
and I am not sure how to interpret that either:
perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':

       2447.553100      task-clock (msec)         #    0.965 CPUs utilized
                 9      context-switches          #    0.004 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.050 K/sec
        6354149177      cycles                    #    2.596 GHz                      (92.81%)
        6467793726      instructions              #    1.02  insn per cycle           (92.76%)
        1120749183      branches                  #  457.906 M/sec                    (92.81%)
           7370402      branch-misses             #    0.66% of all branches          (92.81%)
        2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
          39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
          42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
             95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
        2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
            118902      L1-icache-load-misses                                         (92.97%)
        3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
             81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
        2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
               159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       2.535085780 seconds time elapsed
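To spell out the aligning mentioned above, here is a minimal userspace
sketch of kernel-style ALIGN() rounding; SMP_CACHE_BYTES == 64 is an
assumption about this machine, and this is illustrative only, not the
page_frag code itself:

#include <stdio.h>

/* Kernel-style ALIGN(): round a size up to a power-of-two boundary. */
#define SMP_CACHE_BYTES	64
#define ALIGN(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

int main(void)
{
	unsigned int fragsz = 12;

	/* A 12-byte request consumes a whole 64-byte cacheline, so the
	 * aligned API should not touch less memory per allocation than
	 * the non-aligned API for the same test_alloc_len.
	 */
	printf("fragsz 12 -> %u bytes per fragment\n",
	       ALIGN(fragsz, SMP_CACHE_BYTES));

	return 0;
}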
>
>> Patch 1 is about refactoring the common code from __page_frag_alloc_va_align()
>> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
>> new API can make use of it as much as possible.
>>
>> Any better idea for reusing the common code while avoiding the
>> performance degradation as much as possible?
>>
>>>
>>>> 2. With the netcat test case below, there seems to be a minor
>>>> performance gain from replacing 'page_frag' with 'page_frag_cache'
>>>> using the new page_frag API after this patchset.
>>>> server: taskset -c 32 nc -l -k 1234 > /dev/null
>>>> client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>>>
>>> This test would barely touch the page pool. The fact is most of the
>>
>> I am guessing you meant page_frag here?
>>
>>> overhead for this would likely be things like TCP latency and data
>>> copy, much more than the page allocation. As such, fluctuations here
>>> are likely not related to your changes.
>>
>> But it does tell us that the replacement does not seem to cause an
>> obvious regression, right?
>
> Not really. The fragment allocator is such a small portion of this
> test that we could probably double the cost for it and it would still
> be negligible.
The main benefit of replacing the old API seems to be the batching of
page->_refcount updates and the avoidance of some page_address() calls,
but the unifying of the page_frag API may also add some overhead.
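For reference, below is a simplified sketch of that batching idea,
loosely modeled on mm's page_frag_cache: the field and constant names
follow the kernel's, but this is not the actual implementation (which,
for instance, counts the offset downward and recycles the page when
the bias runs out):

/* Take one large refcount bias per page, then hand out fragments by
 * decrementing a local counter instead of a get_page()/put_page()
 * pair per fragment; page_address() is also done once per page.
 */
static void *frag_alloc_sketch(struct page_frag_cache *nc,
			       unsigned int fragsz)
{
	if (!nc->va) {
		struct page *page = alloc_pages(GFP_ATOMIC,
						PAGE_FRAG_CACHE_MAX_ORDER);
		if (!page)
			return NULL;

		/* one atomic update covers many future fragments */
		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE - 1);
		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE;
		nc->va = page_address(page);
		nc->offset = 0;
	}

	if (nc->offset + fragsz > PAGE_FRAG_CACHE_MAX_SIZE)
		return NULL;	/* the real code refills/recycles here */

	nc->pagecnt_bias--;	/* cheap, non-atomic per-fragment "get" */
	nc->offset += fragsz;
	return nc->va + nc->offset - fragsz;
}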
>
>> I tried using a smaller MTU to amplify the impact of page allocation;
>> it seemed to have a similar result.
>
> Not surprising. However, the network is likely only a small part of
> this. I suspect that if you ran a profile, it would show the same.
>
Below is the patch that does the push operation in the insmod process
instead of in a kernel thread, as 'perf stat' does not seem to include
the data from kernel threads:
diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index e806c1866e36..a818431c38b8 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -131,30 +131,39 @@ static int __init page_frag_test_init(void)
 	init_completion(&wait);
 
 	if (test_alloc_len > PAGE_SIZE || test_alloc_len <= 0 ||
-	    !cpu_active(test_push_cpu) || !cpu_active(test_pop_cpu))
+	    !cpu_active(test_pop_cpu))
 		return -EINVAL;
 
 	ret = ptr_ring_init(&ptr_ring, nr_objs, GFP_KERNEL);
 	if (ret)
 		return ret;
 
-	tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
-					 test_push_cpu, "page_frag_push");
-	if (IS_ERR(tsk_push))
-		return PTR_ERR(tsk_push);
-
 	tsk_pop = kthread_create_on_cpu(page_frag_pop_thread, &ptr_ring,
 					test_pop_cpu, "page_frag_pop");
-	if (IS_ERR(tsk_pop)) {
-		kthread_stop(tsk_push);
+	if (IS_ERR(tsk_pop))
 		return PTR_ERR(tsk_pop);
+
+	pr_info("test_push_cpu = %d\n", test_push_cpu);
+
+	if (test_push_cpu < 0)
+		goto skip_push_thread;
+
+	tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
+					 test_push_cpu, "page_frag_push");
+	if (IS_ERR(tsk_push)) {
+		kthread_stop(tsk_pop);
+		return PTR_ERR(tsk_push);
 	}
 
+skip_push_thread:
 	start = ktime_get();
-	wake_up_process(tsk_push);
+	pr_info("waiting for test to complete\n");
 	wake_up_process(tsk_pop);
-	pr_info("waiting for test to complete\n");
+	if (test_push_cpu < 0)
+		page_frag_push_thread(&ptr_ring);
+	else
+		wake_up_process(tsk_push);
 
 	while (!wait_for_completion_timeout(&wait, msecs_to_jiffies(10000))) {
 		/* exit if there is no progress for push or pop size */