Message-ID: <389876b8-e565-4dc9-bc87-d97a639ff585@huawei.com>
Date: Wed, 11 Dec 2024 20:52:15 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <pabeni@...hat.com>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Shuah Khan
	<skhan@...uxfoundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
	Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache
 (Part-2)

On 2024/12/10 23:58, Alexander Duyck wrote:

> 
> I'm not sure perf stat will tell us much as it is really too high
> level to give us much in the way of details. I would be more
> interested in the output from perf record -g followed by a perf
> report, or maybe even just a snapshot from perf top while the test is
> running. That should show us where the CPU is spending most of its
> time and what areas are hot in the before and after graphs.

It seems the bottleneck is on the freeing side: in 'perf top' on the
push CPU, page_frag_free() took up to about 50% of CPU for the
non-aligned API and 16% of CPU for the aligned API.

Applying the patch below causes page_frag_free() to disappear from
'perf top' on the push CPU. New performance data is below:
Without patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

         21.084113      task-clock (msec)         #    0.008 CPUs utilized            ( +-  1.59% )
                 7      context-switches          #    0.334 K/sec                    ( +-  1.25% )
                 1      cpu-migrations            #    0.031 K/sec                    ( +- 20.20% )
                78      page-faults               #    0.004 M/sec                    ( +-  0.26% )
          54748233      cycles                    #    2.597 GHz                      ( +-  1.59% )
          61637051      instructions              #    1.13  insn per cycle           ( +-  0.13% )
          14727268      branches                  #  698.501 M/sec                    ( +-  0.11% )
             20178      branch-misses             #    0.14% of all branches          ( +-  0.94% )

       2.637345524 seconds time elapsed                                          ( +-  0.19% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

         19.669259      task-clock (msec)         #    0.009 CPUs utilized            ( +-  2.91% )
                 7      context-switches          #    0.356 K/sec                    ( +-  1.04% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +- 68.82% )
                77      page-faults               #    0.004 M/sec                    ( +-  0.27% )
          51077447      cycles                    #    2.597 GHz                      ( +-  2.91% )
          58875368      instructions              #    1.15  insn per cycle           ( +-  4.47% )
          14040015      branches                  #  713.805 M/sec                    ( +-  4.68% )
             20150      branch-misses             #    0.14% of all branches          ( +-  0.64% )

       2.226539190 seconds time elapsed                                          ( +-  0.12% )

With patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

         20.782788      task-clock (msec)         #    0.008 CPUs utilized            ( +-  0.09% )
                 7      context-switches          #    0.342 K/sec                    ( +-  0.97% )
                 1      cpu-migrations            #    0.031 K/sec                    ( +- 16.83% )
                78      page-faults               #    0.004 M/sec                    ( +-  0.31% )
          53967333      cycles                    #    2.597 GHz                      ( +-  0.08% )
          61577257      instructions              #    1.14  insn per cycle           ( +-  0.02% )
          14712140      branches                  #  707.900 M/sec                    ( +-  0.02% )
             20234      branch-misses             #    0.14% of all branches          ( +-  0.55% )

       2.677974457 seconds time elapsed                                          ( +-  0.15% )

root@(none):/home# perf stat -r 20 insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1

insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

         20.420537      task-clock (msec)         #    0.009 CPUs utilized            ( +-  0.05% )
                 7      context-switches          #    0.345 K/sec                    ( +-  0.71% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +-100.00% )
                77      page-faults               #    0.004 M/sec                    ( +-  0.23% )
          53038942      cycles                    #    2.597 GHz                      ( +-  0.05% )
          59965712      instructions              #    1.13  insn per cycle           ( +-  0.03% )
          14372507      branches                  #  703.826 M/sec                    ( +-  0.03% )
             20580      branch-misses             #    0.14% of all branches          ( +-  0.56% )

       2.287783171 seconds time elapsed                                          ( +-  0.12% )

It seems that the bottleneck is still the freeing side, so the above
result might not be as meaningful as it should be.
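
For reference, the elapsed-time difference between the 'Without patch 1'
and 'With patch 1' runs above can be worked out with a small helper like
the sketch below. The function names pct_change() and
print_elapsed_deltas() are made up for illustration; only the elapsed
times are taken from the perf stat output above. With those numbers the
'With patch 1' runs come out about 1.5% (non-aligned) and 2.8% (aligned)
longer.

```c
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Elapsed seconds copied from the perf stat runs quoted above. */
static void print_elapsed_deltas(void)
{
	printf("non-aligned: %+.2f%%\n",
	       pct_change(2.637345524, 2.677974457));
	printf("aligned:     %+.2f%%\n",
	       pct_change(2.226539190, 2.287783171));
}
```

Calling print_elapsed_deltas() from main() prints both deltas; a
positive value means the run with patch 1 took longer.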

As we can't use more than one CPU for the freeing side with a single
ptr_ring unless some lock is added, it seems something more complicated
might need to be done in order to support more than one CPU for the
freeing side?

Before patch 1, __page_frag_alloc_align() took up to 3.62% of CPU
in 'perf top'.
After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
took up to 4.67% + 1.01% = 5.68%.
With results this similar, I am not sure whether the CPU usage is able to
tell us about the performance degradation here, as it seems to be quite large?

@@ -100,13 +100,20 @@ static int page_frag_push_thread(void *arg)
                if (!va)
                        continue;

-               ret = __ptr_ring_produce(ring, va);
-               if (ret) {
+               do {
+                       ret = __ptr_ring_produce(ring, va);
+                       if (!ret) {
+                               va = NULL;
+                               break;
+                       } else {
+                               cond_resched();
+                       }
+               } while (!force_exit);
+
+               if (va)
                        page_frag_free(va);
-                       cond_resched();
-               } else {
+               else
                        test_pushed++;
-               }
        }

        pr_info("page_frag push test thread exits on cpu %d\n",
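
The retry loop in the diff above can be illustrated outside the kernel
with the userspace sketch below. Everything in it is an assumption made
for illustration: struct toy_ring and its helpers are a toy
single-threaded stand-in for ptr_ring, and relax() stands in for
cond_resched(). It is not the test module's code. The point is only the
control flow: keep retrying the produce until it succeeds or an exit is
requested, and let the caller free the fragment only in the latter case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4	/* hypothetical size, small on purpose */

/* Toy stand-in for struct ptr_ring: a NULL slot means "empty". */
struct toy_ring {
	void *slot[RING_SIZE];
	size_t head;	/* next slot to produce into */
	size_t tail;	/* next slot to consume from */
};

/* Returns 0 on success, -1 if the ring is full. */
static int toy_ring_produce(struct toy_ring *r, void *ptr)
{
	assert(ptr != NULL);	/* NULL is reserved for "empty slot" */
	if (r->slot[r->head])
		return -1;
	r->slot[r->head] = ptr;
	r->head = (r->head + 1) % RING_SIZE;
	return 0;
}

/* Returns the oldest queued pointer, or NULL if the ring is empty. */
static void *toy_ring_consume(struct toy_ring *r)
{
	void *ptr = r->slot[r->tail];

	if (ptr) {
		r->slot[r->tail] = NULL;
		r->tail = (r->tail + 1) % RING_SIZE;
	}
	return ptr;
}

/*
 * Retry until va is queued or *force_exit becomes true.
 * Returns true if va was queued; on false the caller still owns va
 * and has to free it, as the push thread does after its loop.
 */
static bool push_with_retry(struct toy_ring *r, void *va,
			    const volatile bool *force_exit,
			    void (*relax)(void *), void *relax_arg)
{
	do {
		if (!toy_ring_produce(r, va))
			return true;
		relax(relax_arg);	/* stands in for cond_resched() */
	} while (!*force_exit);

	return false;
}

/* Example relax callback: play the consumer and free up one slot. */
static void drain_one(void *arg)
{
	(void)toy_ring_consume(arg);
}

/* Example relax callback: request exit, like setting force_exit. */
static void request_exit(void *arg)
{
	*(bool *)arg = true;
}
```

With drain_one() as the callback, a push into a full ring succeeds on
the second attempt; with request_exit() it gives up and returns false,
which corresponds to the path where the fragment is freed by the pusher.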

