Message-ID: <389876b8-e565-4dc9-bc87-d97a639ff585@huawei.com>
Date: Wed, 11 Dec 2024 20:52:15 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <pabeni@...hat.com>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Shuah Khan
<skhan@...uxfoundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache
(Part-2)
On 2024/12/10 23:58, Alexander Duyck wrote:
>
> I'm not sure perf stat will tell us much as it is really too high
> level to give us much in the way of details. I would be more
> interested in the output from perf record -g followed by a perf
> report, or maybe even just a snapshot from perf top while the test is
> running. That should show us where the CPU is spending most of its
> time and what areas are hot in the before and after graphs.
It seems the bottleneck is on the freeing side: according to 'perf top',
page_frag_free() took up to about 50% of the CPU for the non-aligned API
and 16% for the aligned API on the push CPU.

With the patch below, page_frag_free() disappears from 'perf top' on the
push CPU. The new performance data is below:
Without patch 1:
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):
21.084113 task-clock (msec) # 0.008 CPUs utilized ( +- 1.59% )
7 context-switches # 0.334 K/sec ( +- 1.25% )
1 cpu-migrations # 0.031 K/sec ( +- 20.20% )
78 page-faults # 0.004 M/sec ( +- 0.26% )
54748233 cycles # 2.597 GHz ( +- 1.59% )
61637051 instructions # 1.13 insn per cycle ( +- 0.13% )
14727268 branches # 698.501 M/sec ( +- 0.11% )
20178 branch-misses # 0.14% of all branches ( +- 0.94% )
2.637345524 seconds time elapsed ( +- 0.19% )
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):
19.669259 task-clock (msec) # 0.009 CPUs utilized ( +- 2.91% )
7 context-switches # 0.356 K/sec ( +- 1.04% )
0 cpu-migrations # 0.005 K/sec ( +- 68.82% )
77 page-faults # 0.004 M/sec ( +- 0.27% )
51077447 cycles # 2.597 GHz ( +- 2.91% )
58875368 instructions # 1.15 insn per cycle ( +- 4.47% )
14040015 branches # 713.805 M/sec ( +- 4.68% )
20150 branch-misses # 0.14% of all branches ( +- 0.64% )
2.226539190 seconds time elapsed ( +- 0.12% )
With patch 1:
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):
20.782788 task-clock (msec) # 0.008 CPUs utilized ( +- 0.09% )
7 context-switches # 0.342 K/sec ( +- 0.97% )
1 cpu-migrations # 0.031 K/sec ( +- 16.83% )
78 page-faults # 0.004 M/sec ( +- 0.31% )
53967333 cycles # 2.597 GHz ( +- 0.08% )
61577257 instructions # 1.14 insn per cycle ( +- 0.02% )
14712140 branches # 707.900 M/sec ( +- 0.02% )
20234 branch-misses # 0.14% of all branches ( +- 0.55% )
2.677974457 seconds time elapsed ( +- 0.15% )
root@(none):/home# perf stat -r 20 insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):
20.420537 task-clock (msec) # 0.009 CPUs utilized ( +- 0.05% )
7 context-switches # 0.345 K/sec ( +- 0.71% )
0 cpu-migrations # 0.005 K/sec ( +-100.00% )
77 page-faults # 0.004 M/sec ( +- 0.23% )
53038942 cycles # 2.597 GHz ( +- 0.05% )
59965712 instructions # 1.13 insn per cycle ( +- 0.03% )
14372507 branches # 703.826 M/sec ( +- 0.03% )
20580 branch-misses # 0.14% of all branches ( +- 0.56% )
2.287783171 seconds time elapsed ( +- 0.12% )
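
For reference, the elapsed-time change between the two sets of runs can be
worked out from the numbers above. This is a minimal sketch: pct_change()
and the four constants are names introduced here for illustration only,
and the values are copied from the 'seconds time elapsed' lines. With
patch 1 the elapsed time is about 1.5% higher for the non-aligned API and
about 2.8% higher for the aligned API, against a reported run-to-run
spread of +- 0.12% to +- 0.19%.

```c
#include <assert.h>

/*
 * Relative change, in percent, from a baseline measurement to a new one.
 * Positive means the new value is larger (for elapsed time: slower).
 */
static double pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* 'seconds time elapsed' values copied from the perf stat output above. */
static const double unaligned_before = 2.637345524;	/* without patch 1 */
static const double unaligned_after  = 2.677974457;	/* with patch 1 */
static const double aligned_before   = 2.226539190;	/* without patch 1, test_align=1 */
static const double aligned_after    = 2.287783171;	/* with patch 1, test_align=1 */
```

Calling pct_change(unaligned_before, unaligned_after) and
pct_change(aligned_before, aligned_after) gives the two percentages.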
It seems that the bottleneck is still on the freeing side, so the above
result might not be as meaningful as it should be.

As a single ptr_ring can't be consumed from more than one CPU without
some lock, it seems something more complicated might need to be done in
order to support more than one CPU on the freeing side?
Before patch 1, __page_frag_alloc_align() took up to 3.62% of the CPU
according to 'perf top'.
After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
took up to 4.67% + 1.01% = 5.68%.

Given the similar result, I am not sure whether the CPU usage is able to
tell us about the performance degradation here, as it seems to be quite
large?
@@ -100,13 +100,20 @@ static int page_frag_push_thread(void *arg)
 		if (!va)
 			continue;
 
-		ret = __ptr_ring_produce(ring, va);
-		if (ret) {
+		do {
+			ret = __ptr_ring_produce(ring, va);
+			if (!ret) {
+				va = NULL;
+				break;
+			} else {
+				cond_resched();
+			}
+		} while (!force_exit);
+
+		if (va)
 			page_frag_free(va);
-			cond_resched();
-		} else {
+		else
 			test_pushed++;
-		}
 	}
 
 	pr_info("page_frag push test thread exits on cpu %d\n",
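
The difference between the old and new push behaviour in the hunk above
can be illustrated in userspace. This is only a sketch under stated
assumptions: struct toy_ring, toy_ring_produce(), toy_ring_consume() and
the two push helpers are hypothetical names, not kernel APIs. Draining
one entry stands in for cond_resched() giving the pop thread a chance to
run, and the force_exit check is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_RING_SIZE 4	/* hypothetical capacity, chosen for illustration */

/* Toy stand-in for a pointer ring; not the kernel's ptr_ring. */
struct toy_ring {
	void *slot[TOY_RING_SIZE];
	size_t head;	/* next slot to fill */
	size_t tail;	/* next slot to drain */
	size_t used;	/* number of occupied slots */
};

/* Returns 0 on success, -ENOSPC when the ring is full. */
static int toy_ring_produce(struct toy_ring *r, void *ptr)
{
	assert(ptr != NULL);	/* NULL is reserved to mean "empty" */
	if (r->used == TOY_RING_SIZE)
		return -ENOSPC;
	r->slot[r->head] = ptr;
	r->head = (r->head + 1) % TOY_RING_SIZE;
	r->used++;
	return 0;
}

/* Returns the oldest entry, or NULL when the ring is empty. */
static void *toy_ring_consume(struct toy_ring *r)
{
	void *ptr;

	if (!r->used)
		return NULL;
	ptr = r->slot[r->tail];
	r->tail = (r->tail + 1) % TOY_RING_SIZE;
	r->used--;
	return ptr;
}

/*
 * Old behaviour: on a full ring the entry is dropped, which in the test
 * module means the push side calls page_frag_free() itself.
 * Returns the number of entries actually pushed.
 */
static unsigned int push_drop_on_full(struct toy_ring *r, int *items,
				      unsigned int n)
{
	unsigned int pushed = 0, i;

	for (i = 0; i < n; i++)
		if (!toy_ring_produce(r, &items[i]))
			pushed++;
	return pushed;
}

/*
 * New behaviour: keep retrying until the entry fits, letting the
 * consumer make progress between attempts.
 * Returns the number of entries actually pushed.
 */
static unsigned int push_retry_on_full(struct toy_ring *r, int *items,
				       unsigned int n)
{
	unsigned int pushed = 0, i;

	for (i = 0; i < n; i++) {
		while (toy_ring_produce(r, &items[i]))
			toy_ring_consume(r);	/* stand-in for cond_resched() */
		pushed++;
	}
	return pushed;
}
```

With an empty ring and ten items, the drop variant pushes only as many
entries as the ring can hold, while the retry variant pushes all ten and
never has to free an entry on the push side.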