Message-ID: <CAKgT0Uc5A_mtN_qxR6w5zqDbx87SUdCTFOBxVWCarnryRvhqHA@mail.gmail.com>
Date: Mon, 9 Dec 2024 08:03:51 -0800
From: Alexander Duyck <alexander.duyck@...il.com>
To: Yunsheng Lin <linyunsheng@...wei.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Shuah Khan <skhan@...uxfoundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
On Mon, Dec 9, 2024 at 3:42 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2024/12/9 5:34, Alexander Duyck wrote:
>
> ...
>
> >>
> >> Performance validation for part2:
> >> 1. Using micro-benchmark ko added in patch 1 to test aligned and
> >> non-aligned API performance impact for the existing users, there
> >> seems to be about 20% performance degradation for refactoring
> >> page_frag to support the new API, which seems to nullify most of
> >> the performance gain in [3] of part1.
> >
> > So if I am understanding correctly then this is showing a 20%
> > performance degradation with this patchset. I would argue that it is
> > significant enough that it would be a blocking factor for this patch
> > set. I would suggest bisecting the patch set to identify where the
> > performance degradation has been added and see what we can do to
> > resolve it, and if nothing else document it in that patch so we can
> > identify the root cause for the slowdown.
>
> The only patch in this patchset affecting the performance of the existing
> API seems to be patch 1; applying patch 1 alone shows the same ~20%
> performance degradation as applying the whole patchset:
> mm: page_frag: some minor refactoring before adding new API
>
> And the cause seems to be the binary size increase shown below, as the
> performance degradation didn't seem to change much when I tried inlining
> __page_frag_cache_commit_noref() by moving it to the header file:
>
> ./scripts/bloat-o-meter vmlinux_orig vmlinux
> add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
> Function old new delta
> __page_frag_cache_prepare - 500 +500
> __napi_alloc_frag_align 68 180 +112
> __netdev_alloc_skb 488 596 +108
> napi_alloc_skb 556 624 +68
> __netdev_alloc_frag_align 196 252 +56
> svc_tcp_sendmsg 340 376 +36
> __page_frag_cache_commit_noref - 32 +32
> e843419@...6_0000bd47_30 - 8 +8
> e843419@...9_000044ee_684 8 - -8
> __page_frag_alloc_align 492 - -492
> Total: Before=34719207, After=34719627, chg +0.00%
>
> ./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
> add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
> Function old new delta
> page_frag_push_thread 508 580 +72
> __UNIQUE_ID_vermagic367 67 73 +6
> Total: Before=4582, After=4660, chg +1.70%
Other than code size, have you tried using perf to profile the
benchmark before and after? I suspect that would be telling about
which code changes are most likely causing the issues.
Overall, I don't think the size has increased all that much. I suspect
most of this is due to the fact that you are inlining more of the
functionality.
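Concretely, that before/after comparison could be sketched along the lines
below. The .ko names are taken from your bloat-o-meter run; the perf.data
names and the insmod/rmmod sequencing are assumptions (this needs root and
a perf build matching the running kernel):

```shell
# Sketch only: profile the micro-benchmark module before and after the
# refactor, then diff the two profiles per symbol. Module names follow
# the bloat-o-meter output above; everything else is an assumption.
perf record -a -g -o perf.data.orig -- sh -c \
        'insmod page_frag_test_orig.ko && rmmod page_frag_test_orig'
perf record -a -g -o perf.data.new -- sh -c \
        'insmod page_frag_test.ko && rmmod page_frag_test'
# Per-symbol overhead delta between the two runs:
perf diff perf.data.orig perf.data.new
```

perf diff should point at which of the grown symbols (e.g. the prepare
path) is actually eating the cycles, rather than inferring it from size.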
> Patch 1 refactors the common code from __page_frag_alloc_va_align()
> into __page_frag_cache_prepare() and __page_frag_cache_commit(), so that
> the new API can make use of it as much as possible.
>
> Any better idea for reusing as much common code as possible while
> avoiding this performance degradation?
>
> >
> >> 2. Use the below netcat test case, there seems to be some minor
> >> performance gain for replacing 'page_frag' with 'page_frag_cache'
> >> using the new page_frag API after this patchset.
> >> server: taskset -c 32 nc -l -k 1234 > /dev/null
> >> client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> >
> > This test would barely touch the page pool. The fact is most of the
>
> I am guessing you meant page_frag here?
>
> > overhead for this would likely be things like TCP latency and data
> > copy, much more than the page allocation. As such, fluctuations here
> > are likely not related to your changes.
>
> But it does tell us that the replacement does not seem to cause an
> obvious regression, right?
Not really. The fragment allocator is such a small portion of this
test that we could probably double the cost for it and it would still
be negligible.
> I tried using a smaller MTU to amplify the impact of page allocation;
> it seemed to have a similar result.
Not surprising. However, the networking path is likely only a small part
of this. I suspect a profile would show the same.
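If it helps, that share can be checked directly: sample the system while
the nc transfer from the earlier test is running, then see how many
samples land in the fragment-allocator symbols. A rough sketch (requires
root; the grep pattern is an assumption based on the symbol names in the
bloat-o-meter output earlier in the thread):

```shell
# Sketch only: sample all CPUs for 10 seconds while the nc server and
# client from the test above are running, then check what fraction of
# samples the fragment allocator accounts for.
perf record -a -g -o perf.data.nc -- sleep 10
perf report -i perf.data.nc --stdio --percent-limit 0 | grep -i page_frag
```

If those symbols show up well under 1% of samples, a 20% change in the
allocator itself would be lost in the noise of this test.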