Message-ID: <15d32b22-22b0-64e3-a49e-88d780c24616@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>,
Alexander H Duyck <alexander.duyck@...il.com>,
Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Ratheesh Kannoth <rkannoth@...vell.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, davem@...emloft.net,
edumazet@...gle.com, pabeni@...hat.com,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Yunsheng Lin <linyunsheng@...wei.com>
Subject: Re: [PATCH net-next] page_pool: Clamp ring size to 32K
On 07/08/2023 19.20, Jakub Kicinski wrote:
> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in link)
>>> is creating an excessively large cache of pages. The driver's RX
>>> descriptor ring size should be independent of the PP ptr_ring size, as
>>> the latter is just a cache that grows as a function of the in-flight
>>> packet workload; it functions as a "shock absorber".
>>>
>>> 32768 pages (4KiB each) is approx. 128 MiB, and this will be per RX-queue.
>>>
>>> The RX-desc ring (obviously) pins down these pages (immediately), but the
>>> PP ring starts empty. As the workload varies, the "shock absorber" effect
>>> will let more pages into the system, which will travel via the PP ptr_ring.
>>> As all pages originating from the same PP instance will get recycled,
>>> the in-flight pages in the "system" (PP ptr_ring) will grow over time.
>>>
>>> The PP design has the problem that it never releases or reduces pages
>>> in this shock absorber "closed" system. (Cc. PP people/devel.) We should
>>> consider implementing an MM shrinker callback (include/linux/shrinker.h).
>>>
>>> Are the systems using the octeontx2 driver ready to handle 128MiB of memory
>>> per RX-queue getting pinned down over time? (This could lead to some strange,
>>> hard-to-debug situations if the memory is not sufficient.)
>>
>> I'm with Jesper on this. It doesn't make sense to be tying the
>> page_pool size strictly to the ring size. The amount of recycling you
>> get will depend on how long the packets are on the stack, not in the
>> driver.
>>
Thanks for agreeing with me, and I agree with you :-)
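(Expanding on the MM shrinker idea I floated in the quoted text above: below
is a very rough, untested sketch of what such a callback could look like if
it lived in net/core/page_pool.c. It assumes a new pool->shrinker member and
a hypothetical pp_idle_pages() helper for counting pages sitting in the
ptr_ring; page_pool_return_page() is the existing internal that hands a page
back to the page allocator.)

static unsigned long pp_shrink_count(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	/* Assumes a new 'shrinker' member embedded in struct page_pool */
	struct page_pool *pool = container_of(shrink, struct page_pool,
					      shrinker);

	/* Only pages idling in the ptr_ring are reclaimable;
	 * pp_idle_pages() is a hypothetical counting helper.
	 */
	return pp_idle_pages(pool);
}

static unsigned long pp_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	struct page_pool *pool = container_of(shrink, struct page_pool,
					      shrinker);
	unsigned long freed = 0;
	struct page *page;

	/* Pop pages off the ptr_ring and return them to the page
	 * allocator instead of keeping them for recycling.
	 */
	while (freed < sc->nr_to_scan &&
	       (page = ptr_ring_consume_bh(&pool->ring))) {
		page_pool_return_page(pool, page);
		freed++;
	}

	return freed ? freed : SHRINK_STOP;
}

/* Registration, e.g. from page_pool_init():
 *	pool->shrinker.count_objects = pp_shrink_count;
 *	pool->shrinker.scan_objects  = pp_shrink_scan;
 *	pool->shrinker.seeks         = DEFAULT_SEEKS;
 *	err = register_shrinker(&pool->shrinker, "page_pool");
 */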
>> For example, in the case of something like a software router or bridge
>> that is just taking the Rx packets and routing them to Tx you could
>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>> would likely never need much more than that as the Tx would likely be
>> cleaned about as fast as the Rx can consume the pages.
>>
I agree.
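For that kind of forwarding workload, a driver could size the PP ptr_ring
independently of its RX descriptor ring, e.g. something like the rough
sketch below (the multiplier and field values are illustrative, not a
recommendation):

	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		/* Small cache, e.g. 256 entries, not rx_ring_size */
		.pool_size	= 4 * NAPI_POLL_WEIGHT,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,	/* the NIC's struct device */
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pp = page_pool_create(&pp_params);

	if (IS_ERR(pp))
		return PTR_ERR(pp);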
>> Rather than overriding the size here wouldn't it make more sense to do
>> it in the octeontx2 driver? With that at least you would know that you
>> were the one that limited the size instead of having the value modified
>> out from underneath you.
>>
I don't fully agree here. I don't think we can expect driver
developers to be experts on page_pool cache dynamics. I'm more on
Jakub's side here: perhaps we/net-core can come up with some control
system, even if this means changing things underneath drivers.
>> That said, one change that might help to enable this kind of change
>> would be to add a #define, so that this value wouldn't be so much a
>> magic number and would be visible to the drivers should it ever be
>> changed in the future.
>
> All the points y'all are making are valid; sizing the cache is a hard
> problem. But the proposed solution goes in the wrong direction, IMO.
> The driver doesn't know. I started hacking together page pool control
> over netlink. I think that the pool size selection logic should be in
> the core, with inputs taken from user space / workload (via netlink).
>
> If it wasn't for the fact that I'm working on that API I'd probably
> side with you. And 64k descriptors is impractically large.
>
> Copy / pasting from the discussion on previous version:
>
> Tuning this in the driver relies on the assumption that the HW /
> driver is the thing that matters. I'd think that the workload,
> platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
> least as much. Meanwhile, driver developers will end up tuning to whatever
> servers they have, a random single config, and most likely... iperf.
>
> IMO it's much better to re-purpose "pool_size" and treat it as the ring
> size, because that's what most drivers end up putting there.
I disagree here, as driver developers should not treat "pool_size" as
the ring size. That looks like copy-paste programming without an
understanding of the PP dynamics.
> Defer tuning of the effective ring size to the core and user input
> (via the "it will be added any minute now" netlink API for configuring
> page pools)...
>
I agree here that tuning the ring size is a hard problem, and that it is
better handled in the core. Happy to hear that Jakub is working on
this.
> So capping the recycle ring to 32k instead of returning the error seems
> like an okay solution for now.
As a temporary solution, I'm actually fine with capping at 32k.
Driver developers lose some feedback control, but perhaps that is okay
if we can agree that net-core should control this tuning anyhow.
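Something along the lines of the below sketch in page_pool_init() (from
memory, not a tested patch) would do the clamping:

	if (pool->p.pool_size)
		ring_qsize = pool->p.pool_size;

	/* Cap the ptr_ring instead of rejecting with -E2BIG; drivers
	 * lose the error feedback, as noted above.
	 */
	if (ring_qsize > 32768)
		ring_qsize = 32768;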
--Jesper