Message-ID: <15d32b22-22b0-64e3-a49e-88d780c24616@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>,
Alexander H Duyck <alexander.duyck@...il.com>,
Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Ratheesh Kannoth <rkannoth@...vell.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, davem@...emloft.net,
edumazet@...gle.com, pabeni@...hat.com,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Alexander Lobakin <aleksander.lobakin@...el.com>,
Yunsheng Lin <linyunsheng@...wei.com>
Subject: Re: [PATCH net-next] page_pool: Clamp ring size to 32K
On 07/08/2023 19.20, Jakub Kicinski wrote:
> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in link)
>>> is creating an excessively large cache of pages. The driver's RX
>>> descriptor ring size should be independent of the PP ptr_ring size, as
>>> the latter is just a cache that grows as a function of the in-flight
>>> packet workload; it functions as a "shock absorber".
>>>
>>> 32768 pages (4KiB each) is approx. 128 MiB, and this will be per RX-queue.
>>>
>>> The RX-desc ring (obviously) pins down these pages (immediately), but the
>>> PP ring starts empty. As the workload varies, the "shock absorber" effect
>>> will let more pages into the system, which will travel via the PP ptr_ring.
>>> As all pages originating from the same PP instance will get recycled,
>>> the in-flight pages in the "system" (PP ptr_ring) will grow over time.
>>>
>>> The PP design has the problem that it never releases or reduces pages
>>> in this shock absorber "closed" system. (Cc. PP people/devel.) We should
>>> consider implementing an MM shrinker callback (include/linux/shrinker.h).
>>>
>>> Are the systems using the octeontx2 driver ready to handle 128MiB of memory
>>> per RX-queue getting pinned down over time? (This could lead to some strange,
>>> hard-to-debug situations if the memory is not sufficient.)
>>
>> I'm with Jesper on this. It doesn't make sense to be tying the
>> page_pool size strictly to the ring size. The amount of recycling you
>> get will depend on how long the packets are on the stack, not in the
>> driver.
>>
Thanks for agreeing with me, and I agree with you :-)
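(Expanding on the MM shrinker idea I floated in the quoted text above: below
is a very rough, untested sketch of what such a callback could look like if
it lived in net/core/page_pool.c. It assumes a new pool->shrinker member and
a hypothetical pp_idle_pages() helper for counting pages sitting in the
ptr_ring; page_pool_return_page() is the existing internal that hands a page
back to the page allocator.)

static unsigned long pp_shrink_count(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	/* Assumes a new 'shrinker' member embedded in struct page_pool */
	struct page_pool *pool = container_of(shrink, struct page_pool,
					      shrinker);

	/* Only pages idling in the ptr_ring are reclaimable;
	 * pp_idle_pages() is a hypothetical counting helper.
	 */
	return pp_idle_pages(pool);
}

static unsigned long pp_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	struct page_pool *pool = container_of(shrink, struct page_pool,
					      shrinker);
	unsigned long freed = 0;
	struct page *page;

	/* Pop pages off the ptr_ring and return them to the page
	 * allocator instead of keeping them for recycling.
	 */
	while (freed < sc->nr_to_scan &&
	       (page = ptr_ring_consume_bh(&pool->ring))) {
		page_pool_return_page(pool, page);
		freed++;
	}

	return freed ? freed : SHRINK_STOP;
}

/* Registration, e.g. from page_pool_init():
 *	pool->shrinker.count_objects = pp_shrink_count;
 *	pool->shrinker.scan_objects  = pp_shrink_scan;
 *	pool->shrinker.seeks         = DEFAULT_SEEKS;
 *	err = register_shrinker(&pool->shrinker, "page_pool");
 */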
>> For example, in the case of something like a software router or bridge
>> that is just taking the Rx packets and routing them to Tx you could
>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>> would likely never need much more than that as the Tx would likely be
>> cleaned about as fast as the Rx can consume the pages.
>>
I agree.
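For that kind of forwarding workload, a driver could size the PP ptr_ring
independently of its RX descriptor ring, e.g. something like the rough
sketch below (the multiplier and field values are illustrative, not a
recommendation):

	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		/* Small cache, e.g. 256 entries, not rx_ring_size */
		.pool_size	= 4 * NAPI_POLL_WEIGHT,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,	/* the NIC's struct device */
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pp = page_pool_create(&pp_params);

	if (IS_ERR(pp))
		return PTR_ERR(pp);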
>> Rather than overriding the size here wouldn't it make more sense to do
>> it in the octeontx2 driver? With that at least you would know that you
>> were the one that limited the size instead of having the value modified
>> out from underneath you.
>>
I don't fully agree here. I don't think we can expect driver
developers to be experts on page_pool cache dynamics. I'm more on
Jakub's side here: perhaps we/net-core can come up with some control
system, even if this means changing things underneath drivers.
>> That said, one change that might help to enable this kind of change
>> would be to add a #define, so that this value wouldn't be so much a
>> magic number and would be visible to the drivers should it ever be
>> changed in the future.
>
> All the points y'all are making are valid; sizing the cache is a hard
> problem. But the proposed solution goes in the wrong direction, IMO.
> The driver doesn't know. I started hacking together page pool control
> over netlink. I think that the pool size selection logic should be in
> the core, with inputs taken from user space / workload (via netlink).
>
> If it wasn't for the fact that I'm working on that API I'd probably
> side with you. And 64k descriptors is impractically large.
>
> Copy / pasting from the discussion on previous version:
>
> Tuning this in the driver relies on the assumption that the HW /
> driver is the thing that matters. I'd think that the workload,
> platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
> least as much. Meanwhile, driver developers will end up tuning to whatever
> servers they have, a random single config, and most likely... iperf.
>
> IMO it's much better to re-purpose "pool_size" and treat it as the ring
> size, because that's what most drivers end up putting there.
I disagree here, as driver developers should not treat "pool_size" as
the ring size. That looks like copy-paste programming without an
understanding of the PP dynamics.
> Defer tuning of the effective ring size to the core and user input
> (via the "it will be added any minute now" netlink API for configuring
> page pools)...
>
I agree here that tuning the ring size is a hard problem, and that it is
better handled in the core. Happy to hear that Jakub is working on
this.
> So capping the recycle ring to 32k instead of returning the error seems
> like an okay solution for now.
As a temporary solution, I'm actually fine with capping at 32k.
Driver developers lose some feedback control, but perhaps that is okay
if we can agree that net-core should control this tuning anyhow.
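Something along the lines of the below sketch in page_pool_init() (from
memory, not a tested patch) would do the clamping:

	if (pool->p.pool_size)
		ring_qsize = pool->p.pool_size;

	/* Cap the ptr_ring instead of rejecting with -E2BIG; drivers
	 * lose the error feedback, as noted above.
	 */
	if (ring_qsize > 32768)
		ring_qsize = 32768;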
--Jesper