Message-ID: <15d32b22-22b0-64e3-a49e-88d780c24616@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>,
 Alexander H Duyck <alexander.duyck@...il.com>,
 Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Ratheesh Kannoth <rkannoth@...vell.com>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, davem@...emloft.net, edumazet@...gle.com,
 pabeni@...hat.com, Ilias Apalodimas <ilias.apalodimas@...aro.org>,
 Alexander Lobakin <aleksander.lobakin@...el.com>,
 Yunsheng Lin <linyunsheng@...wei.com>
Subject: Re: [PATCH net-next] page_pool: Clamp ring size to 32K

On 07/08/2023 19.20, Jakub Kicinski wrote:
> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in link)
>>> is creating an excessively large cache of pages. The driver's RX
>>> descriptor ring size should be independent of the PP ptr_ring size, as
>>> the latter is just a cache that grows as a function of the in-flight
>>> packet workload; it functions as a "shock absorber".
>>>
>>> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.
>>>
>>> The RX-desc ring (obviously) pins down these pages (immediately), but
>>> the PP ring starts empty. As the workload varies, the "shock absorber"
>>> effect will let more pages into the system, which will travel the PP
>>> ptr_ring. As all pages originating from the same PP instance will get
>>> recycled, the in-flight pages in the "system" (PP ptr_ring) will grow
>>> over time.
>>>
>>> The PP design has the problem that it never releases or reduces pages
>>> in this shock absorber "closed" system. (Cc. PP people/devel) we should
>>> consider implementing a MM shrinker callback (include/linux/shrinker.h).
>>>
>>> Are the systems using the octeontx2 driver ready to handle 128MiB of
>>> memory per RX-queue getting pinned down over time? (This could lead to
>>> some hard-to-debug situations if the memory is not sufficient.)
>>
>> I'm with Jesper on this. It doesn't make sense to be tying the
>> page_pool size strictly to the ring size. The amount of recycling you
>> get will depend on how long the packets are on the stack, not in the
>> driver.

Thanks for agreeing with me, and I agree with you :-)

>> For example, in the case of something like a software router or bridge
>> that is just taking the Rx packets and routing them to Tx, you could
>> theoretically get away with a multiple of NAPI_POLL_WEIGHT, since you
>> would likely never need much more than that, as the Tx would likely be
>> cleaned about as fast as the Rx can consume the pages.

I agree.

>> Rather than overriding the size here, wouldn't it make more sense to do
>> it in the octeontx2 driver? With that, at least you would know that you
>> were the one that limited the size, instead of having the value modified
>> out from underneath you.

I'm not fully agreeing here. I don't think we can expect driver
developers to be experts on page_pool cache dynamics. I'm more on
Jakub's side here, as perhaps we/net-core can come up with some control
system, even if this means we change this underneath drivers.
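(As a strawman for what such a net-core control could look like -- tying
back to the shrinker idea quoted above -- below is a rough, untested
sketch. It assumes a 'struct shrinker' member gets added to struct
page_pool, and the two pp_*() helpers do not exist today; they are
purely hypothetical, this is not existing PP code.)

#include <linux/shrinker.h>
#include <net/page_pool.h>

static unsigned long pp_shrink_count(struct shrinker *shrinker,
				     struct shrink_control *sc)
{
	struct page_pool *pool = container_of(shrinker, struct page_pool,
					      shrinker);

	/* Hypothetical helper: pages sitting idle in the ptr_ring.
	 * Returning 0 tells the MM there is nothing to reclaim.
	 */
	return pp_idle_page_count(pool);
}

static unsigned long pp_shrink_scan(struct shrinker *shrinker,
				    struct shrink_control *sc)
{
	struct page_pool *pool = container_of(shrinker, struct page_pool,
					      shrinker);

	/* Hypothetical helper: pop up to nr_to_scan idle pages off the
	 * ptr_ring and hand them back to the page allocator; returns
	 * the number actually freed.
	 */
	return pp_release_idle_pages(pool, sc->nr_to_scan);
}

static int pp_register_shrinker(struct page_pool *pool)
{
	pool->shrinker.count_objects = pp_shrink_count;
	pool->shrinker.scan_objects  = pp_shrink_scan;
	pool->shrinker.seeks         = DEFAULT_SEEKS;

	/* Would be called from page_pool_init() and unregistered on
	 * page_pool destroy.
	 */
	return register_shrinker(&pool->shrinker, "page_pool");
}

The hard part, of course, is policy: how aggressively scan_objects may
drain the ring without defeating the recycling that PP exists to
provide. That is exactly the kind of decision the core (or a netlink
knob) should own, not individual drivers.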
>> That said, one change that might help to enable this kind of change
>> would be to look at adding a #define, so that this value wouldn't be so
>> much a magic number and would be visible to the drivers should it ever
>> be changed in the future.
>
> All the points y'all are making are valid, sizing the cache is a hard
> problem. But the proposed solution goes in the wrong direction, IMO.
> The driver doesn't know. I started hacking together page pool control
> over netlink. I think that the pool size selection logic should be in
> the core, with inputs taken from user space / workload (via netlink).
>
> If it wasn't for the fact that I'm working on that API, I'd probably
> side with you. And 64k descriptors is impractically large.
>
> Copy / pasting from the discussion on the previous version:
>
> Tuning this in the driver relies on the assumption that the HW /
> driver is the thing that matters. I'd think that the workload,
> platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
> least as much. While driver developers will end up tuning to whatever
> servers they have, a random single config, and most likely... iperf.
>
> IMO it's much better to re-purpose "pool_size" and treat it as the ring
> size, because that's what most drivers end up putting there.

I disagree here, as driver developers should not treat "pool_size" as
the ring size. It seems to be a copy-paste-programming pattern, adopted
without understanding PP dynamics.

> Defer tuning of the effective ring size to the core and user input
> (via the "it will be added any minute now" netlink API for configuring
> page pools)...

I agree here that tuning the ring size is a hard problem, and this is
better handled in the core. Happy to hear that Jakub is working on this.

> So capping the recycle ring to 32k instead of returning the error seems
> like an okay solution for now.

As a temporary solution, I'm actually fine with capping at 32k. Driver
developers lose some feedback control, but perhaps that is okay, if we
can agree that net-core should control tuning this anyhow.

--Jesper
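PS: For anyone skimming the archive, the driver pattern being discussed
looks roughly like the snippet below. This is a simplified, illustrative
sketch -- the function and variable names on the driver side are made
up, and the clamp in the comment paraphrases the proposed patch rather
than quoting its diff:

#include <linux/numa.h>
#include <net/page_pool.h>

/* Common driver pattern: copy the RX descriptor ring size straight
 * into pool_size when creating the page_pool.
 */
static struct page_pool *create_rx_page_pool(struct device *dev,
					      unsigned int rx_ring_size)
{
	struct page_pool_params pp_params = {
		.order		= 0,
		.pool_size	= rx_ring_size,	/* ring size reused as cache size */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp_params);
}

/* page_pool_init() currently errors out when pool_size exceeds 32768;
 * the patch under discussion would clamp instead, conceptually:
 *
 *	ring_qsize = min(ring_qsize, 32 * 1024);
 */

Either way, the longer-term fix is the one discussed above: stop copying
the RX ring size into pool_size in drivers and let the core (plus the
netlink API Jakub mentions) pick the effective ring size; the clamp only
bounds the worst case at 32k entries.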