Message-ID: <58a592bf-3e88-4ad4-8a6e-37dd9319da99@gmail.com>
Date: Fri, 1 Aug 2025 10:48:57 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Stanislav Fomichev <stfomichev@...il.com>,
Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
io-uring@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
Willem de Bruijn <willemb@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
andrew+netdev@...n.ch, horms@...nel.org, davem@...emloft.net,
sdf@...ichev.me, dw@...idwei.uk, michael.chan@...adcom.com,
dtatulea@...dia.com, ap420073@...il.com
Subject: Re: [RFC v1 00/22] Large rx buffer support for zcrx
On 7/31/25 21:05, Mina Almasry wrote:
> On Thu, Jul 31, 2025 at 12:56 PM Pavel Begunkov <asml.silence@...il.com> wrote:
>>
...
>>>>>> If the setup is done outside, you can also set up rx-buf-len outside, no?
>>>>>
>>>>> You can't do it without assuming the memory layout, and that's
>>>>> the application's role to allocate buffers. Not to mention that
>>>>> often the app won't know about all specifics either and it'd be
>>>>> resolved on zcrx registration.
>>>>
>>>> I think, fundamentally, we need to distinguish:
>>>>
>>>> 1. chunk size of the memory pool (page pool order, niov size)
>>>> 2. chunk size of the rx queue entries (this is what this series calls
>>>> rx-buf-len), mostly influenced by MTU?
>>>>
>>>> For devmem (and same for iou?), we want an option to derive (2) from (1):
>>>> page pools with larger chunks need to generate larger rx entries.
>>>
>>> To be honest I'm not following. #1 and #2 seem the same to me.
>>> rx-buf-len is just the size of each rx buffer posted to the NIC.
>>>
>>> With pp_params.order = 0 (most common configuration today), rx-buf-len
>>> == 4K. Regardless of MTU. With pp_params.order=1, I'm guessing 8K
>>> then, again regardless of MTU.
>>
>> There are drivers that fragment the buffer they get from a page
>> pool and give smaller chunks to the hw. It's surely a good idea to
>> be more explicit on what's what, but from the whole setup and uapi
>> perspective I'm not too concerned.
>>
>> The parameter the user passes to zcrx must control 1. As for 2.,
>> I'd expect the driver to use the passed size directly or fail
>> validation, but even if that's not the case, zcrx / devmem would
>> just continue to work without any change in uapi, so we have
>> the freedom to patch up the nuances later on if anything sticks
>> out.
>>
>
> I indeed forgot about driver-fragmenting. That does complicate things
> quite a bit.
>
> So AFAIU the intended behavior is that rx-buf-len refers to the memory
> size allocated by the driver (and thus the memory provider), but not
> necessarily the one posted by the driver if it's fragmenting that
> piece of memory? If so, that sounds good to me. Although I wonder if
Yep
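
Roughly, and only as an illustrative sketch (how the size gets
plumbed through is made up here, it's not the actual zcrx uapi),
the configured size drives the chunk size of the pool backing the
queue, e.g. via the page pool order:

	/* sketch: derive the pool chunk size (1.) from the
	 * configured rx-buf-len; names are illustrative
	 */
	struct page_pool_params pp = {
		.order		= get_order(rx_buf_len),
		.pool_size	= ring_size,
		.dev		= dev, /* the netdev's struct device */
		.dma_dir	= DMA_FROM_DEVICE,
	};
	pool = page_pool_create(&pp);

What the driver then posts to the hw (2.) may still be a fragment
of each such chunk.
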
> that could cause some unexpected behavior... Someone may configure
> rx-buf-len to 8K on one driver and get actual 8K packets, but then
> configure rx-buf-len on another driver and get 4K packets because the
> driver fragmented each buffer into 2...
That can already happen: the user can hope to get whole, full buffers
but shouldn't assume they always will. HW GRO can't be 100% reliable
in this sense under all circumstances. And I don't think it's sane
for driver implementations to do that anyway. Fragmenting a PAGE_SIZE
buffer because the NIC needs smaller chunks, or for some other
compatibility reason? Sure, but then I don't see a reason for such a
driver to validate even larger buffers.
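
For reference, that kind of fragmenting is what the page pool frag
API already provides; a simplified sketch, loosely modelled on how
drivers use it (variable names are made up):

	unsigned int offset;
	struct page *page;
	dma_addr_t dma;

	/* carve a smaller rx buffer out of a (possibly larger)
	 * pool chunk
	 */
	page = page_pool_alloc_frag(pool, &offset, rx_frag_size,
				    GFP_ATOMIC);
	if (!page)
		return -ENOMEM;
	dma = page_pool_get_dma_addr(page) + offset;
	/* post dma + rx_frag_size to the hw descriptor ring */
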
> I guess in the future there may be a knob that controls how much
> fragmentation the driver does?
Probably, but hopefully it won't be needed.
--
Pavel Begunkov