Message-ID: <20251016184031.66c92962@kernel.org>
Date: Thu, 16 Oct 2025 18:40:31 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Pavel Begunkov <asml.silence@...il.com>, netdev@...r.kernel.org, Andrew
Lunn <andrew@...n.ch>, davem@...emloft.net, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Simon Horman
<horms@...nel.org>, Donald Hunter <donald.hunter@...il.com>, Michael Chan
<michael.chan@...adcom.com>, Pavan Chebbi <pavan.chebbi@...adcom.com>,
Jesper Dangaard Brouer <hawk@...nel.org>, John Fastabend
<john.fastabend@...il.com>, Stanislav Fomichev <sdf@...ichev.me>, Joshua
Washington <joshwash@...gle.com>, Harshitha Ramamurthy
<hramamurthy@...gle.com>, Jian Shen <shenjian15@...wei.com>, Salil Mehta
<salil.mehta@...wei.com>, Jijie Shao <shaojijie@...wei.com>, Sunil Goutham
<sgoutham@...vell.com>, Geetha sowjanya <gakula@...vell.com>, Subbaraya
Sundeep <sbhatta@...vell.com>, hariprasad <hkelam@...vell.com>, Bharat
Bhushan <bbhushan2@...vell.com>, Saeed Mahameed <saeedm@...dia.com>, Tariq
Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>, Alexander Duyck
<alexanderduyck@...com>, kernel-team@...a.com, Ilias Apalodimas
<ilias.apalodimas@...aro.org>, Joe Damato <joe@...a.to>, David Wei
<dw@...idwei.uk>, Willem de Bruijn <willemb@...gle.com>, Breno Leitao
<leitao@...ian.org>, Dragos Tatulea <dtatulea@...dia.com>,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, Jonathan Corbet
<corbet@....net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large
buffer providers
On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> I think what you're saying is what I was trying to say, but you said
> it more eloquently and generically correct. I'm not familiar with the
> GRO packing you're referring to, so I just assumed the 'buffer sizes
> actually posted to the NIC' are the 'buffer sizes we end up seeing in
> the skb frags'.
I don't think that code path exists today; the buffers posted are the
frags in the skb. But that's easily fixable.
> I guess what I'm trying to say in a different way, is: there are lots
> of buffer sizes in the rx path, AFAICT, at least:
>
> 1. The size of the allocated netmems from the pp.
> 2. The size of the buffers posted to the NIC (which will be different
> from #1 if the driver uses page_pool_fragment_netmem or some other
> trick like hns3's).
> 3. The size of the frags that end up in the skb (which will be
> different from #2 for GRO/other things I don't fully understand).
>
> ...and I'm not sure what rx-buf-len should actually configure. My
> thinking is that it probably should configure #3, since that is what
> the user cares about; I agree with that.
>
> IIRC when I last looked at this a few weeks ago, I think as written
> this patch series makes rx-buf-len actually configure #1.
#1 or #2. #1 for otx2. For the RFC bnxt implementation they were
equivalent. But hns3's reading would be that it's #2.
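To make the #1 vs #2 distinction concrete - roughly what an hns3-style
trick looks like with the page pool frag API. Sketch only; my_rxq and
my_post_buffer are made up:

  /* Sketch: #1 != #2 via page_pool fragmentation.
   * #1 = size of the netmem the pool allocates (order-2, 16kB here),
   * #2 = size of each buffer actually posted to the NIC (4kB here).
   */
  static int my_post_rx_buffers(struct my_rxq *rxq)
  {
        struct page_pool *pp = rxq->page_pool; /* created w/ .order = 2 */
        netmem_ref netmem;
        int i;

        netmem = page_pool_alloc_netmems(pp, GFP_ATOMIC);   /* size #1 */
        if (!netmem)
                return -ENOMEM;

        /* one pp ref per 4kB slice so they can be released separately */
        page_pool_fragment_netmem(netmem, 4);

        for (i = 0; i < 4; i++)                             /* size #2 */
                my_post_buffer(rxq, netmem, i * SZ_4K, SZ_4K);
        return 0;
  }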
From the user's PoV neither #1 nor #2 is particularly meaningful.
Assuming the driver can fragment, #1 only configures the memory
accounting blocks. #2 configures the buffers passed to the HW, but some
HW can pack payloads into a single buf to save memory. Which means that
if the previous frame was small and ate some of a page, a subsequent
large frame of size M may not fit into a single buf of size X, even if
M < X.
So I think the full set of parameters we should define would include
what you defined as #1 and #2. And on top of that we need some kind of
min alignment enforcement. David Wei mentioned that one of his main use
cases is ZC of a buffer which is then sent to storage, which has strict
alignment requirements. And some NICs will internally fragment the
page.. Maybe let's define the expected device behavior..
Device models
=============
Assume we receive two 5kB packets; "x" means bytes from the first
packet, "y" means bytes from the second packet.
A. Basic-scatter
----------------
A packet uses one or more buffers, so there's a 1:n mapping between
packets and buffers.
                  unused space
                   v
  1kB  [xx] [xx] [x ] [yy] [yy] [y ]
  16kB [xxxxx           ] [yyyyy           ]
B. Multi-packet
---------------
The configurations above are still possible, but we can configure
the device to place multiple packets in a large page:
                  unused space
                   v
  16kB, 2kB  [xxxxx |yyyyy |...] [..................]
                    ^
              alignment / stride
We can probably assume that this model always comes with alignment,
because DMA'ing frames at odd offsets is a bad idea. Also note that
packets smaller than the alignment can get scattered across multiple
bufs.
C. Multi-packet HW-GRO
----------------------
For completeness, I guess. We need a third packet here. Assume the
x-packet and z-packet are from the same flow and GRO session, and the
y-packet is not. (Good?) HW-GRO gives us out-of-order placement, and
hopefully in this case we do want to pack:
  16kB, 2kB  [xxxxxzzzzz |.......] [yyyyy.............]
                         ^
                 alignment / stride
End of sidebar. I think / hope these are all practical buffer layouts
we need to care about.
What does the user care about? Presumably three things:
a) efficiency of memory use (larger pages == more chance of low fill)
b) max size of a buffer (larger buffer = fewer iovecs to pass around)
c) alignment
I don't think we can make these map 1:1 to any of the knobs we discussed
at the start. (b) is really neither #1 (if the driver fragments) nor #2
(if SW GRO can glue buffers back together).
We could simply let the user control #1 - basically the user's setting
overrides the places where the driver would previously use PAGE_SIZE.
I think this is what Stan suggested long ago as well.
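In the driver that'd be little more than (sketch; qcfg->page_order
standing in for wherever the new uAPI value gets plumbed to):

  /* Sketch: user-set "page order" replacing the hardcoded order 0
   * (i.e. PAGE_SIZE) the driver passes to the page pool today.
   */
  struct page_pool_params pp_params = {
        .order          = qcfg->page_order,     /* was 0 == PAGE_SIZE */
        .pool_size      = ring_size,
        .nid            = NUMA_NO_NODE,
        .dev            = &pdev->dev,
        .dma_dir        = DMA_FROM_DEVICE,
  };
  struct page_pool *pool = page_pool_create(&pp_params);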
But I wonder if the user still needs to know #2 (rx-buf-len), because
practically speaking, setting the page size to >4x rx-buf-len likely
means a lot more fragmentation for little extra aggregation..?
Though, admittedly, I think the user only needs to know max-rx-buf-len,
not necessarily set it.
The last knob is alignment / reuse. To allow multiple packets in
one buffer we probably need to distinguish these cases, to cater to
sufficiently clever adapters:
- previous and next packets are from the same flow and
  - within one GRO session
  - previous had PSH set (or closed the GRO for another reason,
    this is to allow realigning the buffer on GRO session close)
  or
  - the device doesn't know further distinctions / HW-GRO
- previous and next are from different flows
And the actions (for each case separately) are one of:
- no reuse allowed (release buffer = -1?)
- reuse but must align (align to = N)
- reuse, don't align (pack = 0)
So to restate, do we need:
- "page order" control
- max-rx-buf-len
- 4 alignment knobs?
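FWIW, as a strawman, all of the above expressed as one config struct
(entirely hypothetical, not proposed uAPI):

  /* Strawman only. */
  struct rx_buf_cfg {
        u32 page_order;       /* "page order" - pp allocation size (#1) */
        u32 max_rx_buf_len;   /* HW buffer limit (#2), mostly read-only */

        /* alignment / reuse policy, one knob per case from the list
         * above: -1 = no reuse, 0 = pack w/o aligning, N = align to N
         */
        s32 reuse_same_flow_in_gro;
        s32 reuse_same_flow_gro_closed;
        s32 reuse_same_flow_no_info;
        s32 reuse_diff_flow;
  };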
Corner cases
============
I. Non-power of 2 buffer sizes
------------------------------
Looks like multiple devices are limited by the width of their length
fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
Should we allow applications to configure the buffer to
  power of 2 - alignment
? It will probably annoy the page pool code a bit. I guess for now
we should just make sure that the uAPI doesn't bake in the idea that
buffers are always a power of 2.
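The driver-side check for such a device could look like (sketch; the
15-bit length field and all names are invented):

  #define MY_HW_RX_BUF_LEN_MAX  ((32 * 1024) - 1) /* 15-bit len field */

  static int my_validate_rx_buf_len(u32 len, u32 align)
  {
        if (len > MY_HW_RX_BUF_LEN_MAX)
                return -ERANGE;
        /* allow "power of 2 - alignment" sizes */
        if (!is_power_of_2(len + align))
                return -EINVAL;
        return 0;
  }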
II. Fractional page sizes
-------------------------
If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
should we support chunking devmem/io_uring buffers into less than
PAGE_SIZE?
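If we did, the carving itself would be trivial (sketch, with
max_rx_buf_len as queried from the device):

  /* Sketch: number of HW-sized chunks per 64kB page. */
  u32 chunk_size = min_t(u32, PAGE_SIZE, max_rx_buf_len); /* 16k */
  u32 chunks_per_page = PAGE_SIZE / chunk_size;           /* 4 */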