Message-ID: <avwfxfpogp7u7ef5wqrfkqsgvzmnytxblwul7e53eaje3zyqyc@7wvlrocyre6j>
Date: Wed, 22 Oct 2025 13:17:43 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>,
Mina Almasry <almasrymina@...gle.com>
Cc: Pavel Begunkov <asml.silence@...il.com>, netdev@...r.kernel.org,
Andrew Lunn <andrew@...n.ch>, davem@...emloft.net, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Donald Hunter <donald.hunter@...il.com>, Michael Chan <michael.chan@...adcom.com>,
Pavan Chebbi <pavan.chebbi@...adcom.com>, Jesper Dangaard Brouer <hawk@...nel.org>,
John Fastabend <john.fastabend@...il.com>, Stanislav Fomichev <sdf@...ichev.me>,
Joshua Washington <joshwash@...gle.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>,
Jian Shen <shenjian15@...wei.com>, Salil Mehta <salil.mehta@...wei.com>,
Jijie Shao <shaojijie@...wei.com>, Sunil Goutham <sgoutham@...vell.com>,
Geetha sowjanya <gakula@...vell.com>, Subbaraya Sundeep <sbhatta@...vell.com>,
hariprasad <hkelam@...vell.com>, Bharat Bhushan <bbhushan2@...vell.com>,
Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>,
Alexander Duyck <alexanderduyck@...com>, kernel-team@...a.com,
Ilias Apalodimas <ilias.apalodimas@...aro.org>, Joe Damato <joe@...a.to>, David Wei <dw@...idwei.uk>,
Willem de Bruijn <willemb@...gle.com>, Breno Leitao <leitao@...ian.org>, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, Jonathan Corbet <corbet@....net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large
buffer providers
Sorry for the late reply; I didn't see the discussion here.
On Thu, Oct 16, 2025 at 06:40:31PM -0700, Jakub Kicinski wrote:
> On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> > I think what you're saying is what I was trying to say, but you said
> > it more eloquently and more accurately. I'm not familiar with the GRO
> > packing you're referring to, so I just assumed the 'buffer sizes
> > actually posted to the NIC' are the 'buffer sizes we end up seeing in
> > the skb frags'.
>
> I don't think that code path exists today; the buffers posted are the
> frags in the skb. But that's easily fixable.
>
> > I guess what I'm trying to say in a different way, is: there are lots
> > of buffer sizes in the rx path, AFAICT, at least:
> >
> > 1. The size of the allocated netmems from the pp.
> > 2. The size of the buffers posted to the NIC (which will be different
> > from #1 if the driver uses page_pool_fragment_netmem or some other
> > trick like hns3 does).
> > 3. The size of the frags that end up in the skb (which will be
> > different from #2 for GRO/other things I don't fully understand).
> >
> > ...and I'm not sure what rx-buf-len should actually configure. My
> > thinking is that it probably should configure #3, since that is what
> > the user cares about; I agree with that.
> >
> > IIRC when I last looked at this a few weeks ago, I think as written
> > this patch series makes rx-buf-len actually configure #1.
>
> #1 or #2. #1 for otx2. For the RFC bnxt implementation they were
> equivalent. But hns3's reading would be that it's #2.
>
> From the user's PoV neither #1 nor #2 is particularly meaningful.
> Assuming the driver can fragment, #1 only configures memory accounting
> blocks. #2 configures the buffers passed to the HW, but some HW can pack
> payloads into a single buf to save memory. Which means that if the
> previous frame was small and ate some of a page, a subsequent large
> frame of size M may not fit into a single buf of size X, even if M < X.
>
> So I think the full set of parameters we should define would be
> what you defined as #1 and #2. And on top of that we need some kind of
> minimum alignment enforcement. David Wei mentioned that one of his main
> use cases is ZC of a buffer which is then sent to storage, which has
> strict alignment requirements. And some NICs will internally fragment
> the page... Maybe let's define the expected device behavior...
>
> Device models
> =============
> Assume we receive 2 5kB packets, "x" means bytes from first packet,
> "y" means bytes from the second packet.
>
> A. Basic-scatter
> ----------------
> Packet uses one or more buffers, so 1:n mapping between packets and
> buffers.
>            unused space
>                  v
> 1kB  [xx] [xx] [x ] [yy] [yy] [y ]
> 16kB [xxxxx           ] [yyyyy           ]
>
> B. Multi-packet
> ---------------
> The configurations above are still possible, but we can configure
> the device to place multiple packets in a large page:
>
>            unused space
>                 v
> 16kB, 2kB [xxxxx |yyyyy |...] [..................]
>                  ^
>           alignment / stride
>
> We can probably assume that this model always comes with alignment,
> because DMA'ing frames at odd offsets is a bad idea. Also note
> that packets smaller than the alignment can get scattered across
> multiple bufs.
>
> C. Multi-packet HW-GRO
> ----------------------
> For completeness, I guess. We need a third packet here. Assume x-packet
> and z-packet are from the same flow and GRO session, y-packet is not.
> (Good?) HW-GRO gives us out of order placement and hopefully in this
> case we do want to pack:
>
> 16kB, 2kB [xxxxxzzzzz |.......] [xxxxx.............]
>                       ^
>               alignment / stride
>
Is the second "xxxxx" above (in the second 16kB buffer) supposed to be
"yyyyy"?
Not sure I understand this last representation: if x and z are 5kB
packets each and the stride size is 2kB, they should occupy 5 strides:
16kB, 2kB [xx|xx|xz|zz|zz|.......] [yy|yy|y |............]
I think I understand the point, just making sure that I got it straight.
Did I?
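To make my reading concrete, a back-of-the-envelope calculation (the
macro is the kernel's DIV_ROUND_UP; the packing assumption is mine):

	#define STRIDE_SZ	2048	/* 2kB stride */
	#define PKT_SZ		5120	/* 5kB packet */

	unsigned int strides_aligned, strides_packed;

	/* each packet starts on a fresh stride boundary */
	strides_aligned = 2 * DIV_ROUND_UP(PKT_SZ, STRIDE_SZ);	/* 2 * 3 = 6 */

	/* x and z packed back to back, no realignment in between */
	strides_packed = DIV_ROUND_UP(2 * PKT_SZ, STRIDE_SZ);	/* 10kB -> 5 */

i.e. the "5 strides" in my diagram assume the two packets are packed
with no realignment between them.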
>
> End of sidebar. I think / hope these are all practical buffer layouts
> we need to care about.
>
>
> What does the user care about? Presumably three things:
> a) efficiency of memory use (larger pages == more chance of low fill)
> b) max size of a buffer (larger buffer = fewer iovecs to pass around)
> c) alignment
>
> I don't think we can make these map 1:1 to any of the knobs we discussed
> at the start. (b) is really neither #1 (if driver fragments) nor #2 (if
> SW GRO can glue back together).
>
> We could simply let the user control #1 - basically user control
> overrides the places where driver would previously use PAGE_SIZE.
> I think this is what Stan suggested long ago as well.
>
> But I wonder if the user still needs to know #2 (rx-buf-len) because,
> practically speaking, setting the page size to >4x the size of rx-buf-len
> is likely a lot more fragmentation for little extra aggregation... ?
So how would rx-buf-len be configured then? Who gets to decide, if not the
user: the driver or the kernel?

I don't understand what you mean by "setting page size >4x the size of
rx-buf-len". I thought it was the other way around: rx-buf-len is a
(power-of-2) multiple of the page size. Or am I stuck in the mindset of
the old proposal?
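To spell out how I currently picture the two knobs relating (all names
below are illustrative, except page_pool_dev_alloc_frag() and
page_pool_get_dma_addr(), which are the existing page pool helpers):

	struct page *page;
	unsigned int offset, chunk_sz;
	dma_addr_t dma;

	/* #1: size of the chunk the page pool hands out */
	chunk_sz = PAGE_SIZE << pp_order;	/* hypothetical "page order" knob */

	/* #2: buffer actually posted to the HW, carved out of that chunk */
	page = page_pool_dev_alloc_frag(pool, &offset, rx_buf_len);
	dma  = page_pool_get_dma_addr(page) + offset;
	/* post (dma, rx_buf_len) to the RX ring */

With that picture, I'd read the ">4x" remark as: once chunk_sz grows
beyond ~4 * rx_buf_len, the bigger chunk mostly adds fragmentation
without buying more aggregation. Is that the intended reading?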
> Though, admittedly, I think the user only needs to know max-rx-buf-len,
> not necessarily set it.
>
> The last knob is alignment / reuse. For allowing multiple packets in
> one buffer we probably need to distinguish these cases to cater to
> sufficiently clever adapters:
> - previous and next packets are from the same flow and
> - within one GRO session
> - previous had PSH set (or closed the GRO for another reason,
> this is to allow realigning the buffer on GRO session close)
> or
> - the device doesn't know further distinctions / HW-GRO
> - previous and next are from different flows
> And the actions (for each case separately) are one of:
> - no reuse allowed (release buffer = -1?)
> - reuse but must align (align to = N)
> - reuse don't align (pack = 0)
>
I am assuming that different HW will support a subset of these
actions and/or they will apply differently in each case (hence the 4
knobs?).
For example, in mlx5 the actions would work only for the second case
(at the end of a GRO session).
> So to restate, do we need:
> - "page order" control
> - max-rx-buf-len
> - 4 alignment knobs?
>
We do need at least 1 alignment knob.
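To make sure I read the 4 cases x 3 actions the same way, here is a
hypothetical encoding (nothing like this exists in the uAPI today):

	enum rx_buf_reuse_action {
		RX_BUF_RELEASE	= -1,	/* no reuse allowed */
		RX_BUF_PACK	= 0,	/* reuse, don't align */
		/* N > 0: reuse, but align the next placement to N bytes */
	};

	struct rx_buf_reuse_cfg {
		int same_flow_in_gro;		/* within one GRO session */
		int same_flow_gro_closed;	/* previous had PSH / GRO closed */
		int same_flow_unknown;		/* device can't tell / no HW-GRO */
		int other_flow;			/* previous and next from different flows */
	};

In the mlx5 case I mentioned above, only same_flow_gro_closed would be
actionable.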
> Corner cases
> ============
> I. Non-power of 2 buffer sizes
> ------------------------------
> Looks like multiple devices are limited by the width of their length
> fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
> Should we allow applications to configure the buffer to
>
> power of 2 - alignment
>
> ? It will probably annoy the page pool code a bit. I guess for now
> we should just make sure that uAPI doesn't bake in the idea that
> buffers are always a power of 2.
What if the hardware uses a log2 scheme to represent the buffer
length? Then it would still need to round the length down to a power of 2?
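For example (ilog2() is the existing kernel helper; the numbers are just
an illustration of the concern):

	unsigned int requested = 32 * 1024 - 64;	 /* "power of 2 - alignment" */
	unsigned int effective = 1U << ilog2(requested); /* rounds down to 16kB */

i.e. on such HW, asking for just under a power of 2 would halve the
usable buffer size.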
>
> II. Fractional page sizes
> -------------------------
> If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
> should we support chunking devmem/io_uring memory into less than a PAGE_SIZE?
Thanks,
Dragos