Message-ID: <avwfxfpogp7u7ef5wqrfkqsgvzmnytxblwul7e53eaje3zyqyc@7wvlrocyre6j>
Date: Wed, 22 Oct 2025 13:17:43 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>,
Mina Almasry <almasrymina@...gle.com>
Cc: Pavel Begunkov <asml.silence@...il.com>, netdev@...r.kernel.org,
Andrew Lunn <andrew@...n.ch>, davem@...emloft.net, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Donald Hunter <donald.hunter@...il.com>, Michael Chan <michael.chan@...adcom.com>,
Pavan Chebbi <pavan.chebbi@...adcom.com>, Jesper Dangaard Brouer <hawk@...nel.org>,
John Fastabend <john.fastabend@...il.com>, Stanislav Fomichev <sdf@...ichev.me>,
Joshua Washington <joshwash@...gle.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>,
Jian Shen <shenjian15@...wei.com>, Salil Mehta <salil.mehta@...wei.com>,
Jijie Shao <shaojijie@...wei.com>, Sunil Goutham <sgoutham@...vell.com>,
Geetha sowjanya <gakula@...vell.com>, Subbaraya Sundeep <sbhatta@...vell.com>,
hariprasad <hkelam@...vell.com>, Bharat Bhushan <bbhushan2@...vell.com>,
Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>,
Alexander Duyck <alexanderduyck@...com>, kernel-team@...a.com,
Ilias Apalodimas <ilias.apalodimas@...aro.org>, Joe Damato <joe@...a.to>, David Wei <dw@...idwei.uk>,
Willem de Bruijn <willemb@...gle.com>, Breno Leitao <leitao@...ian.org>, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, Jonathan Corbet <corbet@....net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large
buffer providers
Sorry for the late reply; I didn't see the discussion here.
On Thu, Oct 16, 2025 at 06:40:31PM -0700, Jakub Kicinski wrote:
> On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> > I think what you're saying is what I was trying to say, but you said
> > it more eloquently and more accurately. I'm not familiar with the GRO
> > packing you're referring to, so I just assumed the 'buffer sizes
> > actually posted to the NIC' are the 'buffer sizes we end up seeing in
> > the skb frags'.
>
> I don't think that code path exists today; the buffers posted are the
> frags in the skb. But that's easily fixable.
>
> > I guess what I'm trying to say in a different way, is: there are lots
> > of buffer sizes in the rx path, AFAICT, at least:
> >
> > 1. The size of the allocated netmems from the pp.
> > 2. The size of the buffers posted to the NIC (which will be different
> > from #1 if the driver uses page_pool_fragment_netmem or some other
> > trick like hns3 does).
> > 3. The size of the frags that end up in the skb (which will be
> > different from #2 for GRO/other things I don't fully understand).
> >
> > ...and I'm not sure what rx-buf-len should actually configure. My
> > thinking is that it probably should configure #3, since that is what
> > the user cares about; I agree with that.
> >
> > IIRC when I last looked at this a few weeks ago, I think as written
> > this patch series makes rx-buf-len actually configure #1.
>
> #1 or #2. #1 for otx2. For the RFC bnxt implementation they were
> equivalent. But hns3's reading would be that it's #2.
>
> From the user's PoV neither #1 nor #2 is particularly meaningful.
> Assuming the driver can fragment, #1 only configures memory accounting
> blocks. #2 configures the buffers passed to the HW, but some HW can pack
> payloads into a single buf to save memory. Which means that if the
> previous frame was small and ate some of a page, a subsequent large
> frame of size M may not fit into a single buf of size X, even if M < X.
>
> So I think the full set of parameters we should define would be
> what you defined as #1 and #2. And on top of that we need some kind of
> minimum alignment enforcement. David Wei mentioned that one of his main
> use cases is ZC of a buffer which is then sent to storage, which has
> strict alignment requirements. And some NICs will internally fragment
> the page... Maybe let's define the expected device behavior...
>
> Device models
> =============
> Assume we receive 2 5kB packets, "x" means bytes from first packet,
> "y" means bytes from the second packet.
>
> A. Basic-scatter
> ----------------
> Packet uses one or more buffers, so 1:n mapping between packets and
> buffers.
>            unused space
>                  v
> 1kB  [xx] [xx] [x ] [yy] [yy] [y ]
> 16kB [xxxxx           ] [yyyyy           ]
>
> B. Multi-packet
> ---------------
> The configurations above are still possible, but we can configure
> the device to place multiple packets in a large page:
>
>            unused space
>                 v
> 16kB, 2kB [xxxxx |yyyyy |...] [..................]
>                  ^
>           alignment / stride
>
> We can probably assume that this model always comes with alignment,
> because DMA'ing frames at odd offsets is a bad idea. Also note
> that packets smaller than the alignment can get scattered across
> multiple bufs.
>
> C. Multi-packet HW-GRO
> ----------------------
> For completeness, I guess. We need a third packet here. Assume x-packet
> and z-packet are from the same flow and GRO session, y-packet is not.
> (Good?) HW-GRO gives us out of order placement and hopefully in this
> case we do want to pack:
>
> 16kB, 2kB [xxxxxzzzzz |.......] [xxxxx.............]
>                       ^
>               alignment / stride
>
Is the second "xxxxx" above (in the second 16kB buffer) supposed to be
"yyyyy"?
Not sure I understand this last representation: if x and z are 5kB
packets each and the stride size is 2kB, they should occupy 5 strides:
16kB, 2kB [xx|xx|xz|zz|zz|.......] [yy|yy|y |............]
I think I understand the point, just making sure that I got it straight.
Did I?
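To make my reading concrete, a back-of-the-envelope calculation (the
macro is the kernel's DIV_ROUND_UP; the packing assumption is mine):

	#define STRIDE_SZ	2048	/* 2kB stride */
	#define PKT_SZ		5120	/* 5kB packet */

	unsigned int strides_aligned, strides_packed;

	/* each packet starts on a fresh stride boundary */
	strides_aligned = 2 * DIV_ROUND_UP(PKT_SZ, STRIDE_SZ);	/* 2 * 3 = 6 */

	/* x and z packed back to back, no realignment in between */
	strides_packed = DIV_ROUND_UP(2 * PKT_SZ, STRIDE_SZ);	/* 10kB -> 5 */

i.e. the "5 strides" in my diagram assume the two packets are packed
with no realignment between them.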
>
> End of sidebar. I think / hope these are all practical buffer layouts
> we need to care about.
>
>
> What does the user care about? Presumably three things:
> a) efficiency of memory use (larger pages == more chance of low fill)
> b) max size of a buffer (larger buffer = fewer iovecs to pass around)
> c) alignment
>
> I don't think we can make these map 1:1 to any of the knobs we discussed
> at the start. (b) is really neither #1 (if driver fragments) nor #2 (if
> SW GRO can glue back together).
>
> We could simply let the user control #1 - basically user control
> overrides the places where driver would previously use PAGE_SIZE.
> I think this is what Stan suggested long ago as well.
>
> But I wonder if the user still needs to know #2 (rx-buf-len) because,
> practically speaking, setting the page size to >4x the size of rx-buf-len
> is likely a lot more fragmentation for little extra aggregation... ?
So how would rx-buf-len be configured then? Who gets to decide, if not the
user: the driver or the kernel?

I don't understand what you mean by "setting page size >4x the size of
rx-buf-len". I thought it was the other way around: rx-buf-len is a
(power-of-2) multiple of the page size. Or am I stuck in the mindset of
the old proposal?
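To spell out how I currently picture the two knobs relating (all names
below are illustrative, except page_pool_dev_alloc_frag() and
page_pool_get_dma_addr(), which are the existing page pool helpers):

	struct page *page;
	unsigned int offset, chunk_sz;
	dma_addr_t dma;

	/* #1: size of the chunk the page pool hands out */
	chunk_sz = PAGE_SIZE << pp_order;	/* hypothetical "page order" knob */

	/* #2: buffer actually posted to the HW, carved out of that chunk */
	page = page_pool_dev_alloc_frag(pool, &offset, rx_buf_len);
	dma  = page_pool_get_dma_addr(page) + offset;
	/* post (dma, rx_buf_len) to the RX ring */

With that picture, I'd read the ">4x" remark as: once chunk_sz grows
beyond ~4 * rx_buf_len, the bigger chunk mostly adds fragmentation
without buying more aggregation. Is that the intended reading?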
> Though, admittedly, I think the user only needs to know max-rx-buf-len,
> not necessarily set it.
>
> The last knob is alignment / reuse. For allowing multiple packets in
> one buffer we probably need to distinguish these cases to cater to
> sufficiently clever adapters:
> - previous and next packets are from the same flow and
> - within one GRO session
> - previous had PSH set (or closed the GRO for another reason,
> this is to allow realigning the buffer on GRO session close)
> or
> - the device doesn't know further distinctions / HW-GRO
> - previous and next are from different flows
> And the actions (for each case separately) are one of:
> - no reuse allowed (release buffer = -1?)
> - reuse but must align (align to = N)
> - reuse don't align (pack = 0)
>
I am assuming that different HW will support a subset of these
actions and/or they will apply differently in each case (hence the 4
knobs?).
For example, in mlx5 the actions would work only for the second case
(at the end of a GRO session).
> So to restate, do we need:
> - "page order" control
> - max-rx-buf-len
> - 4 alignment knobs?
>
We do need at least 1 alignment knob.
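To make sure I read the 4 cases x 3 actions the same way, here is a
hypothetical encoding (nothing like this exists in the uAPI today):

	enum rx_buf_reuse_action {
		RX_BUF_RELEASE	= -1,	/* no reuse allowed */
		RX_BUF_PACK	= 0,	/* reuse, don't align */
		/* N > 0: reuse, but align the next placement to N bytes */
	};

	struct rx_buf_reuse_cfg {
		int same_flow_in_gro;		/* within one GRO session */
		int same_flow_gro_closed;	/* previous had PSH / GRO closed */
		int same_flow_unknown;		/* device can't tell / no HW-GRO */
		int other_flow;			/* previous and next from different flows */
	};

In the mlx5 case I mentioned above, only same_flow_gro_closed would be
actionable.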
> Corner cases
> ============
> I. Non-power of 2 buffer sizes
> ------------------------------
> Looks like multiple devices are limited by the width of their length
> fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
> Should we allow applications to configure the buffer to
>
> power of 2 - alignment
>
> ? It will probably annoy the page pool code a bit. I guess for now
> we should just make sure that uAPI doesn't bake in the idea that
> buffers are always a power of 2.
What if the hardware uses a log2 scheme to represent the buffer
length? Then it would still need to round the length down to a power of 2?
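For example (ilog2() is the existing kernel helper; the numbers are just
an illustration of the concern):

	unsigned int requested = 32 * 1024 - 64;	 /* "power of 2 - alignment" */
	unsigned int effective = 1U << ilog2(requested); /* rounds down to 16kB */

i.e. on such HW, asking for just under a power of 2 would halve the
usable buffer size.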
>
> II. Fractional page sizes
> -------------------------
> If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
> should we support chunking devmem/io_uring memory into less than a PAGE_SIZE?
Thanks,
Dragos