Message-ID: <20251016184031.66c92962@kernel.org>
Date: Thu, 16 Oct 2025 18:40:31 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Pavel Begunkov <asml.silence@...il.com>, netdev@...r.kernel.org, Andrew
Lunn <andrew@...n.ch>, davem@...emloft.net, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Simon Horman
<horms@...nel.org>, Donald Hunter <donald.hunter@...il.com>, Michael Chan
<michael.chan@...adcom.com>, Pavan Chebbi <pavan.chebbi@...adcom.com>,
Jesper Dangaard Brouer <hawk@...nel.org>, John Fastabend
<john.fastabend@...il.com>, Stanislav Fomichev <sdf@...ichev.me>, Joshua
Washington <joshwash@...gle.com>, Harshitha Ramamurthy
<hramamurthy@...gle.com>, Jian Shen <shenjian15@...wei.com>, Salil Mehta
<salil.mehta@...wei.com>, Jijie Shao <shaojijie@...wei.com>, Sunil Goutham
<sgoutham@...vell.com>, Geetha sowjanya <gakula@...vell.com>, Subbaraya
Sundeep <sbhatta@...vell.com>, hariprasad <hkelam@...vell.com>, Bharat
Bhushan <bbhushan2@...vell.com>, Saeed Mahameed <saeedm@...dia.com>, Tariq
Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>, Alexander Duyck
<alexanderduyck@...com>, kernel-team@...a.com, Ilias Apalodimas
<ilias.apalodimas@...aro.org>, Joe Damato <joe@...a.to>, David Wei
<dw@...idwei.uk>, Willem de Bruijn <willemb@...gle.com>, Breno Leitao
<leitao@...ian.org>, Dragos Tatulea <dtatulea@...dia.com>,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, Jonathan Corbet
<corbet@....net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large
buffer providers
On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> I think what you're saying is what I was trying to say, but you said
> it more eloquently and generically correct. I'm not familiar with the
> GRO packing you're referring to, so I just assumed the 'buffer sizes
> actually posted to the NIC' are the 'buffer sizes we end up seeing in
> the skb frags'.
I don't think that code path exists today; the buffers posted are the
frags in the skb. But that's easily fixable.
> I guess what I'm trying to say in a different way, is: there are lots
> of buffer sizes in the rx path, AFAICT, at least:
>
> 1. The size of the allocated netmems from the pp.
> 2. The size of the buffers posted to the NIC (which will be different
> from #1 if the driver uses page_pool_fragment_netmem or some other
> trick like hns3's).
> 3. The size of the frags that end up in the skb (which will be
> different from #2 for GRO/other things I don't fully understand).
>
> ...and I'm not sure what rx-buf-len should actually configure. My
> thinking is that it probably should configure #3, since that is what
> the user cares about; I agree with that.
>
> IIRC when I last looked at this a few weeks ago, I think as written
> this patch series makes rx-buf-len actually configure #1.
#1 or #2. #1 for otx2. For the RFC bnxt implementation they were
equivalent. But hns3's reading would be that it's #2.
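To make the #1 vs #2 distinction concrete - roughly what an hns3-style
trick looks like with the page pool frag API. Sketch only; my_rxq and
my_post_buffer are made up:

  /* Sketch: #1 != #2 via page_pool fragmentation.
   * #1 = size of the netmem the pool allocates (order-2, 16kB here),
   * #2 = size of each buffer actually posted to the NIC (4kB here).
   */
  static int my_post_rx_buffers(struct my_rxq *rxq)
  {
        struct page_pool *pp = rxq->page_pool; /* created w/ .order = 2 */
        netmem_ref netmem;
        int i;

        netmem = page_pool_alloc_netmems(pp, GFP_ATOMIC);   /* size #1 */
        if (!netmem)
                return -ENOMEM;

        /* one pp ref per 4kB slice so they can be released separately */
        page_pool_fragment_netmem(netmem, 4);

        for (i = 0; i < 4; i++)                             /* size #2 */
                my_post_buffer(rxq, netmem, i * SZ_4K, SZ_4K);
        return 0;
  }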
From the user's PoV neither #1 nor #2 is particularly meaningful.
Assuming the driver can fragment, #1 only configures the memory
accounting blocks. #2 configures the buffers passed to the HW, but some
HW can pack payloads into a single buf to save memory. Which means that
if the previous frame was small and ate some of a page, a subsequent
large frame of size M may not fit into a single buf of size X, even if
M < X.
So I think the full set of parameters we should define would include
what you defined as #1 and #2. And on top of that we need some kind of
min alignment enforcement. David Wei mentioned that one of his main use
cases is ZC of a buffer which is then sent to storage, which has strict
alignment requirements. And some NICs will internally fragment the
page.. Maybe let's define the expected device behavior..
Device models
=============
Assume we receive two 5kB packets; "x" means bytes from the first
packet, "y" means bytes from the second packet.
A. Basic-scatter
----------------
A packet uses one or more buffers, so there's a 1:n mapping between
packets and buffers.
                  unused space
                   v
  1kB  [xx] [xx] [x ] [yy] [yy] [y ]
  16kB [xxxxx           ] [yyyyy           ]
B. Multi-packet
---------------
The configurations above are still possible, but we can configure
the device to place multiple packets in a large page:
                  unused space
                   v
  16kB, 2kB  [xxxxx |yyyyy |...] [..................]
                    ^
              alignment / stride
We can probably assume that this model always comes with alignment,
because DMA'ing frames at odd offsets is a bad idea. Also note that
packets smaller than the alignment can get scattered across multiple
bufs.
C. Multi-packet HW-GRO
----------------------
For completeness, I guess. We need a third packet here. Assume the
x-packet and z-packet are from the same flow and GRO session, and the
y-packet is not. (Good?) HW-GRO gives us out-of-order placement, and
hopefully in this case we do want to pack:
  16kB, 2kB  [xxxxxzzzzz |.......] [yyyyy.............]
                         ^
                 alignment / stride
End of sidebar. I think / hope these are all practical buffer layouts
we need to care about.
What does the user care about? Presumably three things:
a) efficiency of memory use (larger pages == more chance of low fill)
b) max size of a buffer (larger buffer = fewer iovecs to pass around)
c) alignment
I don't think we can make these map 1:1 to any of the knobs we discussed
at the start. (b) is really neither #1 (if the driver fragments) nor #2
(if SW GRO can glue buffers back together).
We could simply let the user control #1 - basically the user's setting
overrides the places where the driver would previously use PAGE_SIZE.
I think this is what Stan suggested long ago as well.
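In the driver that'd be little more than (sketch; qcfg->page_order
standing in for wherever the new uAPI value gets plumbed to):

  /* Sketch: user-set "page order" replacing the hardcoded order 0
   * (i.e. PAGE_SIZE) the driver passes to the page pool today.
   */
  struct page_pool_params pp_params = {
        .order          = qcfg->page_order,     /* was 0 == PAGE_SIZE */
        .pool_size      = ring_size,
        .nid            = NUMA_NO_NODE,
        .dev            = &pdev->dev,
        .dma_dir        = DMA_FROM_DEVICE,
  };
  struct page_pool *pool = page_pool_create(&pp_params);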
But I wonder if the user still needs to know #2 (rx-buf-len), because
practically speaking, setting the page size to >4x rx-buf-len likely
means a lot more fragmentation for little extra aggregation..?
Though, admittedly, I think the user only needs to know max-rx-buf-len,
not necessarily set it.
The last knob is alignment / reuse. To allow multiple packets in
one buffer we probably need to distinguish these cases, to cater to
sufficiently clever adapters:
- previous and next packets are from the same flow and
  - within one GRO session
  - previous had PSH set (or closed the GRO for another reason,
    this is to allow realigning the buffer on GRO session close)
  or
  - the device doesn't know further distinctions / HW-GRO
- previous and next are from different flows
And the actions (for each case separately) are one of:
- no reuse allowed (release buffer = -1?)
- reuse but must align (align to = N)
- reuse, don't align (pack = 0)
So to restate, do we need:
- "page order" control
- max-rx-buf-len
- 4 alignment knobs?
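FWIW, as a strawman, all of the above expressed as one config struct
(entirely hypothetical, not proposed uAPI):

  /* Strawman only. */
  struct rx_buf_cfg {
        u32 page_order;       /* "page order" - pp allocation size (#1) */
        u32 max_rx_buf_len;   /* HW buffer limit (#2), mostly read-only */

        /* alignment / reuse policy, one knob per case from the list
         * above: -1 = no reuse, 0 = pack w/o aligning, N = align to N
         */
        s32 reuse_same_flow_in_gro;
        s32 reuse_same_flow_gro_closed;
        s32 reuse_same_flow_no_info;
        s32 reuse_diff_flow;
  };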
Corner cases
============
I. Non-power of 2 buffer sizes
------------------------------
Looks like multiple devices are limited by the width of their length
fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
Should we allow applications to configure the buffer to
  power of 2 - alignment
? It will probably annoy the page pool code a bit. I guess for now
we should just make sure that the uAPI doesn't bake in the idea that
buffers are always a power of 2.
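The driver-side check for such a device could look like (sketch; the
15-bit length field and all names are invented):

  #define MY_HW_RX_BUF_LEN_MAX  ((32 * 1024) - 1) /* 15-bit len field */

  static int my_validate_rx_buf_len(u32 len, u32 align)
  {
        if (len > MY_HW_RX_BUF_LEN_MAX)
                return -ERANGE;
        /* allow "power of 2 - alignment" sizes */
        if (!is_power_of_2(len + align))
                return -EINVAL;
        return 0;
  }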
II. Fractional page sizes
-------------------------
If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
should we support chunking devmem/io_uring buffers into less than
PAGE_SIZE?
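If we did, the carving itself would be trivial (sketch, with
max_rx_buf_len as queried from the device):

  /* Sketch: number of HW-sized chunks per 64kB page. */
  u32 chunk_size = min_t(u32, PAGE_SIZE, max_rx_buf_len); /* 16k */
  u32 chunks_per_page = PAGE_SIZE / chunk_size;           /* 4 */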