Message-ID: <20251022170909.70f1d1e7@kernel.org>
Date: Wed, 22 Oct 2025 17:09:09 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Dragos Tatulea <dtatulea@...dia.com>
Cc: Mina Almasry <almasrymina@...gle.com>, Pavel Begunkov
<asml.silence@...il.com>, netdev@...r.kernel.org, Andrew Lunn
<andrew@...n.ch>, davem@...emloft.net, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, Donald
Hunter <donald.hunter@...il.com>, Michael Chan <michael.chan@...adcom.com>,
Pavan Chebbi <pavan.chebbi@...adcom.com>, Jesper Dangaard Brouer
<hawk@...nel.org>, John Fastabend <john.fastabend@...il.com>, Stanislav
Fomichev <sdf@...ichev.me>, Joshua Washington <joshwash@...gle.com>,
Harshitha Ramamurthy <hramamurthy@...gle.com>, Jian Shen
<shenjian15@...wei.com>, Salil Mehta <salil.mehta@...wei.com>, Jijie Shao
<shaojijie@...wei.com>, Sunil Goutham <sgoutham@...vell.com>, Geetha
sowjanya <gakula@...vell.com>, Subbaraya Sundeep <sbhatta@...vell.com>,
hariprasad <hkelam@...vell.com>, Bharat Bhushan <bbhushan2@...vell.com>,
Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>, Mark
Bloch <mbloch@...dia.com>, Alexander Duyck <alexanderduyck@...com>,
kernel-team@...a.com, Ilias Apalodimas <ilias.apalodimas@...aro.org>, Joe
Damato <joe@...a.to>, David Wei <dw@...idwei.uk>, Willem de Bruijn
<willemb@...gle.com>, Breno Leitao <leitao@...ian.org>,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, Jonathan Corbet
<corbet@....net>
Subject: Re: [PATCH net-next v4 00/24][pull request] Queue configs and large
buffer providers
On Wed, 22 Oct 2025 13:17:43 +0000 Dragos Tatulea wrote:
> On Thu, Oct 16, 2025 at 06:40:31PM -0700, Jakub Kicinski wrote:
> > On Wed, 15 Oct 2025 10:44:19 -0700 Mina Almasry wrote:
> > > I think what you're saying is what I was trying to say, but you said
> > > it more eloquently and generically correct. I'm not familiar with the
> > > GRO packing you're referring to, so I just assumed the 'buffer sizes
> > > actually posted to the NIC' are the 'buffer sizes we end up seeing in
> > > the skb frags'.
> >
> > I don't think that code path exists today; the buffers posted are the
> > frags in the skb. But that's easily fixable.
> >
> > > I guess what I'm trying to say in a different way, is: there are lots
> > > of buffer sizes in the rx path, AFAICT, at least:
> > >
> > > 1. The size of the allocated netmems from the pp.
> > > 2. The size of the buffers posted to the NIC (which will be different
> > > from #1 if the driver uses page_pool_fragment_netmem or some other
> > > trick like hns3 does).
> > > 3. The size of the frags that end up in the skb (which will be
> > > different from #2 for GRO/other things I don't fully understand).
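> > >
> > > (A sketch of where #1 and #2 diverge, using the existing page_pool
> > > frag API; rx_buf_len is just an illustrative variable here, not a
> > > real field:)
> > >
> > > 	unsigned int offset;
> > > 	/* #1: the pp carves fragments out of a full page it allocated */
> > > 	struct page *page = page_pool_dev_alloc_frag(pool, &offset,
> > > 						     rx_buf_len);
> > > 	if (!page)
> > > 		return -ENOMEM;
> > > 	/* #2: only the rx_buf_len slice at offset gets posted to the NIC */
> > > 	dma_addr_t dma = page_pool_get_dma_addr(page) + offset;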
> > >
> > > ...and I'm not sure what rx-buf-len should actually configure. My
> > > thinking is that it probably should configure #3, since that is what
> > > the user cares about; I agree with that.
> > >
> > > IIRC when I last looked at this a few weeks ago, I think as written
> > > this patch series makes rx-buf-len actually configure #1.
> >
> > #1 or #2. #1 for otx2. For the RFC bnxt implementation they were
> > equivalent. But hns3's reading would be that it's #2.
> >
> > From the user's PoV neither #1 nor #2 is particularly meaningful.
> > Assuming the driver can fragment - #1 only configures the memory
> > accounting blocks. #2 configures the buffers passed to the HW, but some
> > HW can pack payloads into a single buf to save memory. Which means that
> > if the previous frame was small and ate part of a page, a subsequent
> > large frame of size M may not fit into a single buf of size X, even if
> > M < X.
> >
> > So I think the full set of parameters we should define would be
> > what you defined as #1 and #2. And on top of that we need some kind of
> > min alignment enforcement. David Wei mentioned that one of his main use
> > cases is ZC of a buffer which is then sent to storage, which has strict
> > alignment requirements. And some NICs will internally fragment the
> > page.. Maybe let's define the expected device behavior..
> >
> > Device models
> > =============
> > Assume we receive two 5kB packets, where "x" means bytes from the first
> > packet and "y" means bytes from the second packet.
> >
> > A. Basic-scatter
> > ----------------
> > A packet uses one or more buffers, i.e. a 1:n mapping between packets
> > and buffers.
> >              unused space
> >                   v
> >  1kB  [xx] [xx] [x ] [yy] [yy] [y ]
> > 16kB  [xxxxx           ] [yyyyy           ]
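> >
> > (Completion-side sketch for this model: each buffer the HW consumed
> > becomes one skb frag; everything except skb_add_rx_frag() is an
> > illustrative name:)
> >
> > 	/* 1:n mapping - append every consumed buffer as a frag */
> > 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
> > 			offset, frag_len, truesize);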
> >
> > B. Multi-packet
> > ---------------
> > The configurations above are still possible, but we can configure
> > the device to place multiple packets in a large page:
> >
> >             unused space
> >                  v
> > 16kB, 2kB [xxxxx  |yyyyy  |...] [..................]
> >                   ^
> >            alignment / stride
> >
> > We can probably assume that this model always comes with alignment
> > because DMA'ing frames at odd offsets is a bad idea. And also note
> > that packets smaller than the alignment can get scattered to multiple
> > bufs.
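> >
> > (The placement math for this model as a hypothetical driver sketch;
> > rx_next_buf() is invented, ALIGN() is the usual kernel macro:)
> >
> > 	/* pad the write offset out to the next stride boundary */
> > 	offset = ALIGN(offset + pkt_len, stride); /* ALIGN(5K, 2K) == 6K */
> > 	if (buf_len - offset < stride) {
> > 		/* no room for another aligned packet, open a fresh buf */
> > 		buf = rx_next_buf(rq);
> > 		offset = 0;
> > 	}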
> >
> > C. Multi-packet HW-GRO
> > ----------------------
> > For completeness, I guess. We need a third packet here. Assume the
> > x-packet and z-packet are from the same flow and GRO session, while the
> > y-packet is not. (Good?) HW-GRO gives us out-of-order placement and
> > hopefully in this case we do want to pack:
> >
> > 16kB, 2kB [xxxxxzzzzz  |.......] [xxxxx.............]
> >                        ^
> >                 alignment / stride
> >
>                                     ^^^^^
> is this y?
Yes, my bad
> Not sure I understand this last representation: if x and z are 5kB
> packets each and the stride size is 2kB, they should occupy 5 strides:
>
> 16kB, 2kB [xx|xx|xz|zz|zz|.......] [yy|yy|y |............]
>
> I think I understand the point, just making sure that I got it straight.
> Did I?
Yes, that's right. I was trying to (poorly) express that the alignment
is not:
16kB, 2kB [xx|xx|x |zz|zz|z |.....] [yy|yy|y |............]
IOW that HW-GRO is expected to pack (at least by default).
> > End of sidebar. I think / hope these are all practical buffer layouts
> > we need to care about.
> >
> >
> > What does the user care about? Presumably three things:
> > a) efficiency of memory use (larger pages == more chance of low fill)
> > b) max size of a buffer (larger buffer = fewer iovecs to pass around)
> > c) alignment
> > I don't think we can make these map 1:1 to any of the knobs we discussed
> > at the start. (b) is really neither #1 (if the driver fragments) nor #2
> > (if SW GRO can glue buffers back together).
> >
> > We could simply let the user control #1 - basically the user's setting
> > overrides the places where the driver would previously use PAGE_SIZE.
> > I think this is what Stan suggested long ago as well.
> >
> > But I wonder if the user still needs to know #2 (rx-buf-len) because,
> > practically speaking, setting the page size to >4x rx-buf-len likely
> > means a lot more fragmentation for little extra aggregation (a single
> > long-held frag pins the whole page until all its frags are released).. ?
> So how would rx-buf-len be configured then? Who gets to decide if not the
> user: the driver or the kernel?
Driver.
> I don't understand what you mean by "setting page size >4x the size of
> rx-buf-len". I thought it was the other way around: rx-buf-len is an
> order of page size. Or am I stuck in the mindset of the old proposal?
Yes, rx-buf-len is a good match for model A (basic-scatter).
But in other models it becomes a bit difficult to define the exact
semantics.
> > Tho, admittedly, I think the user only needs to know max-rx-buf-len,
> > not necessarily set it.
> >
> > The last knob is alignment / reuse. To allow multiple packets in
> > one buffer we probably need to distinguish these cases, to cater to
> > sufficiently clever adapters:
> > - previous and next packets are from the same flow and
> >   - within one GRO session
> >   - previous had PSH set (or closed the GRO for another reason,
> >     this is to allow realigning the buffer on GRO session close)
> >   or
> >   - the device doesn't know further distinctions / HW-GRO
> > - previous and next are from different flows
> > And the actions (for each case separately) are one of:
> > - no reuse allowed (release buffer = -1?)
> > - reuse but must align (align to = N)
> > - reuse don't align (pack = 0)
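> >
> > (To make the shape concrete, a hypothetical encoding - none of these
> > names exist anywhere, one action value per case:)
> >
> > 	/* -1: no reuse; 0: pack, don't align; N: align next to N bytes */
> > 	struct rx_buf_reuse_cfg {
> > 		int same_flow_in_gro_session;
> > 		int same_flow_gro_closed;	/* prev had PSH etc. */
> > 		int same_flow_no_gro_info;	/* device can't tell */
> > 		int different_flows;
> > 	};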
> >
> I am assuming that different HW will support a subset of these
> actions and/or they will apply differently in each case (hence the 4
> knobs?).
Yup! All the knobs are optional, we can also define extra ones if use
cases come up and HW can support it. Hopefully we can get selftests
to validate the devices behave as configured.
> For example, in mlx5 the actions would work only for the second case
> (at the end of a GRO session).
Nice, I think that's the most useful one.
Not sure whether we should define all 4 from the start, or just
document them as a "plan": state what the implicit default is expected
to be (e.g. HW-GRO packs within a session) and then add knobs as
HW/users come around.
> > So to restate, do we need:
> > - "page order" control
> > - max-rx-buf-len
> > - 4 alignment knobs?
> >
> We do need at least 1 alignment knob.
>
> > Corner cases
> > ============
> > I. Non-power of 2 buffer sizes
> > ------------------------------
> > Looks like multiple devices are limited by the width of their length
> > fields, making the max buffer size something like 32kB - 1 or 64kB - 1.
> > Should we allow applications to configure the buffer to
> >
> > power of 2 - alignment
> >
> > ? It will probably annoy the page pool code a bit. I guess for now
> > we should just make sure that uAPI doesn't bake in the idea that
> > buffers are always power of 2.
> What if the hardware uses a log scheme to represent the buffer
> length? Then it would still need to align down to the next power of 2?
Yes, that's fine, pow-of-2 should obviously work. I was trying to say
that we shouldn't _require_ power-of-2 in uAPI, because devices that
are limited to (power-of-2 - 1) would strand ~half of the max length.
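Worked example (numbers are illustrative): assume a 15-bit HW length
field, so the device caps out at 32kB - 1, and a 2kB alignment
requirement:

	#define HW_MAX_LEN	((32 * 1024) - 1)	/* 15-bit field */
	#define RX_ALIGN	(2 * 1024)
	/* rounddown() yields 30kB; requiring pow-of-2 would force 16kB */
	unsigned int buf_len = rounddown(HW_MAX_LEN, RX_ALIGN);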
> > II. Fractional page sizes
> > -------------------------
> > If the HW has a max-rx-buf-len of 16k or 32k, and PAGE_SIZE is 64k,
> > should we support chunking devmem/io_uring into less than a PAGE_SIZE?
Thanks for the comments!