lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aL8BSjUbxYvvvZyD@devvm11784.nha0.facebook.com>
Date: Mon, 8 Sep 2025 09:16:10 -0700
From: Bobby Eshleman <bobbyeshleman@...il.com>
To: Stanislav Fomichev <stfomichev@...il.com>
Cc: Mina Almasry <almasrymina@...gle.com>,
	Matthew Wilcox <willy@...radead.org>,
	Samiullah Khawaja <skhawaja@...gle.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Simon Horman <horms@...nel.org>,
	Kuniyuki Iwashima <kuniyu@...gle.com>,
	Willem de Bruijn <willemb@...gle.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	David Ahern <dsahern@...nel.org>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, Stanislav Fomichev <sdf@...ichev.me>,
	Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next 2/2] net: devmem: use niov array for token
 management

On Fri, Sep 05, 2025 at 01:38:39PM -0700, Stanislav Fomichev wrote:
> On 09/05, Mina Almasry wrote:
> > On Wed, Sep 3, 2025 at 5:15 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
> > >
> > > On Wed, Sep 03, 2025 at 01:20:57PM -0700, Mina Almasry wrote:
> > > > On Tue, Sep 2, 2025 at 2:36 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:

[...]

> > > >
> > > > AFAIU, if you made sk_user_frags an array of (unref, binding) tuples
> > > > instead of just an array of urefs then you can remove the
> > > > single-binding restriction.
> > > >
> > > > Although, I wonder what happens if the socket receives the netmem at
> > > > the same index on 2 different dmabufs. At that point I assume the
> > > > wrong uref gets incremented? :(
> > > >
> > >
> > > Right. We need some bits to differentiate bindings. Here are some ideas
> > > I've had about this, I wonder what your thoughts are on them:
> > >
> > > 1) Encode a subset of bindings and wait for availability if the encoding
> > > space becomes exhausted. For example, we could encode the binding in 5
> > > bits for outstanding references across 32 bindings and 27 bits (512 GB)
> > > of dmabuf. If recvmsg wants to return a reference to a 33rd binding, it
> > > waits until the user returns enough tokens to release one of the binding
> > > encoding bits (at which point it could be reused for the new reference).
> > >
> > 
> > This, I think, sounds reasonable. supporting up to 2^5 rx dmabuf
> > bindings at once and 2^27 max dmabuf size should be fine for us I
> > think. Although you have to be patient with me, I have to make sure
> > via tests and code inspection that these new limits will be OK. Also
> > please understand the risk that even if the changes don't break us,
> > they may break someone and have to be reverted anyway, although I
> > think the risk is small.
> > 
> > Another suggestion I got from the team is to use a bitmap instead of
> > an array of atomics. I initially thought this could work, but thinking
> > about it more, I think that would not work, no? Because it's not 100%
> > guaranteed that the socket will only get 1 ref on a net_iov. In the
> > case where the driver fragments the net_iov, multiple difference frags
> > could point to the same net_iov which means multiple refs. So it seems
> > we're stuck with an array of atomic_t.
> > 

Yes that is correct, multiple references can occur so the bitmap
will lose all > 1 references.

> > > 2) opt into an extended token (dmabuf_token_v2) via sockopts, and add
> > > the binding ID or other needed information there.
> > >
> > 
> > Eh, I would say this is an overkill? Today the limit of dma-bufs
> > supported is 2^27 and I think the dmabuf size is technically 2^32 or
> > something, but I don't know that we need all this flexibility for
> > devmem tcp. I think adding a breakdown like above may be fine.
> > 

That's my inclination too.

> > > > One way or another the single-binding restriction needs to be removed
> > > > I think. It's regressing a UAPI that currently works.
> > > >
> > 
> > Thinking about this more, if we can't figure out a different way and
> > have to have a strict 1 socket to 1 dma-buf mapping, that may be
> > acceptable...
> > 
> > ...the best way to do it is actually to do this, I think, would be to
> > actually make sure the user can't break the mapping via `ethtool -N`.
> > I.e. when the user tells us to update or delete a flow steering rule
> > that belongs to a devmem socket, reject the request altogether. At
> > that point we could we can be sure that the mapping would not change
> > anyway. Although I don't know how feasible to implement this is.
> > 

I will look at this and see if it is possible, I do think that would be
nicer for the user.

> > AFAICT as well AF_XDP is in a similar boat to devmem in this regard.
> > The AF_XDP docs require flow steering to be configured for the data to
> > be available in the umem (and I assume, if flow steering is
> > reconfigured then the data disappears from the umem?). Stan do you
> > know how this works? If AF_XDP allows the user to break it by
> > reconfiguring flow steering it may also be reasonable to allow the
> > user to break a devmem socket as well (although maybe with
> > clarification in the docs.
> 
> For af_xdp, there is a check in xsk_rcv_check (that runs _after_ bpf
> program that does the redirect exits) and it drops the packet if
> the packet is redirected to the unxpected queue (not the one that
> the socket was bound to). And the only way to observe it after the fact
> is a drop counter on the nic (or, rather, depends on the nic wrt how it
> accounts it). I remember Willem wanted to remove that restriction,
> but looks like he never got to it?
> 
> tl;dr we don't error out explicitly when the user misconfigures the steering
> after the fact (or initially) and drop the packet in the data path.

Sounds like a reasonable default if the explicit ethtool rejection isn't
feasible.

Thanks for the review and comments!

Best,
Bobby

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ