netdev - Re: [RFC] net: openvswitch: Inroduce a light-weight socket map concept.

Message-ID: <CAG=2xmMzJEReLU13zCGR0cvXgkx3NnxuzhXL_s+2iLGA=pxyqQ@mail.gmail.com>
Date: Thu, 10 Jul 2025 04:18:40 -0500
From: Adrián Moreno <amorenoz@...hat.com>
To: Aaron Conole <aconole@...hat.com>
Cc: Cong Wang <xiyou.wangcong@...il.com>, dev@...nvswitch.org, netdev@...r.kernel.org, 
	Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Eelco Chaudron <echaudro@...hat.com>, Ilya Maximets <i.maximets@....org>, 
	Mike Pattrick <mpattric@...hat.com>, Florian Westphal <fw@...len.de>, 
	John Fastabend <john.fastabend@...il.com>, Jakub Sitnicki <jakub@...udflare.com>, 
	Joe Stringer <joe@....org>
Subject: Re: [RFC] net: openvswitch: Inroduce a light-weight socket map concept.

On Mon, Jun 30, 2025 at 08:34:56AM -0400, Aaron Conole wrote:
> Cong Wang <xiyou.wangcong@...il.com> writes:
>
> > On Fri, Jun 27, 2025 at 05:00:54PM -0400, Aaron Conole wrote:
> >> The Open vSwitch module allows a user to implement a flow-based
> >> layer 2 virtual switch.  This is quite useful to model packet
> >> movement analogous to programmable physical layer 2 switches.
> >> But the openvswitch module doesn't always strictly operate at
> >> layer 2, since it implements higher layer concerns, like
> >> fragmentation reassembly, connection tracking, TTL
> >> manipulations, etc.  Rightly so, it isn't *strictly* a layer
> >> 2 virtual forwarding function.
> >>
> >> Other virtual forwarding technologies allow for additional
> >> concepts that 'break' this strict layer separation beyond
> >> what the openvswitch module provides.  The most handy one for
> >> openvswitch to start looking at is the concept of the socket
> >> map, from eBPF.  This is very useful for TCP connections,
> >> since in many cases we will do container<->container
> >> communication (although this can be generalized for the
> >> phy->container case).
> >>
> >> This patch provides two different implementations of actions
> >> that can be used to construct the same kind of socket map
> >> capability within the openvswitch module.  There are additional
> >> ways of supporting this concept that I've discussed offline,
> >> but want to bring it all up for discussion on the mailing list.
> >> This way, "spirited debate" can occur before I spend too much
> >> time implementing specific userspace support for an approach
> >> that may not be acceptable.  I did 'port' these from
> >> implementations that I had done some preliminary testing with
> >> but no guarantees that what is included actually works well.
> >>
> >> For all of these, they are implemented using raw access to
> >> the tcp socket.  This isn't ideal, and a proper
> >> implementation would reuse the psock infrastructure - but
> >> I wanted to get something that we can all at least poke (fun)
> >> at rather than just being purely theoretical.  Some of the
> >> validation that we may need (for example re-writing the
> >> packet's headers) has been omitted to hopefully make the
> >> implementations a bit easier to parse.  The idea would be
> >> to validate these in the validate_and_copy routines.
> >
> > Maybe it is time to introduce eBPF to openvswitch so that they can
> > share, for example, socket maps, from other layers?
>
> Hi Cong,
>
> Yes, it's a good thought.  I don't really know the best way to introduce
> it.  What I mean is, I'd rather completely switch to using something
> like a set of XDP hooks all the way through, but currently OVS datapath
> is still too complex to express in eBPF, from my understanding.
> Hopefully that can change in the future, and then we can seriously
> consider a larger effort like that.
>
> But let's say we try a partial approach - for example, we can add
> something like an rx side hook at ovs_vport_receive that builds an
> xdp_buff object and passes it along.  That can let us use this one
> specific use case, but it may not give the same kind of "XDP"
> performance boost that people are used to seeing (if that matters) -
> we're long past the time where an skb has been built and may need to
> coalesce or ignore some skb from the driver.  I guess it's still doable,
> and then the userspace can just attach the eBPF programs it wants.
> Probably, we would add the xdp program to the vport object instead of to
> the underlying device?  In that case sketching it in ovs_vport_receive()
> might look like:
>
>   {
>   	struct bpf_prog *xdp_prog;
>   	....
>
>   	xdp_prog = rcu_dereference(vport->xdp_prog);
>   	if (xdp_prog) {
>   		struct xdp_buff xdpb = { ... }; /* Set up buff */
>   		int result;
>
>   		result = bpf_prog_run(xdp_prog, &xdpb);
>   		switch (result) {
>   		... /* break for XDP_PASS, and 'exit' for others */
>   		}
>   	}
>
>   	/* Extract flow from 'skb' into 'key'. */
>   	error = ovs_flow_key_extract(tun_info, skb, &key);
>
>   }
>
> Then I guess we can just do all the accounting, etc. in userspace.  But
> I'm not sure whether that's better than being able to implement the
> entire datapath in XDP from the start.  Although, it does solve the
> immediate idea - getting access to bpf sockmap.  Not sure what others
> think about that approach.  It's important to note that currently, OVS
> userspace would need quite a bit more work to deal with this since it
> isn't in the 'flow' that OVS is set up to deal with, so things like
> accounting need to be changed on that side anyway.  The kernel hook for
> it could be easier to add overall.

Is there a chance to add it as a part of the netdev-offload API? If so,
we could at least reuse the existing accounting infrastructure.

This new offload provider would keep track of sockets per vport by
reusing sockmap infra.
As new flows are "put" (i.e., "offloaded"), it would have to keep track
of the recirculation chain that leads to an output action on a vport
where a matching socket also exists. This tracking is costly, but we need
to do it anyway, right? Otherwise, how would userspace be able to produce
"socket(NS,INO,recirc(0x1))"?

When one of these socket mappings is detected, the XDP prog would be
instructed to redirect packets directly and stats would be added to all
flows in the chain.
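
To make the tracking a bit more concrete, here is a rough, userspace-style
sketch in C. Everything in it is hypothetical: the sk_flow and sk_flow_table
types, the sk_* helpers and the per-vport "has socket" array are made up for
illustration and are not existing OVS or kernel APIs. It also assumes a
single flow per recirculation id, which real flow tables do not guarantee.
It only shows the shape of the bookkeeping: follow the recirculation chain
from a starting id to an output vport, check that the vport has a matching
socket, and then credit stats to every flow in the chain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FLOWS 16
#define MAX_CHAIN 8
#define NO_RECIRC UINT32_MAX

/* Hypothetical, heavily simplified view of an offloaded flow. */
struct sk_flow {
	uint32_t recirc_id;      /* recirculation id the flow matches on */
	uint32_t next_recirc_id; /* id set by a recirc() action, or NO_RECIRC */
	int out_vport;           /* output vport, or -1 if the flow recirculates */
	uint64_t n_packets;
	uint64_t n_bytes;
};

struct sk_flow_table {
	struct sk_flow flows[MAX_FLOWS];
	size_t n_flows;
};

/* "Put" a flow into the (hypothetical) offload provider's table. */
static bool sk_flow_add(struct sk_flow_table *t, uint32_t recirc_id,
			uint32_t next_recirc_id, int out_vport)
{
	if (t->n_flows == MAX_FLOWS)
		return false;
	t->flows[t->n_flows++] = (struct sk_flow){
		.recirc_id = recirc_id,
		.next_recirc_id = next_recirc_id,
		.out_vport = out_vport,
	};
	return true;
}

/* Assumption: at most one flow per recirculation id. */
static struct sk_flow *sk_flow_lookup(struct sk_flow_table *t, uint32_t id)
{
	for (size_t i = 0; i < t->n_flows; i++)
		if (t->flows[i].recirc_id == id)
			return &t->flows[i];
	return NULL;
}

/*
 * Walk the chain starting at 'start'. Returns the number of flows stored in
 * 'chain' if it ends in an output to a vport that has a socket, 0 otherwise
 * (missing flow, no socket, or chain longer than MAX_CHAIN).
 */
static size_t sk_resolve_chain(struct sk_flow_table *t, uint32_t start,
			       const bool *vport_has_sock, int n_vports,
			       struct sk_flow **chain)
{
	uint32_t cur = start;
	size_t n = 0;

	while (n < MAX_CHAIN) {
		struct sk_flow *f = sk_flow_lookup(t, cur);

		if (!f)
			return 0;
		chain[n++] = f;
		if (f->out_vport >= 0)
			return (f->out_vport < n_vports &&
				vport_has_sock[f->out_vport]) ? n : 0;
		if (f->next_recirc_id == NO_RECIRC)
			return 0;
		cur = f->next_recirc_id;
	}
	return 0;
}

/* Credit redirected traffic to every flow in a resolved chain. */
static void sk_credit_chain(struct sk_flow **chain, size_t n,
			    uint64_t packets, uint64_t bytes)
{
	assert(n <= MAX_CHAIN);
	for (size_t i = 0; i < n; i++) {
		chain[i]->n_packets += packets;
		chain[i]->n_bytes += bytes;
	}
}
```

For example, with a flow on recirc_id 0 that recirculates to 0x1 and a flow
on 0x1 that outputs to a vport with a socket, sk_resolve_chain() would return
a chain of two flows, and sk_credit_chain() would bump the counters of both.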

I haven't thought out the details of this idea but wanted to share it
for the sake of discussion.
Also, maybe there are things that can be made easier by having a higher
level offload API (Eelco might have thoughts on this).

Thanks.
Adrián