Message-ID: <CAG=2xmOXG_5da9+yX0z8hruTqgQxaHzRLVVHZU9M9cmZ475Qqw@mail.gmail.com>
Date: Thu, 10 Jul 2025 02:35:26 -0700
From: Adrián Moreno <amorenoz@...hat.com>
To: Aaron Conole <aconole@...hat.com>
Cc: dev@...nvswitch.org, netdev@...r.kernel.org,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Eelco Chaudron <echaudro@...hat.com>, Ilya Maximets <i.maximets@....org>,
Mike Pattrick <mpattric@...hat.com>, Florian Westphal <fw@...len.de>,
John Fastabend <john.fastabend@...il.com>, Jakub Sitnicki <jakub@...udflare.com>,
Joe Stringer <joe@....org>
Subject: Re: [RFC] net: openvswitch: Introduce a light-weight socket map concept.
On Fri, Jun 27, 2025 at 05:00:54PM -0400, Aaron Conole wrote:
> The Open vSwitch module allows a user to implement a flow-based
> layer 2 virtual switch. This is quite useful for modeling packet
> movement analogous to programmable physical layer 2 switches.
> But the openvswitch module doesn't always strictly operate at
> layer 2, since it implements higher layer concerns, like
> fragmentation reassembly, connection tracking, TTL
> manipulations, etc. Rightly so, it isn't *strictly* a layer
> 2 virtual forwarding function.
>
> Other virtual forwarding technologies allow for additional
> concepts that 'break' this strict layer separation beyond
> what the openvswitch module provides. The most handy one for
> openvswitch to start looking at is the concept of the socket
> map, from eBPF. This is very useful for TCP connections,
> since in many cases we will do container<->container
> communication (although this can be generalized for the
> phy->container case).
>
> This patch provides two different implementations of actions
> that can be used to construct the same kind of socket map
> capability within the openvswitch module. There are additional
> ways of supporting this concept that I've discussed offline,
> but want to bring it all up for discussion on the mailing list.
> This way, "spirited debate" can occur before I spend too much
> time implementing specific userspace support for an approach
> that may not be acceptable. I did 'port' these from
> implementations that I had done some preliminary testing with,
> but there are no guarantees that what is included actually
> works well.
>
> All of these are implemented using raw access to
> the TCP socket. This isn't ideal, and a proper
> implementation would reuse the psock infrastructure - but
> I wanted to get something that we can all at least poke (fun)
> at rather than just being purely theoretical. Some of the
> validation that we may need (for example re-writing the
> packet's headers) has been omitted to hopefully make the
> implementations a bit easier to parse. The idea would be
> to validate these in the validate_and_copy routines.
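On the validate_and_copy point: I assume you mean something along the
lines of the existing validate_and_copy_*() helpers in flow_netlink.c,
e.g. refusing the action when the flow key doesn't pin the packet down
to TCP. A rough sketch of the shape I have in mind (my names, not from
the patch, IPv4 only, attribute/length checks elided):

    /* Hypothetical validation helper, mirroring the shape of the
     * existing validate_and_copy_*() routines; only the TCP check is
     * shown, the header-rewrite and attribute validation would live
     * here as well before the action is copied into *sfa. */
    static int validate_and_copy_sock(struct net *net,
                                      const struct nlattr *attr,
                                      const struct sw_flow_key *key,
                                      struct sw_flow_actions **sfa,
                                      bool log)
    {
            if (key->eth.type != htons(ETH_P_IP) ||
                key->ip.proto != IPPROTO_TCP)
                    return -EINVAL;

            /* ... then copy the action into *sfa as the other
             * helpers do. */
            return 0;
    }

Is that roughly the plan?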
>
> The first option that I'll present is the suite of management
> actions:
> * sock(commit)
> * sock(try)
> * sock(tuple)
>
> These options would take a 5-tuple stamp of the IP packet
> coming in, and preserve it as it traverses any packet
> modifications, recirculations, etc. The idea of how it would
> look in the datapath might be something like:
>
> + recirc_id(0),eth(src=XXXX,dst=YYYY),eth_type(0x800), \
> | ip(src=a.b.c.d,dst=e.f.g.h,proto=6),tcp(sport=AA,dport=BB) \
> | actions:sock(tuple),sock(try,recirc(1))
> \
> + recirc_id(1),{match-action-pairs}
> ...
> + final-action: sock(commit, 2),output(2)
>
> When a packet enters ovs processing, it would have its tuple
> saved, and then be forwarded on to another recirc id (or any
> other actions that might be desired). For the first
> packet, the sock(try,..) action will result in the
> alternative action list path being taken. As the packet
> is 'moving' through the flow table in the kernel, the
> original tuple details for the socket map would not be
> modified even if the flow key is updated and the physical
> packet changed. Finally, the sock(commit,2) will add an
> entry to the internal table recording that the stamped
> tuple should be forwarded to the particular output port.
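Just to check that I'm reading the commit/try semantics right, this is
roughly how I picture the data model behind these actions (plain C,
purely illustrative, all names are mine, and a real implementation
would of course be a proper hash table inside the kernel):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The 5-tuple "stamp" taken by sock(tuple); it stays fixed even
     * if the flow key / packet is rewritten later in the pipeline. */
    struct sock_tuple {
            uint32_t saddr, daddr;          /* IPv4 only for brevity */
            uint16_t sport, dport;
            uint8_t  proto;                 /* IPPROTO_TCP */
    };

    struct sock_map_entry {
            struct sock_tuple tuple;
            uint32_t out_port;              /* set by sock(commit, port) */
            bool     in_use;
    };

    #define SOCK_MAP_SIZE 1024
    static struct sock_map_entry sock_map[SOCK_MAP_SIZE];

    static bool tuple_eq(const struct sock_tuple *a,
                         const struct sock_tuple *b)
    {
            return a->saddr == b->saddr && a->daddr == b->daddr &&
                   a->sport == b->sport && a->dport == b->dport &&
                   a->proto == b->proto;
    }

    /* sock(commit, port): remember that this stamped tuple is to be
     * forwarded to out_port. */
    static void sock_commit(const struct sock_tuple *t, uint32_t out_port)
    {
            for (size_t i = 0; i < SOCK_MAP_SIZE; i++) {
                    if (!sock_map[i].in_use) {
                            sock_map[i].tuple = *t;
                            sock_map[i].out_port = out_port;
                            sock_map[i].in_use = true;
                            return;
                    }
            }
    }

    /* sock(try, nested-actions): on a hit, short-circuit to the
     * remembered port; on a miss (e.g. the first packet), fall back
     * to the nested action list such as recirc(1). */
    static bool sock_try(const struct sock_tuple *t, uint32_t *out_port)
    {
            for (size_t i = 0; i < SOCK_MAP_SIZE; i++) {
                    if (sock_map[i].in_use &&
                        tuple_eq(&sock_map[i].tuple, t)) {
                            *out_port = sock_map[i].out_port;
                            return true;
                    }
            }
            return false;
    }

Is that roughly the intended behavior on a hit, i.e. the packet skips
the rest of the pipeline and goes straight to the committed port?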
>
> The advantage of this suite of primitives is that the
> userspace implementation may be a bit simpler. Since
> the table entries and their management are handled internally
> by these actions, userspace is free to blanket-adjust
> all of the TCP flows destined for a specific
> output port by inserting this special tuple/try set
> without much additional logic for tracking
> connections. Another advantage is that userspace
> doesn't need to peer into a specific netns and pull
> socket details to find matching tuples.
What if there is no match on TCP headers at all? Would userspace also
add these actions?
I guess you could argue that, if OVS is not being asked to do L4, this
feature would not apply, but what if it's just the first flow that
doesn't have an L4 match? How would userspace know how to add the action?
>
> However, it means we need to keep a separate mapping
> table in the kernel, and that includes all the
> tradeoffs of managing that table (not added in the
> patch, because they got too clunky, are the workqueues
> that would check the table and basically stop counters
> based on the TCP state so the flow could get expired).
>
> The userspace work gets simpler, but the work being
> done on the kernel side is much more difficult.
>
> Next is the simpler 'socket(net,ino)' action. The
> flows for this may look a bit more like:
>
> + recirc_id(0),eth(src=XXXX,dst=YYYY),eth_type(0x800), \
> | ip(src=a.b.c.d,dst=e.f.g.h,proto=6),tcp(sport=AA,dport=BB) \
> | actions:socket(NS,INO,recirc(0x1))
>
> This is much more compact, but that hides what is
> really needed to make this work. Using the single
> primitive would require that, during upcall processing,
> userspace is aware of the netns for the particular
> packet. When it is generating the flows, it can add
> a socket action to the earliest flow possible when it
> sees a socket that would be associated with the flow.
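A small question on the encoding here: I assume NS and INO end up as a
fixed-size argument on the action, something roughly like the struct
below (names invented by me, nothing like this is in the patch), and
that the kernel resolves and pins the struct sock from that pair at
flow installation time?

    #include <stdint.h>

    /* Purely hypothetical layout for the socket(NS, INO) action
     * argument, just to make sure I understand what userspace would
     * have to supply. */
    struct ovs_action_socket {
            uint64_t netns_inum;    /* inode number identifying the netns */
            uint64_t sock_ino;      /* inode of the TCP socket inside it */
    };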
>
> The kernel then only needs to validate at flow
> installation time that the socket is valid. However,
> userspace must do much more work. It will need
> to go back through the flows it generated and modify
> them (or insert new flows) to take this path. It
> will have to peer into each netns and find the
> corresponding socket inode, and program those. If
> it cannot find those details when the socket is
> first added, it will probably not be able to add
> these later (that's an implementation detail, but
> at least it will lead to a slow ramp up).
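To make that "peer into each netns" step concrete for myself: one way
(of several; sock_diag over netlink would probably be more robust) is
to read the procfs TCP table of a process living in the target netns
and match the packet's tuple against it to recover the socket inode.
A rough, purely illustrative sketch (the path is up to the caller, and
I'm only handling IPv4):

    #include <stdio.h>

    /* Scan a /proc/<pid>/net/tcp file and return the inode of the
     * socket whose local/remote address and port match, or 0 if none
     * does. Addresses and ports in that file are hex, in the kernel's
     * internal representation, so the caller must supply them in the
     * same form. */
    static unsigned long find_tcp_inode(const char *proc_net_tcp,
                                        unsigned int laddr, unsigned int lport,
                                        unsigned int raddr, unsigned int rport)
    {
            char line[512];
            unsigned long inode = 0;
            FILE *f = fopen(proc_net_tcp, "r");

            if (!f)
                    return 0;

            /* Skip the header line. */
            if (!fgets(line, sizeof(line), f)) {
                    fclose(f);
                    return 0;
            }

            while (fgets(line, sizeof(line), f)) {
                    unsigned int la, lp, ra, rp;
                    unsigned long ino;

                    if (sscanf(line,
                               "%*d: %x:%x %x:%x %*x %*x:%*x %*x:%*x %*x %*u %*d %lu",
                               &la, &lp, &ra, &rp, &ino) != 5)
                            continue;

                    if (la == laddr && lp == lport &&
                        ra == raddr && rp == rport) {
                            inode = ino;
                            break;
                    }
            }
            fclose(f);
            return inode;
    }

That per-namespace scan (or its netlink equivalent), repeated for every
socket we care about, is the part of the userspace cost I'm trying to
gauge, hence the questions further down.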
>
> From a kernel perspective it is much easier. We
> keep a ref to the socket while the action is in
> place, but that's all we need. The
> infrastructure needed is mostly all there, and
> there aren't many new things we need to change
> to make it work.
>
> So, it's much more work on the userspace side,
> but much less 'intrusive' for the kernel.
I'm quite intrigued by what userspace would have to do in this case.
IIUC, it would have to:
1) Track all sockets created in all affected namespaces
2) Associate a socket with an upcalled packet
3) Carry this information throughout recirculations until an output
action is detected and the destination socket found.
4) Traverse the recirculation chain backwards to find the first flow
(recirc_id(0)) and modify it to add the socket action.
Is that what you had in mind or am I completely off-track?
Thanks.
Adrián