Message-ID: <mcwr6vpkmbmkrixfnwyxuph6ziy5r2of67vvhgkvpiwxezfdtu@mitfrbd52rty>
Date: Fri, 26 Sep 2025 18:10:23 -0700
From: Jordan Rife <jordan@...fe.io>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, kuba@...nel.org, 
	davem@...emloft.net, razor@...ckwall.org, pabeni@...hat.com, willemb@...gle.com, 
	sdf@...ichev.me, john.fastabend@...il.com, martin.lau@...nel.org, 
	maciej.fijalkowski@...el.com, magnus.karlsson@...el.com, David Wei <dw@...idwei.uk>
Subject: Re: [PATCH net-next 14/20] netkit: Add single device mode for netkit

On Fri, Sep 19, 2025 at 11:31:47PM +0200, Daniel Borkmann wrote:
> Add a single device mode for netkit instead of netkit pairs. The primary
> target for the paired devices is to connect network namespaces, of course,
> and support has been implemented in projects like Cilium [0]. For the rxq
> binding the plan is to support two main scenarios related to single device
> mode:
> 
> * For the use-case of io_uring zero-copy, the control plane can either
>   set up a netkit pair where the peer device can perform rxq binding which
>   is then tied to the lifetime of the peer device, or the control plane
>   can use a regular netkit pair to connect the hostns to a Pod/container
>   and dynamically add/remove rxq bindings through a single device without
>   having to interrupt the device pair. In the case of io_uring, the memory
>   pool is used as skb non-linear pages, and thus the skb will go its way
>   through the regular stack into netkit. Things like the netkit policy when
>   no BPF is attached or skb scrubbing etc apply as-is in case the paired
>   devices are used, or if the backend memory is tied to the single device
>   and traffic goes through a paired device.
> 
> * For the use-case of AF_XDP, the control plane needs to use netkit in the
>   single device mode. The single device mode currently enforces only a
>   pass policy when no BPF is attached, and does not yet support BPF link
>   attachments for AF_XDP. skbs sent to that device get dropped at the
>   moment. Given AF_XDP operates at a lower layer of the stack tying this
>   to the netkit pair did not make sense. In future, the plan is to allow
>   BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
>   application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
>   to push selected egress traffic up to the single netkit device to the
>   local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
>   single netkit into the AF_XDP application (e.g. DHCP replies). Also,
>   the control-plane can dynamically add/remove rxq bindings for the single
>   netkit device without having to interrupt (e.g. down/up cycle) the main
>   netkit pair for the Pod which has traffic going in and out.

This seems very cool. I'm curious, in single device mode, how would
traffic originating in the host ns make its way into a pod hosting a
QEMU VM using an AF_XDP backend? How would redirection work between two
such VMs on the same host?

> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
> Co-developed-by: David Wei <dw@...idwei.uk>
> Signed-off-by: David Wei <dw@...idwei.uk>
> Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
> ---
>  drivers/net/netkit.c         | 108 ++++++++++++++++++++++-------------
>  include/uapi/linux/if_link.h |   6 ++
>  2 files changed, 74 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 492be60f2e70..ceb1393ee599 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -25,6 +25,7 @@ struct netkit {
>  
>  	/* Needed in slow-path */
>  	enum netkit_mode mode;
> +	enum netkit_pairing pair;
>  	bool primary;
>  	u32 headroom;
>  };
> @@ -133,6 +134,10 @@ static int netkit_open(struct net_device *dev)
>  	struct netkit *nk = netkit_priv(dev);
>  	struct net_device *peer = rtnl_dereference(nk->peer);
>  
> +	if (nk->pair == NETKIT_DEVICE_SINGLE) {
> +		netif_carrier_on(dev);
> +		return 0;
> +	}
>  	if (!peer)
>  		return -ENOTCONN;
>  	if (peer->flags & IFF_UP) {
> @@ -333,6 +338,7 @@ static int netkit_new_link(struct net_device *dev,
>  	enum netkit_scrub scrub_prim = NETKIT_SCRUB_DEFAULT;
>  	enum netkit_scrub scrub_peer = NETKIT_SCRUB_DEFAULT;
>  	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp, *attr;
> +	enum netkit_pairing pair = NETKIT_DEVICE_PAIR;
>  	enum netkit_action policy_prim = NETKIT_PASS;
>  	enum netkit_action policy_peer = NETKIT_PASS;
>  	struct nlattr **data = params->data;
> @@ -341,7 +347,7 @@ static int netkit_new_link(struct net_device *dev,
>  	struct nlattr **tb = params->tb;
>  	u16 headroom = 0, tailroom = 0;
>  	struct ifinfomsg *ifmp = NULL;
> -	struct net_device *peer;
> +	struct net_device *peer = NULL;
>  	char ifname[IFNAMSIZ];
>  	struct netkit *nk;
>  	int err;
> @@ -378,6 +384,8 @@ static int netkit_new_link(struct net_device *dev,
>  			headroom = nla_get_u16(data[IFLA_NETKIT_HEADROOM]);
>  		if (data[IFLA_NETKIT_TAILROOM])
>  			tailroom = nla_get_u16(data[IFLA_NETKIT_TAILROOM]);
> +		if (data[IFLA_NETKIT_PAIRING])
> +			pair = nla_get_u32(data[IFLA_NETKIT_PAIRING]);
>  	}
>  
>  	if (ifmp && tbp[IFLA_IFNAME]) {
> @@ -390,45 +398,49 @@ static int netkit_new_link(struct net_device *dev,
>  	if (mode != NETKIT_L2 &&
>  	    (tb[IFLA_ADDRESS] || tbp[IFLA_ADDRESS]))
>  		return -EOPNOTSUPP;
> +	if (pair != NETKIT_DEVICE_PAIR &&

nit: IMO this would be a little clearer without the inverted logic:

if (pair == NETKIT_DEVICE_SINGLE &&

> +	    (tb != tbp ||
> +	     tb[IFLA_NETKIT_PEER_POLICY] ||
> +	     tb[IFLA_NETKIT_PEER_SCRUB] ||
> +	     policy_prim != NETKIT_PASS))
> +		return -EOPNOTSUPP;
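
To illustrate the nit above: the two spellings should be interchangeable here,
since enum netkit_pairing has only two values and the nla_policy entry for
IFLA_NETKIT_PAIRING caps the attribute at NETKIT_DEVICE_SINGLE. Below is a
minimal userspace sketch, not kernel code. struct nk_req and the helper names
are hypothetical stand-ins for the state that netkit_new_link() inspects.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the uapi enum added by this patch. */
enum netkit_pairing {
	NETKIT_DEVICE_PAIR,
	NETKIT_DEVICE_SINGLE,
};

/* Hypothetical stand-in for the state the new check looks at. */
struct nk_req {
	enum netkit_pairing pair;
	bool peer_info;		/* models tb != tbp */
	bool peer_policy;	/* models the IFLA_NETKIT_PEER_POLICY test */
	bool peer_scrub;	/* models the IFLA_NETKIT_PEER_SCRUB test */
	bool policy_pass;	/* models policy_prim == NETKIT_PASS */
};

static bool peer_or_policy_requested(const struct nk_req *r)
{
	return r->peer_info || r->peer_policy || r->peer_scrub ||
	       !r->policy_pass;
}

/* Condition as written in the patch. */
static int check_inverted(const struct nk_req *r)
{
	if (r->pair != NETKIT_DEVICE_PAIR && peer_or_policy_requested(r))
		return -EOPNOTSUPP;
	return 0;
}

/* Condition as suggested in the nit. */
static int check_direct(const struct nk_req *r)
{
	if (r->pair == NETKIT_DEVICE_SINGLE && peer_or_policy_requested(r))
		return -EOPNOTSUPP;
	return 0;
}

/* Returns true if both forms agree for every input combination. */
static bool checks_agree(void)
{
	for (int pair = NETKIT_DEVICE_PAIR; pair <= NETKIT_DEVICE_SINGLE; pair++) {
		for (int bits = 0; bits < 16; bits++) {
			struct nk_req r = {
				.pair = (enum netkit_pairing)pair,
				.peer_info = bits & 1,
				.peer_policy = bits & 2,
				.peer_scrub = bits & 4,
				.policy_pass = bits & 8,
			};

			if (check_inverted(&r) != check_direct(&r))
				return false;
		}
	}
	return true;
}
```

The equivalence only holds while the enum has exactly two members. If a third
pairing mode were ever added, the "!=" form would reject it too, while the
"==" form would not.
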
>  
> -	peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
> -				&netkit_link_ops, tbp, extack);
> -	if (IS_ERR(peer))
> -		return PTR_ERR(peer);
> -
> -	netif_inherit_tso_max(peer, dev);
> -	if (headroom) {
> -		peer->needed_headroom = headroom;
> -		dev->needed_headroom = headroom;
> -	}
> -	if (tailroom) {
> -		peer->needed_tailroom = tailroom;
> -		dev->needed_tailroom = tailroom;
> -	}
> -
> -	if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
> -		eth_hw_addr_random(peer);
> -	if (ifmp && dev->ifindex)
> -		peer->ifindex = ifmp->ifi_index;
> -
> -	nk = netkit_priv(peer);
> -	nk->primary = false;
> -	nk->policy = policy_peer;
> -	nk->scrub = scrub_peer;
> -	nk->mode = mode;
> -	nk->headroom = headroom;
> -	bpf_mprog_bundle_init(&nk->bundle);
> +	if (pair == NETKIT_DEVICE_PAIR) {
> +		peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
> +					&netkit_link_ops, tbp, extack);
> +		if (IS_ERR(peer))
> +			return PTR_ERR(peer);
> +
> +		netif_inherit_tso_max(peer, dev);
> +		if (headroom)
> +			peer->needed_headroom = headroom;
> +		if (tailroom)
> +			peer->needed_tailroom = tailroom;
> +		if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
> +			eth_hw_addr_random(peer);
> +		if (ifmp && dev->ifindex)
> +			peer->ifindex = ifmp->ifi_index;
>  
> -	err = register_netdevice(peer);
> -	if (err < 0)
> -		goto err_register_peer;
> -	netif_carrier_off(peer);
> -	if (mode == NETKIT_L2)
> -		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
> +		nk = netkit_priv(peer);
> +		nk->primary = false;
> +		nk->policy = policy_peer;
> +		nk->scrub = scrub_peer;
> +		nk->mode = mode;
> +		nk->pair = pair;
> +		nk->headroom = headroom;
> +		bpf_mprog_bundle_init(&nk->bundle);
> +
> +		err = register_netdevice(peer);
> +		if (err < 0)
> +			goto err_register_peer;
> +		netif_carrier_off(peer);
> +		if (mode == NETKIT_L2)
> +			dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>  
> -	err = rtnl_configure_link(peer, NULL, 0, NULL);
> -	if (err < 0)
> -		goto err_configure_peer;
> +		err = rtnl_configure_link(peer, NULL, 0, NULL);
> +		if (err < 0)
> +			goto err_configure_peer;
> +	}
>  
>  	if (mode == NETKIT_L2 && !tb[IFLA_ADDRESS])
>  		eth_hw_addr_random(dev);
> @@ -436,12 +448,17 @@ static int netkit_new_link(struct net_device *dev,
>  		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>  	else
>  		strscpy(dev->name, "nk%d", IFNAMSIZ);
> +	if (headroom)
> +		dev->needed_headroom = headroom;
> +	if (tailroom)
> +		dev->needed_tailroom = tailroom;
>  
>  	nk = netkit_priv(dev);
>  	nk->primary = true;
>  	nk->policy = policy_prim;
>  	nk->scrub = scrub_prim;
>  	nk->mode = mode;
> +	nk->pair = pair;
>  	nk->headroom = headroom;
>  	bpf_mprog_bundle_init(&nk->bundle);
>  
> @@ -453,10 +470,12 @@ static int netkit_new_link(struct net_device *dev,
>  		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>  
>  	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
> -	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
> +	if (peer)
> +		rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>  	return 0;
>  err_configure_peer:
> -	unregister_netdevice(peer);
> +	if (peer)
> +		unregister_netdevice(peer);
>  	return err;
>  err_register_peer:
>  	free_netdev(peer);
> @@ -516,6 +535,8 @@ static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 whi
>  	nk = netkit_priv(dev);
>  	if (!nk->primary)
>  		return ERR_PTR(-EACCES);
> +	if (nk->pair == NETKIT_DEVICE_SINGLE)
> +		return ERR_PTR(-EOPNOTSUPP);
>  	if (which == BPF_NETKIT_PEER) {
>  		dev = rcu_dereference_rtnl(nk->peer);
>  		if (!dev)
> @@ -877,6 +898,7 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
>  		{ IFLA_NETKIT_PEER_INFO,  "peer info" },
>  		{ IFLA_NETKIT_HEADROOM,   "headroom" },
>  		{ IFLA_NETKIT_TAILROOM,   "tailroom" },
> +		{ IFLA_NETKIT_PAIRING,    "pairing" },
>  	};
>  
>  	if (!nk->primary) {
> @@ -896,9 +918,11 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
>  	}
>  
>  	if (data[IFLA_NETKIT_POLICY]) {
> +		err = -EOPNOTSUPP;
>  		attr = data[IFLA_NETKIT_POLICY];
>  		policy = nla_get_u32(attr);
> -		err = netkit_check_policy(policy, attr, extack);
> +		if (nk->pair == NETKIT_DEVICE_PAIR)
> +			err = netkit_check_policy(policy, attr, extack);
>  		if (err)
>  			return err;
>  		WRITE_ONCE(nk->policy, policy);
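
If I'm reading the hunk above correctly, a single device rejects any
IFLA_NETKIT_POLICY change, including one that sets NETKIT_PASS again, because
err keeps its initial -EOPNOTSUPP. Here is a small userspace model of that
control flow. struct nk_model and the model_* helpers are hypothetical, and
model_check_policy() is only an assumed approximation of
netkit_check_policy() (accept pass/drop, otherwise -EINVAL).

```c
#include <errno.h>

/* Mirrors the uapi enum added by this patch. */
enum netkit_pairing {
	NETKIT_DEVICE_PAIR,
	NETKIT_DEVICE_SINGLE,
};

/* Assumed to match the uapi enum netkit_action values used here. */
enum {
	NETKIT_PASS = 0,
	NETKIT_DROP = 2,
};

/* Hypothetical stand-in for the fields of struct netkit used below. */
struct nk_model {
	enum netkit_pairing pair;
	int policy;
};

/* Assumed approximation of netkit_check_policy(). */
static int model_check_policy(int policy)
{
	switch (policy) {
	case NETKIT_PASS:
	case NETKIT_DROP:
		return 0;
	default:
		return -EINVAL;
	}
}

/* Models the IFLA_NETKIT_POLICY branch of netkit_change_link(). */
static int model_change_policy(struct nk_model *nk, int policy)
{
	int err = -EOPNOTSUPP;

	/* Only paired devices get as far as validating the policy. */
	if (nk->pair == NETKIT_DEVICE_PAIR)
		err = model_check_policy(policy);
	if (err)
		return err;
	nk->policy = policy;
	return 0;
}
```

In this model, a single device returns -EOPNOTSUPP before the policy is
written, while a paired device behaves as it did before the patch.
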
> @@ -929,6 +953,7 @@ static size_t netkit_get_size(const struct net_device *dev)
>  	       nla_total_size(sizeof(u8))  + /* IFLA_NETKIT_PRIMARY */
>  	       nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_HEADROOM */
>  	       nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_TAILROOM */
> +	       nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_PAIRING */
>  	       0;
>  }
>  
> @@ -949,6 +974,8 @@ static int netkit_fill_info(struct sk_buff *skb, const struct net_device *dev)
>  		return -EMSGSIZE;
>  	if (nla_put_u16(skb, IFLA_NETKIT_TAILROOM, dev->needed_tailroom))
>  		return -EMSGSIZE;
> +	if (nla_put_u32(skb, IFLA_NETKIT_PAIRING, nk->pair))
> +		return -EMSGSIZE;
>  
>  	if (peer) {
>  		nk = netkit_priv(peer);
> @@ -970,6 +997,7 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
>  	[IFLA_NETKIT_TAILROOM]		= { .type = NLA_U16 },
>  	[IFLA_NETKIT_SCRUB]		= NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
>  	[IFLA_NETKIT_PEER_SCRUB]	= NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
> +	[IFLA_NETKIT_PAIRING]		= NLA_POLICY_MAX(NLA_U32, NETKIT_DEVICE_SINGLE),
>  	[IFLA_NETKIT_PRIMARY]		= { .type = NLA_REJECT,
>  					    .reject_message = "Primary attribute is read-only" },
>  };
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 45f56c9f95d9..4a2f781f3cca 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -1294,6 +1294,11 @@ enum netkit_mode {
>  	NETKIT_L3,
>  };
>  
> +enum netkit_pairing {
> +	NETKIT_DEVICE_PAIR,
> +	NETKIT_DEVICE_SINGLE,
> +};
> +
>  /* NETKIT_SCRUB_NONE leaves clearing skb->{mark,priority} up to
>   * the BPF program if attached. This also means the latter can
>   * consume the two fields if they were populated earlier.
> @@ -1318,6 +1323,7 @@ enum {
>  	IFLA_NETKIT_PEER_SCRUB,
>  	IFLA_NETKIT_HEADROOM,
>  	IFLA_NETKIT_TAILROOM,
> +	IFLA_NETKIT_PAIRING,
>  	__IFLA_NETKIT_MAX,
>  };
>  #define IFLA_NETKIT_MAX	(__IFLA_NETKIT_MAX - 1)
> -- 
> 2.43.0
>

Reviewed-by: Jordan Rife <jordan@...fe.io>
