[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZTfza8hC_79X10F8@google.com>
Date: Tue, 24 Oct 2023 09:40:11 -0700
From: Stanislav Fomichev <sdf@...gle.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org, martin.lau@...ux.dev,
razor@...ckwall.org, ast@...nel.org, andrii@...nel.org,
john.fastabend@...il.com, toke@...nel.org, kuba@...nel.org, andrew@...n.ch
Subject: Re: [PATCH bpf-next v3 1/7] netkit, bpf: Add bpf programmable net device
On 10/23, Daniel Borkmann wrote:
> This work adds a new, minimal BPF-programmable device called "netkit"
> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
> core idea is that BPF programs are executed within the drivers xmit routine
> and therefore e.g. in case of containers/Pods moving BPF processing closer
> to the source.
>
> One of the goals was that in case of Pod egress traffic, this allows to
> move BPF programs from hostns tcx ingress into the device itself, providing
> earlier drop or forward mechanisms, for example, if the BPF program
> determines that the skb must be sent out of the node, then a redirect to
> the physical device can take place directly without going through per-CPU
> backlog queue. This helps to shift processing for such traffic from softirq
> to process context, leading to better scheduling decisions/performance (see
> measurements in the slides).
>
> In this initial version, the netkit device ships as a pair, but we plan to
> extend this further so it can also operate in single device mode. The pair
> comes with a primary and a peer device. Only the primary device, typically
> residing in hostns, can manage BPF programs for itself and its peer. The
> peer device is designated for containers/Pods and cannot attach/detach
> BPF programs. Upon the device creation, the user can set the default policy
> to 'forward' or 'drop' for the case when no BPF program is attached.
>
> Additionally, the device can be operated in L3 (default) or L2 mode. The
> management of BPF programs is done via bpf_mprog, so that multi-attach is
> supported right from the beginning with similar API and dependency controls
> as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
> so that existing programs can be easily migrated.
>
> Going forward, we plan to use netkit devices in Cilium as the main device
> type for connecting Pods. They will be operated in L3 mode in order to
> simplify a Pod's neighbor management and the peer will operate in default
> drop mode, so that no traffic is leaving between the time when a Pod is
> brought up by the CNI plugin and programs attached by the agent.
> Additionally, the programs we attach via tcx on the physical devices are
> using bpf_redirect_peer() for inbound traffic into netkit device, hence the
> latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
> bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
> directly. Also, BIG TCP is supported on netkit device. For the follow-up
> work in single device mode, we plan to convert Cilium's cilium_host/_net
> devices into a single one.
>
> An extensive test suite for checking device operations and the BPF program
> and link management API comes as BPF selftests in this series.
>
> Co-developed-by: Nikolay Aleksandrov <razor@...ckwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@...ckwall.org>
> Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
> Link: https://github.com/borkmann/iproute2/tree/pr/netkit
> Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
> ---
> MAINTAINERS | 9 +
> drivers/net/Kconfig | 9 +
> drivers/net/Makefile | 1 +
> drivers/net/netkit.c | 934 +++++++++++++++++++++++++++++++++
> include/net/netkit.h | 38 ++
> include/uapi/linux/bpf.h | 14 +
> include/uapi/linux/if_link.h | 24 +
> kernel/bpf/syscall.c | 30 +-
> tools/include/uapi/linux/bpf.h | 14 +
> 9 files changed, 1068 insertions(+), 5 deletions(-)
> create mode 100644 drivers/net/netkit.c
> create mode 100644 include/net/netkit.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ed33b9df8b3d..43be6197e655 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3795,6 +3795,15 @@ L: bpf@...r.kernel.org
> S: Odd Fixes
> K: (?:\b|_)bpf(?:\b|_)
>
> +BPF [NETKIT] (BPF-programmable network device)
> +M: Daniel Borkmann <daniel@...earbox.net>
> +M: Nikolay Aleksandrov <razor@...ckwall.org>
> +L: bpf@...r.kernel.org
> +L: netdev@...r.kernel.org
> +S: Supported
> +F: drivers/net/netkit.c
> +F: include/net/netkit.h
> +
> BPF [NETWORKING] (struct_ops, reuseport)
> M: Martin KaFai Lau <martin.lau@...ux.dev>
> L: bpf@...r.kernel.org
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 44eeb5d61ba9..af0da4bb429b 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -448,6 +448,15 @@ config NLMON
> diagnostics, etc. This is mostly intended for developers or support
> to debug netlink issues. If unsure, say N.
>
> +config NETKIT
> + bool "BPF-programmable network device"
> + depends on BPF_SYSCALL
> + help
> + The netkit device is a virtual networking device where BPF programs
> + can be attached to the device(s) transmission routine in order to
> + implement the driver's internal logic. The device can be configured
> + to operate in L3 or L2 mode. If unsure, say N.
> +
> config NET_VRF
> tristate "Virtual Routing and Forwarding (Lite)"
> depends on IP_MULTIPLE_TABLES
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 8a83db32509d..7cab36f94782 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
> obj-$(CONFIG_NET) += loopback.o
> obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
> obj-$(CONFIG_NETCONSOLE) += netconsole.o
> +obj-$(CONFIG_NETKIT) += netkit.o
> obj-y += phy/
> obj-y += pse-pd/
> obj-y += mdio/
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> new file mode 100644
> index 000000000000..faf756702aa1
> --- /dev/null
> +++ b/drivers/net/netkit.c
> @@ -0,0 +1,934 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/filter.h>
> +#include <linux/netfilter_netdev.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/netkit.h>
> +#include <net/dst.h>
> +#include <net/tcx.h>
> +
> +#define DRV_NAME "netkit"
> +
> +struct netkit {
> + /* Needed in fast-path */
> + struct net_device __rcu *peer;
> + struct bpf_mprog_entry __rcu *active;
> + enum netkit_action policy;
> + struct bpf_mprog_bundle bundle;
> +
> + /* Needed in slow-path */
> + enum netkit_mode mode;
> + bool primary;
> + u32 headroom;
> +};
> +
> +struct netkit_link {
> + struct bpf_link link;
> + struct net_device *dev;
> + u32 location;
> +};
> +
> +static __always_inline int
> +netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
> + enum netkit_action ret)
> +{
> + const struct bpf_mprog_fp *fp;
> + const struct bpf_prog *prog;
> +
> + bpf_mprog_foreach_prog(entry, fp, prog) {
> + bpf_compute_data_pointers(skb);
> + ret = bpf_prog_run(prog, skb);
> + if (ret != NETKIT_NEXT)
> + break;
> + }
> + return ret;
> +}
> +
> +static void netkit_prep_forward(struct sk_buff *skb, bool xnet)
> +{
> + skb_scrub_packet(skb, xnet);
> + skb->priority = 0;
> + nf_skip_egress(skb, true);
> +}
> +
> +static struct netkit *netkit_priv(const struct net_device *dev)
> +{
> + return netdev_priv(dev);
> +}
> +
> +static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct netkit *nk = netkit_priv(dev);
> + enum netkit_action ret = READ_ONCE(nk->policy);
> + netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
> + const struct bpf_mprog_entry *entry;
> + struct net_device *peer;
> +
> + rcu_read_lock();
> + peer = rcu_dereference(nk->peer);
> + if (unlikely(!peer || !(peer->flags & IFF_UP) ||
> + !pskb_may_pull(skb, ETH_HLEN) ||
> + skb_orphan_frags(skb, GFP_ATOMIC)))
> + goto drop;
> + netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
> + skb->dev = peer;
> + entry = rcu_dereference(nk->active);
> + if (entry)
> + ret = netkit_run(entry, skb, ret);
> + switch (ret) {
> + case NETKIT_NEXT:
> + case NETKIT_PASS:
> + skb->protocol = eth_type_trans(skb, skb->dev);
> + skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> + __netif_rx(skb);
> + break;
> + case NETKIT_REDIRECT:
> + skb_do_redirect(skb);
> + break;
> + case NETKIT_DROP:
> + default:
> +drop:
> + kfree_skb(skb);
> + dev_core_stats_tx_dropped_inc(dev);
> + ret_dev = NET_XMIT_DROP;
> + break;
> + }
> + rcu_read_unlock();
> + return ret_dev;
> +}
> +
> +static int netkit_open(struct net_device *dev)
> +{
> + struct netkit *nk = netkit_priv(dev);
> + struct net_device *peer = rtnl_dereference(nk->peer);
> +
> + if (!peer)
> + return -ENOTCONN;
> + if (peer->flags & IFF_UP) {
> + netif_carrier_on(dev);
> + netif_carrier_on(peer);
> + }
> + return 0;
> +}
> +
> +static int netkit_close(struct net_device *dev)
> +{
> + struct netkit *nk = netkit_priv(dev);
> + struct net_device *peer = rtnl_dereference(nk->peer);
> +
> + netif_carrier_off(dev);
> + if (peer)
> + netif_carrier_off(peer);
> + return 0;
> +}
> +
> +static int netkit_get_iflink(const struct net_device *dev)
> +{
> + struct netkit *nk = netkit_priv(dev);
> + struct net_device *peer;
> + int iflink = 0;
> +
> + rcu_read_lock();
> + peer = rcu_dereference(nk->peer);
> + if (peer)
> + iflink = peer->ifindex;
> + rcu_read_unlock();
> + return iflink;
> +}
> +
> +static void netkit_set_multicast(struct net_device *dev)
> +{
> + /* Nothing to do, we receive whatever gets pushed to us! */
> +}
> +
> +static void netkit_set_headroom(struct net_device *dev, int headroom)
> +{
> + struct netkit *nk = netkit_priv(dev), *nk2;
> + struct net_device *peer;
> +
> + if (headroom < 0)
> + headroom = NET_SKB_PAD;
> +
> + rcu_read_lock();
> + peer = rcu_dereference(nk->peer);
> + if (unlikely(!peer))
> + goto out;
> +
> + nk2 = netkit_priv(peer);
> + nk->headroom = headroom;
> + headroom = max(nk->headroom, nk2->headroom);
> +
> + peer->needed_headroom = headroom;
> + dev->needed_headroom = headroom;
> +out:
> + rcu_read_unlock();
> +}
> +
> +static struct net_device *netkit_peer_dev(struct net_device *dev)
> +{
> + return rcu_dereference(netkit_priv(dev)->peer);
> +}
> +
> +static const struct net_device_ops netkit_netdev_ops = {
> + .ndo_open = netkit_open,
> + .ndo_stop = netkit_close,
> + .ndo_start_xmit = netkit_xmit,
> + .ndo_set_rx_mode = netkit_set_multicast,
> + .ndo_set_rx_headroom = netkit_set_headroom,
> + .ndo_get_iflink = netkit_get_iflink,
> + .ndo_get_peer_dev = netkit_peer_dev,
> + .ndo_features_check = passthru_features_check,
> +};
> +
> +static void netkit_get_drvinfo(struct net_device *dev,
> + struct ethtool_drvinfo *info)
> +{
> + strscpy(info->driver, DRV_NAME, sizeof(info->driver));
> +}
> +
> +static const struct ethtool_ops netkit_ethtool_ops = {
> + .get_drvinfo = netkit_get_drvinfo,
> +};
> +
> +static void netkit_setup(struct net_device *dev)
> +{
> + static const netdev_features_t netkit_features_hw_vlan =
> + NETIF_F_HW_VLAN_CTAG_TX |
> + NETIF_F_HW_VLAN_CTAG_RX |
> + NETIF_F_HW_VLAN_STAG_TX |
> + NETIF_F_HW_VLAN_STAG_RX;
> + static const netdev_features_t netkit_features =
> + netkit_features_hw_vlan |
> + NETIF_F_SG |
> + NETIF_F_FRAGLIST |
> + NETIF_F_HW_CSUM |
> + NETIF_F_RXCSUM |
> + NETIF_F_SCTP_CRC |
> + NETIF_F_HIGHDMA |
> + NETIF_F_GSO_SOFTWARE |
> + NETIF_F_GSO_ENCAP_ALL;
> +
> + ether_setup(dev);
> + dev->max_mtu = ETH_MAX_MTU;
> +
> + dev->flags |= IFF_NOARP;
> + dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> + dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> + dev->priv_flags |= IFF_PHONY_HEADROOM;
> + dev->priv_flags |= IFF_NO_QUEUE;
> +
> + dev->ethtool_ops = &netkit_ethtool_ops;
> + dev->netdev_ops = &netkit_netdev_ops;
> +
> + dev->features |= netkit_features | NETIF_F_LLTX;
> + dev->hw_features = netkit_features;
> + dev->hw_enc_features = netkit_features;
> + dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
> + dev->vlan_features = dev->features & ~netkit_features_hw_vlan;
> +
> + dev->needs_free_netdev = true;
> +
> + netif_set_tso_max_size(dev, GSO_MAX_SIZE);
> +}
> +
> +static struct net *netkit_get_link_net(const struct net_device *dev)
> +{
> + struct netkit *nk = netkit_priv(dev);
> + struct net_device *peer = rtnl_dereference(nk->peer);
> +
> + return peer ? dev_net(peer) : dev_net(dev);
> +}
> +
> +static int netkit_check_policy(int policy, struct nlattr *tb,
> + struct netlink_ext_ack *extack)
> +{
> + switch (policy) {
> + case NETKIT_PASS:
> + case NETKIT_DROP:
> + return 0;
> + default:
> + NL_SET_ERR_MSG_ATTR(extack, tb,
> + "Provided default xmit policy not supported");
> + return -EINVAL;
> + }
> +}
> +
> +static int netkit_check_mode(int mode, struct nlattr *tb,
> + struct netlink_ext_ack *extack)
> +{
> + switch (mode) {
> + case NETKIT_L2:
> + case NETKIT_L3:
> + return 0;
> + default:
> + NL_SET_ERR_MSG_ATTR(extack, tb,
> + "Provided device mode can only be L2 or L3");
> + return -EINVAL;
> + }
> +}
> +
> +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
> + struct netlink_ext_ack *extack)
> +{
> + struct nlattr *attr = tb[IFLA_ADDRESS];
> +
> + if (!attr)
> + return 0;
> + NL_SET_ERR_MSG_ATTR(extack, attr,
> + "Setting Ethernet address is not supported");
> + return -EOPNOTSUPP;
> +}
> +
> +static struct rtnl_link_ops netkit_link_ops;
> +
> +static int netkit_new_link(struct net *src_net, struct net_device *dev,
> + struct nlattr *tb[], struct nlattr *data[],
> + struct netlink_ext_ack *extack)
> +{
> + struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
> + enum netkit_action default_prim = NETKIT_PASS;
> + enum netkit_action default_peer = NETKIT_PASS;
> + enum netkit_mode mode = NETKIT_L3;
> + unsigned char ifname_assign_type;
> + struct ifinfomsg *ifmp = NULL;
> + struct net_device *peer;
> + char ifname[IFNAMSIZ];
> + struct netkit *nk;
> + struct net *net;
> + int err;
> +
> + if (data) {
> + if (data[IFLA_NETKIT_MODE]) {
> + attr = data[IFLA_NETKIT_MODE];
> + mode = nla_get_u32(attr);
> + err = netkit_check_mode(mode, attr, extack);
> + if (err < 0)
> + return err;
> + }
> + if (data[IFLA_NETKIT_PEER_INFO]) {
> + attr = data[IFLA_NETKIT_PEER_INFO];
> + ifmp = nla_data(attr);
> + err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
> + if (err < 0)
> + return err;
> + err = netkit_validate(peer_tb, NULL, extack);
> + if (err < 0)
> + return err;
> + tbp = peer_tb;
> + }
> + if (data[IFLA_NETKIT_POLICY]) {
> + attr = data[IFLA_NETKIT_POLICY];
> + default_prim = nla_get_u32(attr);
> + err = netkit_check_policy(default_prim, attr, extack);
> + if (err < 0)
> + return err;
> + }
> + if (data[IFLA_NETKIT_PEER_POLICY]) {
> + attr = data[IFLA_NETKIT_PEER_POLICY];
> + default_peer = nla_get_u32(attr);
> + err = netkit_check_policy(default_peer, attr, extack);
> + if (err < 0)
> + return err;
> + }
> + }
> +
> + if (ifmp && tbp[IFLA_IFNAME]) {
> + nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
> + ifname_assign_type = NET_NAME_USER;
> + } else {
> + strscpy(ifname, "nk%d", IFNAMSIZ);
> + ifname_assign_type = NET_NAME_ENUM;
> + }
> +
> + net = rtnl_link_get_net(src_net, tbp);
> + if (IS_ERR(net))
> + return PTR_ERR(net);
> +
> + peer = rtnl_create_link(net, ifname, ifname_assign_type,
> + &netkit_link_ops, tbp, extack);
> + if (IS_ERR(peer)) {
> + put_net(net);
> + return PTR_ERR(peer);
> + }
> +
> + netif_inherit_tso_max(peer, dev);
> +
> + if (mode == NETKIT_L2)
> + eth_hw_addr_random(peer);
> + if (ifmp && dev->ifindex)
> + peer->ifindex = ifmp->ifi_index;
> +
> + nk = netkit_priv(peer);
> + nk->primary = false;
> + nk->policy = default_peer;
> + nk->mode = mode;
> + bpf_mprog_bundle_init(&nk->bundle);
> + RCU_INIT_POINTER(nk->active, NULL);
> + RCU_INIT_POINTER(nk->peer, NULL);
> +
> + err = register_netdevice(peer);
> + put_net(net);
> + if (err < 0)
> + goto err_register_peer;
> + netif_carrier_off(peer);
> + if (mode == NETKIT_L2)
> + dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
> +
> + err = rtnl_configure_link(peer, NULL, 0, NULL);
> + if (err < 0)
> + goto err_configure_peer;
> +
> + if (mode == NETKIT_L2)
> + eth_hw_addr_random(dev);
> + if (tb[IFLA_IFNAME])
> + nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
> + else
> + strscpy(dev->name, "nk%d", IFNAMSIZ);
> +
> + nk = netkit_priv(dev);
> + nk->primary = true;
> + nk->policy = default_prim;
> + nk->mode = mode;
> + bpf_mprog_bundle_init(&nk->bundle);
> + RCU_INIT_POINTER(nk->active, NULL);
> + RCU_INIT_POINTER(nk->peer, NULL);
> +
> + err = register_netdevice(dev);
> + if (err < 0)
> + goto err_configure_peer;
> + netif_carrier_off(dev);
> + if (mode == NETKIT_L2)
> + dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
> +
> + rcu_assign_pointer(netkit_priv(dev)->peer, peer);
> + rcu_assign_pointer(netkit_priv(peer)->peer, dev);
> + return 0;
> +err_configure_peer:
> + unregister_netdevice(peer);
> + return err;
> +err_register_peer:
> + free_netdev(peer);
> + return err;
> +}
> +
> +static struct bpf_mprog_entry *netkit_entry_fetch(struct net_device *dev,
> + bool bundle_fallback)
> +{
> + struct netkit *nk = netkit_priv(dev);
> + struct bpf_mprog_entry *entry;
> +
> + ASSERT_RTNL();
> + entry = rcu_dereference_rtnl(nk->active);
> + if (entry)
> + return entry;
> + if (bundle_fallback)
> + return &nk->bundle.a;
> + return NULL;
> +}
> +
> +static void netkit_entry_update(struct net_device *dev,
> + struct bpf_mprog_entry *entry)
> +{
> + struct netkit *nk = netkit_priv(dev);
> +
> + ASSERT_RTNL();
> + rcu_assign_pointer(nk->active, entry);
> +}
> +
> +static void netkit_entry_sync(void)
> +{
> + synchronize_rcu();
> +}
> +
> +static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 which)
> +{
> + struct net_device *dev;
> + struct netkit *nk;
> +
> + ASSERT_RTNL();
> +
> + switch (which) {
> + case BPF_NETKIT_PRIMARY:
> + case BPF_NETKIT_PEER:
> + break;
> + default:
> + return ERR_PTR(-EINVAL);
> + }
> +
> + dev = __dev_get_by_index(net, ifindex);
> + if (!dev)
> + return ERR_PTR(-ENODEV);
> + if (dev->netdev_ops != &netkit_netdev_ops)
> + return ERR_PTR(-ENXIO);
> +
> + nk = netkit_priv(dev);
> + if (!nk->primary)
> + return ERR_PTR(-EACCES);
> + if (which == BPF_NETKIT_PEER) {
> + dev = rcu_dereference_rtnl(nk->peer);
> + if (!dev)
> + return ERR_PTR(-ENODEV);
> + }
> + return dev;
> +}
> +
> +int netkit_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> + struct bpf_mprog_entry *entry, *entry_new;
> + struct bpf_prog *replace_prog = NULL;
> + struct net_device *dev;
> + int ret;
> +
> + rtnl_lock();
> + dev = netkit_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
> + attr->attach_type);
> + if (IS_ERR(dev)) {
> + ret = PTR_ERR(dev);
> + goto out;
> + }
> + entry = netkit_entry_fetch(dev, true);
> + if (attr->attach_flags & BPF_F_REPLACE) {
> + replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
> + prog->type);
> + if (IS_ERR(replace_prog)) {
> + ret = PTR_ERR(replace_prog);
> + replace_prog = NULL;
> + goto out;
> + }
> + }
> + ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
> + attr->attach_flags, attr->relative_fd,
> + attr->expected_revision);
> + if (!ret) {
> + if (entry != entry_new) {
> + netkit_entry_update(dev, entry_new);
> + netkit_entry_sync();
> + }
> + bpf_mprog_commit(entry);
> + }
> +out:
> + if (replace_prog)
> + bpf_prog_put(replace_prog);
> + rtnl_unlock();
> + return ret;
> +}
> +
> +int netkit_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> + struct bpf_mprog_entry *entry, *entry_new;
> + struct net_device *dev;
> + int ret;
> +
> + rtnl_lock();
> + dev = netkit_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
> + attr->attach_type);
> + if (IS_ERR(dev)) {
> + ret = PTR_ERR(dev);
> + goto out;
> + }
> + entry = netkit_entry_fetch(dev, false);
> + if (!entry) {
> + ret = -ENOENT;
> + goto out;
> + }
> + ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
> + attr->relative_fd, attr->expected_revision);
> + if (!ret) {
> + if (!bpf_mprog_total(entry_new))
> + entry_new = NULL;
> + netkit_entry_update(dev, entry_new);
> + netkit_entry_sync();
> + bpf_mprog_commit(entry);
> + }
> +out:
> + rtnl_unlock();
> + return ret;
> +}
> +
> +int netkit_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> + struct net_device *dev;
> + int ret;
> +
> + rtnl_lock();
> + dev = netkit_dev_fetch(current->nsproxy->net_ns,
> + attr->query.target_ifindex,
> + attr->query.attach_type);
> + if (IS_ERR(dev)) {
> + ret = PTR_ERR(dev);
> + goto out;
> + }
> + ret = bpf_mprog_query(attr, uattr, netkit_entry_fetch(dev, false));
> +out:
> + rtnl_unlock();
> + return ret;
> +}
> +
> +static struct netkit_link *netkit_link(const struct bpf_link *link)
> +{
> + return container_of(link, struct netkit_link, link);
> +}
> +
> +static int netkit_link_prog_attach(struct bpf_link *link, u32 flags,
> + u32 id_or_fd, u64 revision)
> +{
> + struct netkit_link *nkl = netkit_link(link);
> + struct bpf_mprog_entry *entry, *entry_new;
> + struct net_device *dev = nkl->dev;
> + int ret;
> +
> + ASSERT_RTNL();
> + entry = netkit_entry_fetch(dev, true);
> + ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
> + id_or_fd, revision);
> + if (!ret) {
> + if (entry != entry_new) {
> + netkit_entry_update(dev, entry_new);
> + netkit_entry_sync();
> + }
> + bpf_mprog_commit(entry);
> + }
> + return ret;
> +}
> +
> +static void netkit_link_release(struct bpf_link *link)
> +{
> + struct netkit_link *nkl = netkit_link(link);
> + struct bpf_mprog_entry *entry, *entry_new;
> + struct net_device *dev;
> + int ret = 0;
> +
> + rtnl_lock();
> + dev = nkl->dev;
> + if (!dev)
> + goto out;
> + entry = netkit_entry_fetch(dev, false);
> + if (!entry) {
> + ret = -ENOENT;
> + goto out;
> + }
> + ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
> + if (!ret) {
> + if (!bpf_mprog_total(entry_new))
> + entry_new = NULL;
> + netkit_entry_update(dev, entry_new);
> + netkit_entry_sync();
> + bpf_mprog_commit(entry);
> + nkl->dev = NULL;
> + }
> +out:
> + WARN_ON_ONCE(ret);
> + rtnl_unlock();
> +}
> +
> +static int netkit_link_update(struct bpf_link *link, struct bpf_prog *nprog,
> + struct bpf_prog *oprog)
> +{
> + struct netkit_link *nkl = netkit_link(link);
> + struct bpf_mprog_entry *entry, *entry_new;
> + struct net_device *dev;
> + int ret = 0;
> +
> + rtnl_lock();
> + dev = nkl->dev;
> + if (!dev) {
> + ret = -ENOLINK;
> + goto out;
> + }
> + if (oprog && link->prog != oprog) {
> + ret = -EPERM;
> + goto out;
> + }
> + oprog = link->prog;
> + if (oprog == nprog) {
> + bpf_prog_put(nprog);
> + goto out;
> + }
> + entry = netkit_entry_fetch(dev, false);
> + if (!entry) {
> + ret = -ENOENT;
> + goto out;
> + }
> + ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
> + BPF_F_REPLACE | BPF_F_ID,
> + link->prog->aux->id, 0);
> + if (!ret) {
> + WARN_ON_ONCE(entry != entry_new);
> + oprog = xchg(&link->prog, nprog);
> + bpf_prog_put(oprog);
> + bpf_mprog_commit(entry);
> + }
> +out:
> + rtnl_unlock();
> + return ret;
> +}
> +
> +static void netkit_link_dealloc(struct bpf_link *link)
> +{
> + kfree(netkit_link(link));
> +}
> +
> +static void netkit_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
> +{
> + const struct netkit_link *nkl = netkit_link(link);
> + u32 ifindex = 0;
> +
> + rtnl_lock();
> + if (nkl->dev)
> + ifindex = nkl->dev->ifindex;
> + rtnl_unlock();
> +
> + seq_printf(seq, "ifindex:\t%u\n", ifindex);
> + seq_printf(seq, "attach_type:\t%u (%s)\n",
> + nkl->location,
> + nkl->location == BPF_NETKIT_PRIMARY ? "primary" : "peer");
> +}
> +
> +static int netkit_link_fill_info(const struct bpf_link *link,
> + struct bpf_link_info *info)
> +{
> + const struct netkit_link *nkl = netkit_link(link);
> + u32 ifindex = 0;
> +
> + rtnl_lock();
> + if (nkl->dev)
> + ifindex = nkl->dev->ifindex;
> + rtnl_unlock();
> +
> + info->netkit.ifindex = ifindex;
> + info->netkit.attach_type = nkl->location;
> + return 0;
> +}
> +
> +static int netkit_link_detach(struct bpf_link *link)
> +{
> + netkit_link_release(link);
> + return 0;
> +}
> +
> +static const struct bpf_link_ops netkit_link_lops = {
> + .release = netkit_link_release,
> + .detach = netkit_link_detach,
> + .dealloc = netkit_link_dealloc,
> + .update_prog = netkit_link_update,
> + .show_fdinfo = netkit_link_fdinfo,
> + .fill_link_info = netkit_link_fill_info,
> +};
> +
> +static int netkit_link_init(struct netkit_link *nkl,
> + struct bpf_link_primer *link_primer,
> + const union bpf_attr *attr,
> + struct net_device *dev,
> + struct bpf_prog *prog)
> +{
> + bpf_link_init(&nkl->link, BPF_LINK_TYPE_NETKIT,
> + &netkit_link_lops, prog);
> + nkl->location = attr->link_create.attach_type;
> + nkl->dev = dev;
> + return bpf_link_prime(&nkl->link, link_primer);
> +}
> +
> +int netkit_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> + struct bpf_link_primer link_primer;
> + struct netkit_link *nkl;
> + struct net_device *dev;
> + int ret;
> +
> + rtnl_lock();
> + dev = netkit_dev_fetch(current->nsproxy->net_ns,
> + attr->link_create.target_ifindex,
> + attr->link_create.attach_type);
> + if (IS_ERR(dev)) {
> + ret = PTR_ERR(dev);
> + goto out;
> + }
> + nkl = kzalloc(sizeof(*nkl), GFP_KERNEL_ACCOUNT);
> + if (!nkl) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + ret = netkit_link_init(nkl, &link_primer, attr, dev, prog);
> + if (ret) {
> + kfree(nkl);
> + goto out;
> + }
The series looks great! FWIW:
Acked-by: Stanislav Fomichev <sdf@...gle.com>
One small question I have is:
We now (and after introduction of tcx) seem to store non-refcounted
dev pointers in the bpf_link(s). Is it guaranteed that the dev will
outlive the link?
> + ret = netkit_link_prog_attach(&nkl->link,
> + attr->link_create.flags,
> + attr->link_create.netkit.relative_fd,
> + attr->link_create.netkit.expected_revision);
> + if (ret) {
> + nkl->dev = NULL;
> + bpf_link_cleanup(&link_primer);
> + goto out;
What happens to nkl here? Do we leak it?
> + }
> + ret = bpf_link_settle(&link_primer);
> +out:
> + rtnl_unlock();
> + return ret;
> +}
Powered by blists - more mailing lists