netdev - Re: [PATCHv3 bpf-next 2/4] xdp: extend xdp_redirect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87lfa3phj6.fsf@toke.dk>
Date:   Wed, 31 Mar 2021 15:41:17 +0200
From:   Toke Høiland-Jørgensen <toke@...hat.com>
To:     Hangbin Liu <liuhangbin@...il.com>, bpf@...r.kernel.org
Cc:     netdev@...r.kernel.org, Jiri Benc <jbenc@...hat.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Eelco Chaudron <echaudro@...hat.com>, ast@...nel.org,
        Daniel Borkmann <daniel@...earbox.net>,
        Lorenzo Bianconi <lorenzo.bianconi@...hat.com>,
        David Ahern <dsahern@...il.com>,
        Andrii Nakryiko <andrii.nakryiko@...il.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Björn Töpel 
        <bjorn.topel@...il.com>, Hangbin Liu <liuhangbin@...il.com>
Subject: Re: [PATCHv3 bpf-next 2/4] xdp: extend xdp_redirect_map with
 broadcast support

Hangbin Liu <liuhangbin@...il.com> writes:

> This patch add two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to extend
> xdp_redirect_map for broadcast support.
>
> Keep the general data path in net/core/filter.c and the native data
> path in kernel/bpf/devmap.c so we can use direct calls to get better
> performace.
>
> Here is the performance result by using xdp_redirect_{map, map_multi} in
> sample/bpf and send pkts via pktgen cmd:
> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
>
> There are some drop back as we need to loop the map and get each interface.
>
> Version          | Test                                | Generic | Native
> 5.12 rc2         | redirect_map        i40e->i40e      |    2.0M |  9.8M
> 5.12 rc2         | redirect_map        i40e->veth      |    1.8M | 12.0M
> 5.12 rc2 + patch | redirect_map        i40e->i40e      |    2.0M |  9.6M
> 5.12 rc2 + patch | redirect_map        i40e->veth      |    1.7M | 12.0M
> 5.12 rc2 + patch | redirect_map multi  i40e->i40e      |    1.6M |  7.8M
> 5.12 rc2 + patch | redirect_map multi  i40e->veth      |    1.4M |  9.3M
> 5.12 rc2 + patch | redirect_map multi  i40e->mlx4+veth |    1.0M |  3.4M
>
> Signed-off-by: Hangbin Liu <liuhangbin@...il.com>
> ---
> v3:
> a) Rebase the code on Björn's "bpf, xdp: Restructure redirect actions".
>    - Add struct bpf_map *map back to struct bpf_redirect_info as we need
>      it for multicast.
>    - Add bpf_clear_redirect_map() back for devmap.c
>    - Add devmap_lookup_elem() as we need it in general path.
> b) remove tmp_key in devmap_get_next_obj()
>
> v2: Fix flag renaming issue in v1
> ---
>  include/linux/bpf.h            |  22 ++++++
>  include/linux/filter.h         |  14 +++-
>  include/net/xdp.h              |   1 +
>  include/uapi/linux/bpf.h       |  17 ++++-
>  kernel/bpf/devmap.c            | 127 +++++++++++++++++++++++++++++++++
>  net/core/filter.c              |  92 +++++++++++++++++++++++-
>  net/core/xdp.c                 |  29 ++++++++
>  tools/include/uapi/linux/bpf.h |  17 ++++-
>  8 files changed, 310 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a25730eaa148..5dacb1a45a03 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1456,11 +1456,15 @@ struct sk_buff;
>  struct bpf_dtab_netdev;
>  struct bpf_cpu_map_entry;
>  
> +struct bpf_dtab_netdev *devmap_lookup_elem(struct bpf_map *map, u32 key);
>  void __dev_flush(void);
>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  		    struct net_device *dev_rx);
> +bool dst_dev_is_ingress(struct bpf_dtab_netdev *obj, int ifindex);
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, bool exclude_ingress);
>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
>  			     struct bpf_prog *xdp_prog);
>  bool dev_map_can_have_prog(struct bpf_map *map);
> @@ -1595,6 +1599,11 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags)
>  	return -EOPNOTSUPP;
>  }
>  
> +static inline struct net_device *devmap_lookup_elem(struct bpf_map *map, u32 key)
> +{
> +	return NULL;
> +}
> +
>  static inline bool dev_map_can_have_prog(struct bpf_map *map)
>  {
>  	return false;
> @@ -1622,6 +1631,19 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  	return 0;
>  }
>  
> +static inline
> +bool dst_dev_is_ingress(struct bpf_dtab_netdev *obj, int ifindex)
> +{
> +	return false;
> +}
> +
> +static inline
> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +			  struct bpf_map *map, bool exclude_ingress)
> +{
> +	return 0;
> +}
> +
>  struct sk_buff;
>  
>  static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index b2b85b2cad8e..434170aafd0d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -646,6 +646,7 @@ struct bpf_redirect_info {
>  	u32 flags;
>  	u32 tgt_index;
>  	void *tgt_value;
> +	struct bpf_map *map;
>  	u32 map_id;
>  	enum bpf_map_type map_type;
>  	u32 kern_flags;
> @@ -1479,11 +1480,11 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  
>  	/* Lower bits of the flags are used as return code on lookup failure */
> -	if (unlikely(flags > XDP_TX))
> +	if (unlikely(flags & ~(BPF_F_ACTION_MASK | BPF_F_REDIR_MASK)))
>  		return XDP_ABORTED;
>  
>  	ri->tgt_value = lookup_elem(map, ifindex);
> -	if (unlikely(!ri->tgt_value)) {
> +	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
>  		/* If the lookup fails we want to clear out the state in the
>  		 * redirect_info struct completely, so that if an eBPF program
>  		 * performs multiple lookups, the last one always takes
> @@ -1491,13 +1492,20 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>  		 */
>  		ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
>  		ri->map_type = BPF_MAP_TYPE_UNSPEC;
> -		return flags;
> +		return flags & BPF_F_ACTION_MASK;
>  	}
>  
>  	ri->tgt_index = ifindex;
>  	ri->map_id = map->id;
>  	ri->map_type = map->map_type;
>  
> +	if ((map->map_type == BPF_MAP_TYPE_DEVMAP ||
> +	     map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) &&
> +	    (flags & BPF_F_BROADCAST)) {
> +		ri->flags = flags;

This, combined with this:

[...]

> +	if (ri->flags & BPF_F_BROADCAST) {
> +		map = READ_ONCE(ri->map);
> +		WRITE_ONCE(ri->map, NULL);
> +	}
> +
>  	switch (map_type) {
>  	case BPF_MAP_TYPE_DEVMAP:
>  		fallthrough;
>  	case BPF_MAP_TYPE_DEVMAP_HASH:
> -		err = dev_map_enqueue(fwd, xdp, dev);
> +		if (ri->flags & BPF_F_BROADCAST)
> +			err = dev_map_enqueue_multi(xdp, dev, map,
> +						    ri->flags & BPF_F_EXCLUDE_INGRESS);
> +		else
> +			err = dev_map_enqueue(fwd, xdp, dev);

Means that (since the flags value is never cleared again) once someone
has done a broadcast redirect, that's all they'll ever get until the
next reboot ;)

Also, the bpf_clear_redirect_map() call has no effect since the call to
dev_map_enqueue_multi() only checks the flags and not the value of the
map pointer before deciding which enqueue function to call.

To fix both of these, how about changing the logic so that:

- __bpf_xdp_redirect_map() sets the map pointer if the broadcast flag is
  set (and clears it if the flag isn't set!)

- xdp_do_redirect() does the READ_ONCE/WRITE_ONCE on ri->map
  unconditionally and then dispatches to dev_map_enqueue_multi() if the
  read resulted in a non-NULL pointer

Also, it should be invalid to set the broadcast flag with a map type
other than a devmap; __bpf_xdp_redirect_map() should check that.

And finally, with the changes above, you no longer need the broadcast
flag in do_redirect() at all, so you could just have a bool
ri->exclude_ingress that is set in the helper and pass that directly to
dev_map_enqueue_multi().

A selftest that validates that the above works as it's supposed to might
be nice as well (i.e., one that broadcasts and does a regular redirect
after one another)

-Toke