netdev - Re: [RFC net-next 11/14] tun: run XDP program in tx path

Message-ID: <5eb791bf-1876-0b4b-f721-cb3c607f846c@gmail.com>
Date:   Fri, 20 Dec 2019 09:07:24 +0900
From:   Prashant Bhole <prashantbhole.linux@...il.com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        "Michael S . Tsirkin" <mst@...hat.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Jason Wang <jasowang@...hat.com>,
        David Ahern <dsahern@...il.com>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        John Fastabend <john.fastabend@...il.com>,
        Toshiaki Makita <toshiaki.makita1@...il.com>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        Andrii Nakryiko <andriin@...com>, netdev@...r.kernel.org,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>
Subject: Re: [RFC net-next 11/14] tun: run XDP program in tx path

Note: Resending my last response; it was not delivered to the netdev list
due to some problem.

On 12/19/19 7:15 PM, Toke Høiland-Jørgensen wrote:
> Prashant Bhole <prashantbhole.linux@...il.com> writes:
> 
>> On 12/19/19 3:19 AM, Alexei Starovoitov wrote:
>>> On Wed, Dec 18, 2019 at 12:48:59PM +0100, Toke Høiland-Jørgensen wrote:
>>>> Jesper Dangaard Brouer <jbrouer@...hat.com> writes:
>>>>
>>>>> On Wed, 18 Dec 2019 17:10:47 +0900
>>>>> Prashant Bhole <prashantbhole.linux@...il.com> wrote:
>>>>>
>>>>>> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
>>>>>> +			 struct xdp_frame *frame)
>>>>>> +{
>>>>>> +	struct bpf_prog *xdp_prog;
>>>>>> +	struct tun_page tpage;
>>>>>> +	struct xdp_buff xdp;
>>>>>> +	u32 act = XDP_PASS;
>>>>>> +	int flush = 0;
>>>>>> +
>>>>>> +	xdp_prog = rcu_dereference(tun->xdp_tx_prog);
>>>>>> +	if (xdp_prog) {
>>>>>> +		xdp.data_hard_start = frame->data - frame->headroom;
>>>>>> +		xdp.data = frame->data;
>>>>>> +		xdp.data_end = xdp.data + frame->len;
>>>>>> +		xdp.data_meta = xdp.data - frame->metasize;
>>>>>
>>>>> You have not configured xdp.rxq, thus a BPF-prog accessing this will crash.
>>>>>
>>>>> For an XDP TX hook, I want us to provide/give BPF-prog access to some
>>>>> more information about e.g. the current tx-queue length, or TC-q number.
>>>>>
>>>>> Question to Daniel or Alexei, can we do this and still keep BPF_PROG_TYPE_XDP?
>>>>> Or is it better to introduce a new BPF prog type (enum bpf_prog_type)
>>>>> for XDP TX-hook ?
>>>>
>>>> I think a new program type would make the most sense. If/when we
>>>> introduce an XDP TX hook[0], it should have different semantics than the
>>>> regular XDP hook. I view the XDP TX hook as a hook that executes as the
>>>> very last thing before packets leave the interface. It should have
>>>> access to different context data as you say, but also I don't think it
>>>> makes sense to have XDP_TX and XDP_REDIRECT in an XDP_TX hook. And we
>>>> may also want to have a "throttle" return code; or maybe that could be
>>>> done via a helper?
>>>>
>>>> In any case, I don't think this "emulated RX hook on the other end of a
>>>> virtual device" model that this series introduces is the right semantics
>>>> for an XDP TX hook. I can see what you're trying to do, and for virtual
>>>> point-to-point links I think it may make sense to emulate the RX hook of
>>>> the "other end" on TX. However, from a UAPI perspective, I don't think
>>>> we should be calling this a TX hook; logically, it's still an RX hook
>>>> on the receive end.
>>>>
>>>> If you guys are up for evolving this design into a "proper" TX hook (as
>>>> outlined above and in [0]), that would be awesome, of course. But not
>>>> sure what constraints you have on your original problem? Do you
>>>> specifically need the "emulated RX hook for unmodified XDP programs"
>>>> semantics, or could your problem be solved with a TX hook with different
>>>> semantics?
>>>
>>> I agree with above.
>>> It looks more like existing BPF_PROG_TYPE_XDP, but attached to egress
>>> of veth/tap interface. I think only attachment point makes a difference.
>>> Maybe use expected_attach_type?
>>> Then there will be no need to create new program type.
>>> BPF_PROG_TYPE_XDP will be able to access different fields depending
>>> on expected_attach_type. Like tx-queue length that Jesper is suggesting
>>> will be available only in such case and not for all BPF_PROG_TYPE_XDP progs.
>>> It can be reduced too. Like if there is no xdp.rxq concept for egress side
>>> of virtual device the access to that field can be disallowed by the verifier.
>>> Could you also call it XDP_EGRESS instead of XDP_TX?
>>> I would like to reserve XDP_TX name to what Toke describes as XDP_TX.
>>>
>>
>> From the discussion over this set, it makes sense to have a new type of
>> program. As David suggested, it will make way for changes specific
>> to the egress path.
>> On the other hand, XDP offload with virtio-net implementation is based
>> on "emulated RX hook". How about having this special behavior with
>> expected_attach_type?
> 
> Another thought I had re: this was that for these "special" virtual
> point-to-point devices we could extend the API to have an ATTACH_PEER
> flag. So if you have a pair of veth devices (veth0,veth1) connecting to
> each other, you could do either of:
> 
> bpf_set_link_xdp_fd(ifindex(veth0), prog_fd, 0);
> bpf_set_link_xdp_fd(ifindex(veth1), prog_fd, ATTACH_PEER);
> 
> to attach to veth0, and:
> 
> bpf_set_link_xdp_fd(ifindex(veth1), prog_fd, 0);
> bpf_set_link_xdp_fd(ifindex(veth0), prog_fd, ATTACH_PEER);
> 
> to attach to veth1.
> 
> This would allow attaching to a device without having the "other end"
> visible, and keep the "XDP runs on RX" semantics clear to userspace.
> Internally in the kernel we could then turn the "attach to peer"
> operation for a tun device into the "emulate on TX" thing you're already
> doing?
> 
> Would this work for your use case, do you think?
> 
> -Toke
> 

This is nice from a UAPI point of view. It may work for the veth case but
not for XDP offload with virtio-net. Please see the sequence below, where
a user program in the guest wants to offload a program to tun.

* The user program loads the program by setting the offload flag and
   ifindex:

- map_offload_ops->alloc()
   virtio-net sends the map info to qemu, which creates the map on the host.
- prog_offload_ops->setup()
   New callback, used to keep a copy of the unmodified program, which
   contains the original map fds. We replace those map fds with fds from
   the host side and check the program for unsupported helper calls.
- prog_offload_ops->finalize()
   Sends the program to qemu, which loads it on the host.

* The user program calls bpf_set_link_xdp_fd()
   virtio-net handles XDP_PROG_SETUP_HW by sending a request to qemu.
   Qemu then attaches the host-side program fd to the respective tun
   device by calling bpf_set_link_xdp_fd().
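
To illustrate the map fd replacement done in the setup() step, here is a
rough user-space sketch. It is not the code from this series. The toy_*
names and the guest-to-host fd table are made up for illustration; only the
opcode value (BPF_LD | BPF_IMM | BPF_DW) and the BPF_PSEUDO_MAP_FD marker
mirror the real instruction encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy mirror of the eBPF instruction layout (fields as in struct bpf_insn). */
struct toy_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

#define TOY_LD_IMM64      0x18 /* BPF_LD | BPF_IMM | BPF_DW */
#define TOY_PSEUDO_MAP_FD 1    /* same value as BPF_PSEUDO_MAP_FD */

/* Hypothetical table: guest map fd -> fd of the map created on the host. */
struct toy_fd_pair {
	int guest_fd;
	int host_fd;
};

/*
 * Rewrite map fds in a copy of the program. Returns the number of
 * rewritten instructions, or -1 if a guest fd has no host counterpart.
 */
static int toy_rewrite_map_fds(struct toy_insn *insns, size_t cnt,
			       const struct toy_fd_pair *tbl, size_t tbl_cnt)
{
	int rewritten = 0;
	size_t i, j;

	for (i = 0; i < cnt; i++) {
		if (insns[i].code != TOY_LD_IMM64)
			continue;
		if (insns[i].src_reg == TOY_PSEUDO_MAP_FD) {
			for (j = 0; j < tbl_cnt; j++)
				if (tbl[j].guest_fd == insns[i].imm)
					break;
			if (j == tbl_cnt)
				return -1;
			insns[i].imm = tbl[j].host_fd;
			rewritten++;
		}
		i++; /* ld_imm64 occupies two instruction slots */
	}
	return rewritten;
}
```

The rewrite is done on a copy so that the guest still keeps the unmodified
program with its original fds.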

In the above sequence there is no chance to use the proposed ATTACH_PEER flag.

Here is how other ideas from this discussion can be used:

- Introduce BPF_PROG_TYPE_TX_XDP for the egress path, with the special
   behavior of emulating RX XDP selected via expected_attach_type.
- The emulated RX XDP will be restrictive in terms of helper calls.
- In the offload case, qemu will load the program as BPF_PROG_TYPE_TX_XDP
   and set expected_attach_type.
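
As a rough illustration of what branching on expected_attach_type could
look like in a driver's tx path, here is a user-space toy model. Everything
in it is hypothetical: the toy_* names do not exist in the kernel, and the
policy (a plain egress program may only pass or drop, while the emulated RX
type keeps the RX-style verdicts) is only an assumption based on the
comments earlier in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these exist in the kernel. */
enum toy_attach_type { TOY_XDP_EGRESS, TOY_XDP_EMULATED_RX };
enum toy_verdict { TOY_DROP, TOY_PASS, TOY_TX, TOY_REDIRECT };

struct toy_prog {
	enum toy_attach_type expected_attach_type;
	enum toy_verdict (*run)(const void *data, size_t len);
};

/* Assumed policy: a pure egress program may only pass or drop. */
static bool toy_verdict_allowed(enum toy_attach_type type, enum toy_verdict v)
{
	if (type == TOY_XDP_EMULATED_RX)
		return true;
	return v == TOY_DROP || v == TOY_PASS;
}

/* What a driver's tx path might do with the program's return value. */
static enum toy_verdict toy_run_on_tx(const struct toy_prog *prog,
				      const void *data, size_t len)
{
	enum toy_verdict v = prog->run(data, len);

	/* Treat a verdict the attach type does not permit as a drop. */
	if (!toy_verdict_allowed(prog->expected_attach_type, v))
		return TOY_DROP;
	return v;
}

/* Sample program used only to exercise the sketch. */
static enum toy_verdict toy_prog_always_tx(const void *data, size_t len)
{
	(void)data;
	(void)len;
	return TOY_TX;
}
```

In this model the driver does look at the attach type, but only to
interpret the return value; the program itself is run the same way in
both cases.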

What is your opinion about it? Does the driver implementing egress
XDP need to know what kind of XDP program it is running?

Thanks,
Prashant