linux-kernel - Re: [PATCH net-next v2] tun: support NAPI for packets received from batched XDP buffs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c687e1d8-e36a-8f23-342a-22b2a1efb372@gmail.com>
Date:   Sun, 27 Feb 2022 20:06:03 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Harold Huang <baymaxhuang@...il.com>, netdev@...r.kernel.org
Cc:     jasowang@...hat.com, pabeni@...hat.com,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        John Fastabend <john.fastabend@...il.com>,
        open list <linux-kernel@...r.kernel.org>,
        "open list:XDP (eXpress Data Path)" <bpf@...r.kernel.org>,
        edumazet@...gle.com
Subject: Re: [PATCH net-next v2] tun: support NAPI for packets received from
 batched XDP buffs


On 2/25/22 01:02, Harold Huang wrote:
> In tun, NAPI is supported and we can also use NAPI in the path of
> batched XDP buffs to accelerate packet processing. What is more, after
> we use NAPI, GRO is also supported. The iperf shows that the throughput of
> single stream could be improved from 4.5Gbps to 9.2Gbps. Additionally, 9.2
> Gbps nearly reachs the line speed of the phy nic and there is still about
> 15% idle cpu core remaining on the vhost thread.
>
> Test topology:
>
> [iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]
>
> Iperf stream:
>
> Before:
> ...
> [  5]   5.00-6.00   sec   558 MBytes  4.68 Gbits/sec    0   1.50 MBytes
> [  5]   6.00-7.00   sec   556 MBytes  4.67 Gbits/sec    1   1.35 MBytes
> [  5]   7.00-8.00   sec   556 MBytes  4.67 Gbits/sec    2   1.18 MBytes
> [  5]   8.00-9.00   sec   559 MBytes  4.69 Gbits/sec    0   1.48 MBytes
> [  5]   9.00-10.00  sec   556 MBytes  4.67 Gbits/sec    1   1.33 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  5.39 GBytes  4.63 Gbits/sec   72          sender
> [  5]   0.00-10.04  sec  5.39 GBytes  4.61 Gbits/sec               receiver
>
> After:
> ...
> [  5]   5.00-6.00   sec  1.07 GBytes  9.19 Gbits/sec    0   1.55 MBytes
> [  5]   6.00-7.00   sec  1.08 GBytes  9.30 Gbits/sec    0   1.63 MBytes
> [  5]   7.00-8.00   sec  1.08 GBytes  9.25 Gbits/sec    0   1.72 MBytes
> [  5]   8.00-9.00   sec  1.08 GBytes  9.25 Gbits/sec   77   1.31 MBytes
> [  5]   9.00-10.00  sec  1.08 GBytes  9.24 Gbits/sec    0   1.48 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  10.8 GBytes  9.28 Gbits/sec  166          sender
> [  5]   0.00-10.04  sec  10.8 GBytes  9.24 Gbits/sec               receiver
> ....
>
> Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com
> Signed-off-by: Harold Huang <baymaxhuang@...il.com>
> ---
> v1 -> v2
>   - fix commit messages
>   - add queued flag to avoid void unnecessary napi suggested by Jason
>
>   drivers/net/tun.c | 20 ++++++++++++++++----
>   1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index fed85447701a..c7d8b7c821d8 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2379,7 +2379,7 @@ static void tun_put_page(struct tun_page *tpage)
>   }
>   
>   static int tun_xdp_one(struct tun_struct *tun,
> -		       struct tun_file *tfile,
> +		       struct tun_file *tfile, int *queued,
>   		       struct xdp_buff *xdp, int *flush,
>   		       struct tun_page *tpage)
>   {
> @@ -2388,6 +2388,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	struct virtio_net_hdr *gso = &hdr->gso;
>   	struct bpf_prog *xdp_prog;
>   	struct sk_buff *skb = NULL;
> +	struct sk_buff_head *queue;
>   	u32 rxhash = 0, act;
>   	int buflen = hdr->buflen;
>   	int err = 0;
> @@ -2464,7 +2465,15 @@ static int tun_xdp_one(struct tun_struct *tun,
>   	    !tfile->detached)
>   		rxhash = __skb_get_hash_symmetric(skb);
>   
> -	netif_receive_skb(skb);
> +	if (tfile->napi_enabled) {
> +		queue = &tfile->sk.sk_write_queue;
> +		spin_lock(&queue->lock);
> +		__skb_queue_tail(queue, skb);
> +		spin_unlock(&queue->lock);
> +		(*queued)++;
> +	} else {
> +		netif_receive_skb(skb);
> +	}
>   
>   	/* No need to disable preemption here since this function is
>   	 * always called with bh disabled
> @@ -2492,7 +2501,7 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>   	if (ctl && (ctl->type == TUN_MSG_PTR)) {
>   		struct tun_page tpage;
>   		int n = ctl->num;
> -		int flush = 0;
> +		int flush = 0, queued = 0;
>   
>   		memset(&tpage, 0, sizeof(tpage));
>   
> @@ -2501,12 +2510,15 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>   
>   		for (i = 0; i < n; i++) {
>   			xdp = &((struct xdp_buff *)ctl->ptr)[i];
> -			tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
> +			tun_xdp_one(tun, tfile, &queued, xdp, &flush, &tpage);


How big n can be ?

BTW I could not find where m->msg_controllen was checked in tun_sendmsg().

struct tun_msg_ctl *ctl = m->msg_control;

if (ctl && (ctl->type == TUN_MSG_PTR)) {

     int n = ctl->num;  // can be set to values in [0..65535]

     for (i = 0; i < n; i++) {

         xdp = &((struct xdp_buff *)ctl->ptr)[i];


I really do not understand how we prevent malicious user space from 
crashing the kernel.



>   		}
>   
>   		if (flush)
>   			xdp_do_flush();
>   
> +		if (tfile->napi_enabled && queued > 0)
> +			napi_schedule(&tfile->napi);
> +
>   		rcu_read_unlock();
>   		local_bh_enable();
>