linux-kernel - Re: [PATCH net,stable v2] vhost: fix skb leak in handle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7f6f050d-381d-c123-8cc2-16423e205fb0@redhat.com>
Date:   Thu, 30 Nov 2017 10:46:17 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>, wexu@...hat.com
Cc:     virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, mjrosato@...ux.vnet.ibm.com
Subject: Re: [PATCH net,stable v2] vhost: fix skb leak in handle_rx()



On 2017年11月29日 23:31, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 09:23:24AM -0500,wexu@...hat.com  wrote:
>> From: Wei Xu<wexu@...hat.com>
>>
>> Matthew found a roughly 40% tcp throughput regression with commit
>> c67df11f(vhost_net: try batch dequing from skb array) as discussed
>> in the following thread:
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html
>>
>> Eventually we figured out that it was a skb leak in handle_rx()
>> when sending packets to the VM. This usually happens when a guest
>> can not drain out vq as fast as vhost fills in, afterwards it sets
>> off the traffic jam and leaks skb(s) which occurs as no headcount
>> to send on the vq from vhost side.
>>
>> This can be avoided by making sure we have got enough headcount
>> before actually consuming a skb from the batched rx array while
>> transmitting, which is simply done by moving checking the zero
>> headcount a bit ahead.
>>
>> Also strengthen the small possibility of leak in case of recvmsg()
>> fails by freeing the skb.
>>
>> Signed-off-by: Wei Xu<wexu@...hat.com>
>> Reported-by: Matthew Rosato<mjrosato@...ux.vnet.ibm.com>
>> ---
>>   drivers/vhost/net.c | 23 +++++++++++++----------
>>   1 file changed, 13 insertions(+), 10 deletions(-)
>>
>> v2:
>> - add Matthew as the reporter, thanks matthew.
>> - moving zero headcount check ahead instead of defer consuming skb
>>    due to jason and mst's comment.
>> - add freeing skb in favor of recvmsg() fails.
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 8d626d7..e302e08 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -778,16 +778,6 @@ static void handle_rx(struct vhost_net *net)
>>   		/* On error, stop handling until the next kick. */
>>   		if (unlikely(headcount < 0))
>>   			goto out;
>> -		if (nvq->rx_array)
>> -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>> -		/* On overrun, truncate and discard */
>> -		if (unlikely(headcount > UIO_MAXIOV)) {
>> -			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>> -			err = sock->ops->recvmsg(sock, &msg,
>> -						 1, MSG_DONTWAIT | MSG_TRUNC);
>> -			pr_debug("Discarded rx packet: len %zd\n", sock_len);
>> -			continue;
>> -		}
>>   		/* OK, now we need to know about added descriptors. */
>>   		if (!headcount) {
>>   			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>> @@ -800,6 +790,18 @@ static void handle_rx(struct vhost_net *net)
>>   			 * they refilled. */
>>   			goto out;
>>   		}
>> +		if (nvq->rx_array)
>> +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>> +		/* On overrun, truncate and discard */
>> +		if (unlikely(headcount > UIO_MAXIOV)) {
>> +			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>> +			err = sock->ops->recvmsg(sock, &msg,
>> +						 1, MSG_DONTWAIT | MSG_TRUNC);
>> +			if (unlikely(err != 1))
> Why 1? How is receiving 1 byte special or even possible?
> Also, I wouldn't put an unlikely here. It's all error handling code anyway.
>
>> +				kfree_skb((struct sk_buff *)msg.msg_control);
> You do not need a cast here.
> Also, is it really safe to refer to msg_control here?
> I'd rather keep a copy of the skb pointer and use it than assume
> caller did not change it. But also see below.
>
>> +			pr_debug("Discarded rx packet: len %zd\n", sock_len);
>> +			continue;
>> +		}
>>   		/* We don't need to be notified again. */
>>   		iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>>   		fixup = msg.msg_iter;
>> @@ -818,6 +820,7 @@ static void handle_rx(struct vhost_net *net)
>>   			pr_debug("Discarded rx packet: "
>>   				 " len %d, expected %zd\n", err, sock_len);
>>   			vhost_discard_vq_desc(vq, headcount);
>> +			kfree_skb((struct sk_buff *)msg.msg_control);
> You do not need a cast here.
>
> Also, we have
>
>          ret = tun_put_user(tun, tfile, skb, to);
>          if (unlikely(ret < 0))
>                  kfree_skb(skb);
>          else
>                  consume_skb(skb);
>
>          return ret;
>
> So it looks like recvmsg actually always consumes the skb.
> So I was wrong when I said you need to kfree it after
> recv msg, and your original patch was good.
>
> Jason, what do you think?
>

tun_recvmsg() has the following check:

static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t 
total_len,
                int flags)
{
     struct tun_file *tfile = container_of(sock, struct tun_file, socket);
     struct tun_struct *tun = __tun_get(tfile);
     int ret;

     if (!tun)
         return -EBADFD;

     if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
         ret = -EINVAL;
         goto out;
     }

And tun_do_read() has:

     if (!iov_iter_count(to))
         return 0;

So I think we need free skb in those cases.

Thanks