linux-kernel - Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

Message-ID: <42a63594-a4b9-1b7b-2566-544853880317@redhat.com>
Date:   Fri, 25 Jan 2019 17:21:31 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        aarcange@...hat.com
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel
 virtual address


On 2019/1/25 11:03 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
>> It was noticed that the copy_user() friends that were used to access
>> virtqueue metadata tend to be very expensive for dataplane
>> implementations like vhost since they involve lots of software checks,
>> speculation barriers, and hardware feature toggling (e.g. SMAP). The
>> extra cost is more obvious when transferring small packets since
>> the time spent on metadata access becomes more significant.
>>
>> This patch tries to eliminate those overheads by accessing the metadata
>> through kernel virtual addresses set up by vmap(). To allow the pages to
>> be migrated, instead of pinning them through GUP, we use MMU notifiers to
>> invalidate vmaps and re-establish vmaps during each round of metadata
>> prefetching if necessary. For devices that don't use metadata
>> prefetching, the memory accessors fall back to the normal copy_user()
>> implementation gracefully. The invalidation is synchronized with the
>> datapath through the vq mutex, and in order to avoid holding the vq mutex
>> during range checking, the MMU notifier is torn down when trying to
>> modify vq metadata.
>>
>> Another thing is that the kernel lacks an efficient solution for tracking
>> pages dirtied through vmap(); this will lead to issues if vhost is using
>> file-backed memory, which needs writeback. This patch solves this issue by
>> just skipping any vma that is file backed and falling back to the normal
>> copy_user() friends. This might introduce some overhead for file-backed
>> users, but considering that this use case is rare, we could do
>> optimizations on top.
>>
>> Note that this is only done when device IOTLB is not enabled. We
>> could use a similar method to optimize that case in the future.
>>
>> Tests show up to about 22% improvement on TX PPS when using
>> virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
>>
>>          SMAP on | SMAP off
>> Before: 5.0Mpps | 6.6Mpps
>> After:  6.1Mpps | 7.4Mpps
>>
>> Signed-off-by: Jason Wang <jasowang@...hat.com>
>> ---
>>   drivers/vhost/vhost.c | 288 +++++++++++++++++++++++++++++++++++++++++-
>>   drivers/vhost/vhost.h |  13 ++
>>   mm/shmem.c            |   1 +
>>   3 files changed, 300 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 37e2cac8e8b0..096ae3298d62 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>   		vq->indirect = NULL;
>>   		vq->heads = NULL;
>>   		vq->dev = dev;
>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>   		mutex_init(&vq->mutex);
>>   		vhost_vq_reset(dev, vq);
>>   		if (vq->handle_kick)
>> @@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
>>   	return sizeof(*vq->desc) * num;
>>   }
>>   
>> +static void vhost_uninit_vmap(struct vhost_vmap *map)
>> +{
>> +	if (map->addr)
>> +		vunmap(map->unmap_addr);
>> +
>> +	map->addr = NULL;
>> +	map->unmap_addr = NULL;
>> +}
>> +
>> +static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
>> +				 struct vhost_vmap *map,
>> +				 unsigned long ustart,
>> +				 size_t size,
>> +				 unsigned long start,
>> +				 unsigned long end,
>> +				 bool blockable)
>> +{
>> +	if (end < ustart || start > ustart - 1 + size)
>> +		return 0;
>> +
>> +	if (!blockable)
>> +		return -EAGAIN;
>> +
>> +	mutex_lock(&vq->mutex);
>> +	vhost_uninit_vmap(map);
>> +	mutex_unlock(&vq->mutex);
>> +
>> +	return 0;
>> +}
>> +
>> +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
>> +					const struct mmu_notifier_range *range)
>> +{
>> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> +					     mmu_notifier);
>> +	int i;
>> +
>> +	for (i = 0; i < dev->nvqs; i++) {
>> +		struct vhost_virtqueue *vq = dev->vqs[i];
>> +
>> +		if (vhost_invalidate_vmap(vq, &vq->avail_ring,
>> +					  (unsigned long)vq->avail,
>> +					  vhost_get_avail_size(vq, vq->num),
>> +					  range->start, range->end,
>> +					  range->blockable))
>> +			return -EAGAIN;
>> +		if (vhost_invalidate_vmap(vq, &vq->desc_ring,
>> +					  (unsigned long)vq->desc,
>> +					  vhost_get_desc_size(vq, vq->num),
>> +					  range->start, range->end,
>> +					  range->blockable))
>> +			return -EAGAIN;
>> +		if (vhost_invalidate_vmap(vq, &vq->used_ring,
>> +					  (unsigned long)vq->used,
>> +					  vhost_get_used_size(vq, vq->num),
>> +					  range->start, range->end,
>> +					  range->blockable))
>> +			return -EAGAIN;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
>> +	.invalidate_range_start = vhost_invalidate_range_start,
>> +};
>> +
>>   /* Caller should have device mutex */
>>   long vhost_dev_set_owner(struct vhost_dev *dev)
>>   {
> It seems questionable to merely track .invalidate_range_start.
> Don't we care about keeping pages young/accessed?


My understanding is that the young/accessed tracking is only needed for a
secondary MMU where the hva is not used. This is not the case for vhost,
since the guest will access those pages through the userspace address anyway.


> MMU will think they aren't and will penalize vhost by pushing
> them out.
>
> I note that MMU documentation says
>          * invalidate_range_start() and invalidate_range_end() must be
>           * paired
> and it seems questionable that they are not paired here.


I can see some existing users with an unpaired invalidate_range_start().
Maybe I am missing something, but I cannot find anything that we need to do
after the page is unmapped.
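
For reference, the per-ring decision that the quoted vhost_invalidate_vmap()
makes inside invalidate_range_start() can be modelled in user space roughly
as below. This is only a sketch: struct ring_map, ring_overlaps() and
ring_invalidate() are hypothetical names that are not part of the patch, and
the locking is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for one mapped ring (not the kernel struct). */
struct ring_map {
	unsigned long ustart; /* userspace address of the ring */
	size_t size;          /* ring size in bytes */
	bool mapped;          /* true while a kernel mapping would exist */
};

/*
 * Same range test as the quoted vhost_invalidate_vmap(): the ring occupies
 * [ustart, ustart + size - 1] and [start, end] is the invalidated range.
 */
static bool ring_overlaps(const struct ring_map *m,
			  unsigned long start, unsigned long end)
{
	assert(m->size > 0);
	return !(end < m->ustart || start > m->ustart - 1 + m->size);
}

/* Returns 0 when the caller may proceed, -EAGAIN when it would have to block. */
static int ring_invalidate(struct ring_map *m, unsigned long start,
			   unsigned long end, bool blockable)
{
	if (!ring_overlaps(m, start, end))
		return 0;               /* range does not touch this ring */
	if (!blockable)
		return -EAGAIN;         /* the patch would need vq->mutex here */
	m->mapped = false;              /* stands in for vhost_uninit_vmap() */
	return 0;
}
```

In this model nothing is left to do once the mapping is dropped; the next
round of metadata prefetching would re-establish it.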


>
>
> I also wonder about things like write-protecting the pages.
> It does not look like a range is invalidated when page
> is write-protected, even though I might have missed that.
> If not we can be corrupting memory in a variety of ways
> e.g. when using KSM, or with COW.


Yes, we probably need to implement a change_pte() method which will do the
vunmap().
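
A rough user-space model of what such a hook could check is below. It is a
sketch, not patch code: struct model_vmap, model_change_pte() and the 4 KiB
page size are assumptions made for illustration, and the real callback would
have to take vq->mutex before calling vunmap().

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct vhost_vmap; only addr is modelled. */
struct model_vmap {
	void *addr;
};

/*
 * Model of a change_pte()-style hook: if the page containing `address`
 * overlaps the ring at [ustart, ustart + size), drop the mapping.
 * Returns true when the page overlaps the ring.
 */
static bool model_change_pte(struct model_vmap *map, unsigned long ustart,
			     size_t size, unsigned long address)
{
	unsigned long page = address & ~(MODEL_PAGE_SIZE - 1);

	if (size == 0 || page + MODEL_PAGE_SIZE <= ustart ||
	    page >= ustart + size)
		return false;

	map->addr = NULL; /* kernel side: vunmap() with vq->mutex held */
	return true;
}
```

The comparison is done at page granularity because a ring does not have to
start on a page boundary, so a PTE change can affect the ring even when the
reported address is below the ring's start.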

Thanks