netdev - Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8ef53a5c-ad4e-fadd-b460-18b3e589ead9@redhat.com>
Date:   Thu, 27 Dec 2018 17:39:21 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel
 virtual address


On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>>>
>>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>>>> access the metadata directly.
>>>>>>>>>>
>>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>>>
>>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>
>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>> After:  ~6.1Mpps
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@...hat.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>       drivers/vhost/vhost.h |  11 +++
>>>>>>>>>>       2 files changed, 189 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>>       		vq->indirect = NULL;
>>>>>>>>>>       		vq->heads = NULL;
>>>>>>>>>>       		vq->dev = dev;
>>>>>>>>>> +		memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>>>> +		memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>>>> +		memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>>>       		mutex_init(&vq->mutex);
>>>>>>>>>>       		vhost_vq_reset(dev, vq);
>>>>>>>>>>       		if (vq->handle_kick)
>>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>>       	spin_unlock(&dev->iotlb_lock);
>>>>>>>>>>       }
>>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>> +			   size_t size, int write)
>>>>>>>>>> +{
>>>>>>>>>> +	struct page **pages;
>>>>>>>>>> +	int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>>>> +	int npinned;
>>>>>>>>>> +	void *vaddr;
>>>>>>>>>> +
>>>>>>>>>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>> +	if (!pages)
>>>>>>>>>> +		return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +	npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>>>> +	if (npinned != npages)
>>>>>>>>>> +		goto err;
>>>>>>>>>> +
>>>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>>>> implementation in particular isn't a good idea
>>>>>>>>> as it keeps the page around forever.
>>>>>> The pages wil be released during set features.
>>>>>>
>>>>>>
>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>>>> consider the gain we have?
>>>>> We as in vhost? networking isn't the only thing guest does.
>>>>> We don't even know if this guest does a lot of networking.
>>>>> You don't
>>>>> know what else is in this huge page. Can be something very important
>>>>> that guest touches all the time.
>>>> Well, the probability should be very small consider we usually give several
>>>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>>>> hugepage with metadata can still be merged by THP.  Anyway, I can test the
>>>> differences.
>>> Thanks!
>>>
>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>>>> in two nodes.
>>>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>>>> other functionality of the guest.
>>>>>
>>>>> That's what bothers me with this approach.
>>>> So:
>>>>
>>>> - The rest of the pages could still be balanced to other nodes, no?
>>>>
>>>> - try to balance metadata pages (belongs to co-operate processes) itself is
>>>> still questionable
>>> I am not sure why. It should be easy enough to force the VCPU and vhost
>>> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
>>> Clearly sometimes this would be necessary for load balancing reasons.
>>
>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>> manual pinning.
> ... of memory. Yes.
>
>
>>> With autonuma after a while (could take seconds but it will happen) the
>>> memory will migrate.
>>>
>> Yes. As you mentioned during the discuss, I wonder we could do it similarly
>> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
>> Unpin and remove kvm_arch->apic_access_page")
> That would be a possible approach.


Yes, this looks possible, and the conversion seems not hard. Let me have 
a try with this.


[...]


>>>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>>>> step which makes some difference.
>>>>>> The problem is not kthread but the address of userspace address. The
>>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>>>
>>>>>> Thanks.
>>>>> It's true for a real life applications but a malicious one
>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>> all about malcious applications.
>>>> We don't do this in the path of ioctl, there's no context switch between
>>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>>>> from accessing userspace pages unexpectedly which is not the case for
>>>> metadata access.
>>>>
>>>> Thanks
>>> OK let's forget smap for now.
>>
>> Some numbers I measured:
>>
>> On an old Sandy bridge machine without SMAP support. Remove speculation
>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>
>> On a newer Broadwell machine with SMAP support. Remove speculation barrier
>> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
>> only bypass SMAP for metadata).
>>
>> So it looks like for recent machine, SMAP becomes pain point when the copy
>> is short (e.g 64B) for high PPS.
>>
>> Thanks
> Thanks a lot for looking into this!
>
> So first of all users can just boot with nosmap, right?
> What's wrong with that?


Nothing wrong, just realize we had this kernel parameter.


>   Yes it's not fine-grained but OTOH
> it's easy to understand.
>
> And I guess this confirms that if we are going to worry
> about smap enabled, we need to look into packet copies
> too, not just meta-data.


For packet copies, we can do batch copy which is pretty simple for the 
case of XDP. I've already had patches for this.


>
> Vaguely could see a module option (off by default)
> where vhost basically does user_access_begin
> when it starts running, then uses unsafe accesses
> in vhost and tun and then user_access_end.


Using user_access_begin() is more tricky than imaged. E.g it requires:

- userspace address to be validated before through access_ok() [1]

- It doesn't support calling a function that does explicit schedule 
since SMAP/PAN state is not maintained through schedule() [2]

[1] https://lwn.net/Articles/736348/

[2] https://lkml.org/lkml/2018/11/23/430

So calling user_access_begin() all the time when vhost is running seems 
pretty dangerous.

For a better batched datacopy, I tend to build not only XDP but also skb 
in vhost in the future.

Thanks


>
>
>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>> accessing metadata directly, we should do it as well.
>>>>>>>>
>>>>>>>> Thanks