[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8ef53a5c-ad4e-fadd-b460-18b3e589ead9@redhat.com>
Date: Thu, 27 Dec 2018 17:39:21 +0800
From: Jason Wang <jasowang@...hat.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel
virtual address
On 2018/12/26 下午11:02, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
>> On 2018/12/25 下午8:50, Michael S. Tsirkin wrote:
>>> On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
>>>> On 2018/12/25 上午2:10, Michael S. Tsirkin wrote:
>>>>> On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/14 下午8:36, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
>>>>>>>> On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>>>>>>>>>> It was noticed that the copy_user() friends that was used to access
>>>>>>>>>> virtqueue metdata tends to be very expensive for dataplane
>>>>>>>>>> implementation like vhost since it involves lots of software check,
>>>>>>>>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>>>>>>>>> extra cost will be more obvious when transferring small packets.
>>>>>>>>>>
>>>>>>>>>> This patch tries to eliminate those overhead by pin vq metadata pages
>>>>>>>>>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>>>>>>>>>> those mappings and memory accessors are modified to use pointers to
>>>>>>>>>> access the metadata directly.
>>>>>>>>>>
>>>>>>>>>> Note, this was only done when device IOTLB is not enabled. We could
>>>>>>>>>> use similar method to optimize it in the future.
>>>>>>>>>>
>>>>>>>>>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>>>>>>>>>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>>>>>>>>>
>>>>>>>>>> Before: ~5.0Mpps
>>>>>>>>>> After: ~6.1Mpps
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jason Wang<jasowang@...hat.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>> drivers/vhost/vhost.h | 11 +++
>>>>>>>>>> 2 files changed, 189 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>>>> index bafe39d2e637..1bd24203afb6 100644
>>>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>>>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>>>>>>> vq->indirect = NULL;
>>>>>>>>>> vq->heads = NULL;
>>>>>>>>>> vq->dev = dev;
>>>>>>>>>> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>>>>>>>>>> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>>>>>>>>>> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>>>>>>>>>> mutex_init(&vq->mutex);
>>>>>>>>>> vhost_vq_reset(dev, vq);
>>>>>>>>>> if (vq->handle_kick)
>>>>>>>>>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>>>>>>>>> spin_unlock(&dev->iotlb_lock);
>>>>>>>>>> }
>>>>>>>>>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>>>>>>>>>> + size_t size, int write)
>>>>>>>>>> +{
>>>>>>>>>> + struct page **pages;
>>>>>>>>>> + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>>>>>>>>>> + int npinned;
>>>>>>>>>> + void *vaddr;
>>>>>>>>>> +
>>>>>>>>>> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>>>>>>>>>> + if (!pages)
>>>>>>>>>> + return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> + npinned = get_user_pages_fast(uaddr, npages, write, pages);
>>>>>>>>>> + if (npinned != npages)
>>>>>>>>>> + goto err;
>>>>>>>>>> +
>>>>>>>>> As I said I have doubts about the whole approach, but this
>>>>>>>>> implementation in particular isn't a good idea
>>>>>>>>> as it keeps the page around forever.
>>>>>> The pages wil be released during set features.
>>>>>>
>>>>>>
>>>>>>>>> So no THP, no NUMA rebalancing,
>>>>>> For THP, we will probably miss 2 or 4 pages, but does this really matter
>>>>>> consider the gain we have?
>>>>> We as in vhost? networking isn't the only thing guest does.
>>>>> We don't even know if this guest does a lot of networking.
>>>>> You don't
>>>>> know what else is in this huge page. Can be something very important
>>>>> that guest touches all the time.
>>>> Well, the probability should be very small consider we usually give several
>>>> gigabytes to guest. The rest of the pages that doesn't sit in the same
>>>> hugepage with metadata can still be merged by THP. Anyway, I can test the
>>>> differences.
>>> Thanks!
>>>
>>>>>> For NUMA rebalancing, I'm even not quite sure if
>>>>>> it can helps for the case of IPC (vhost). It looks to me the worst case it
>>>>>> may cause page to be thrash between nodes if vhost and userspace are running
>>>>>> in two nodes.
>>>>> So again it's a gain for vhost but has a completely unpredictable effect on
>>>>> other functionality of the guest.
>>>>>
>>>>> That's what bothers me with this approach.
>>>> So:
>>>>
>>>> - The rest of the pages could still be balanced to other nodes, no?
>>>>
>>>> - try to balance metadata pages (belongs to co-operate processes) itself is
>>>> still questionable
>>> I am not sure why. It should be easy enough to force the VCPU and vhost
>>> to move (e.g. start them pinned to 1 cpu, then pin them to another one).
>>> Clearly sometimes this would be necessary for load balancing reasons.
>>
>> Yes, but it looks to me the part of motivation of auto NUMA is to avoid
>> manual pinning.
> ... of memory. Yes.
>
>
>>> With autonuma after a while (could take seconds but it will happen) the
>>> memory will migrate.
>>>
>> Yes. As you mentioned during the discuss, I wonder we could do it similarly
>> through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
>> Unpin and remove kvm_arch->apic_access_page")
> That would be a possible approach.
Yes, this looks possible, and the conversion seems not hard. Let me have
a try with this.
[...]
>>>>>>> I don't see how a kthread makes any difference. We do have a validation
>>>>>>> step which makes some difference.
>>>>>> The problem is not kthread but the address of userspace address. The
>>>>>> addresses of vq metadata tends to be consistent for a while, and vhost knows
>>>>>> they will be frequently. SMAP doesn't help too much in this case.
>>>>>>
>>>>>> Thanks.
>>>>> It's true for a real life applications but a malicious one
>>>>> can call the setup ioctls any number of times. And SMAP is
>>>>> all about malcious applications.
>>>> We don't do this in the path of ioctl, there's no context switch between
>>>> userspace and kernel in the worker thread. SMAP is used to prevent kernel
>>>> from accessing userspace pages unexpectedly which is not the case for
>>>> metadata access.
>>>>
>>>> Thanks
>>> OK let's forget smap for now.
>>
>> Some numbers I measured:
>>
>> On an old Sandy bridge machine without SMAP support. Remove speculation
>> barrier boost the performance from 4.6Mpps to 5.1Mpps
>>
>> On a newer Broadwell machine with SMAP support. Remove speculation barrier
>> only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
>> 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
>> only bypass SMAP for metadata).
>>
>> So it looks like for recent machine, SMAP becomes pain point when the copy
>> is short (e.g 64B) for high PPS.
>>
>> Thanks
> Thanks a lot for looking into this!
>
> So first of all users can just boot with nosmap, right?
> What's wrong with that?
Nothing wrong, just realize we had this kernel parameter.
> Yes it's not fine-grained but OTOH
> it's easy to understand.
>
> And I guess this confirms that if we are going to worry
> about smap enabled, we need to look into packet copies
> too, not just meta-data.
For packet copies, we can do batch copy which is pretty simple for the
case of XDP. I've already had patches for this.
>
> Vaguely could see a module option (off by default)
> where vhost basically does user_access_begin
> when it starts running, then uses unsafe accesses
> in vhost and tun and then user_access_end.
Using user_access_begin() is more tricky than imaged. E.g it requires:
- userspace address to be validated before through access_ok() [1]
- It doesn't support calling a function that does explicit schedule
since SMAP/PAN state is not maintained through schedule() [2]
[1] https://lwn.net/Articles/736348/
[2] https://lkml.org/lkml/2018/11/23/430
So calling user_access_begin() all the time when vhost is running seems
pretty dangerous.
For a better batched datacopy, I tend to build not only XDP but also skb
in vhost in the future.
Thanks
>
>
>>>>>>>> Packet or AF_XDP benefit from
>>>>>>>> accessing metadata directly, we should do it as well.
>>>>>>>>
>>>>>>>> Thanks
Powered by blists - more mailing lists