[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4a5b4456-0f3e-40b3-849c-930ebb97ef3c@lab.ntt.co.jp>
Date: Mon, 2 Jul 2018 11:45:33 +0900
From: Toshiaki Makita <makita.toshiaki@....ntt.co.jp>
To: Jason Wang <jasowang@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>
Cc: kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org,
netdev@...r.kernel.org,
Tonghao Zhang <zhangtonghao@...ichuxing.com>
Subject: Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll
Hi Jason,
On 2018/06/29 18:30, Jason Wang wrote:
> On 2018年06月29日 16:09, Toshiaki Makita wrote:
...
>> To fix this, poll the work instead of enabling notification when
>> busypoll is interrupted by something. IMHO signal_pending() and
>> vhost_has_work() are kind of interruptions rather than signals to
>> completely cancel the busypoll, so let's run busypoll after the
>> necessary work is done. In order to avoid too long busyloop due to
>> interruption, save the endtime in vq field and use it when reentering
>> the work function.
>
> I think we don't care long busyloop unless e.g tx can starve rx?
I just want to keep it user-controllable. Unless memorizing it busypoll
can run unexpectedly long.
>> I observed this problem makes tx performance poor but does not rx. I
>> guess it is because rx notification from socket is not that heavy so
>> did not impact the performance even when we failed to mask the
>> notification.
>
> I think the reason is:
>
> For tx, we may only process used ring under handle_tx(). Busy polling
> code can not recognize this case.
> For rx, we call peek_head_len() after exiting busy loop, so if the work
> is for rx, it can be processed in handle_rx() immediately.
Sorry but I don't see the difference. Tx busypoll calls
vhost_get_vq_desc() immediately after busypoll exits in
vhost_net_tx_get_vq_desc().
The scenario I described above (in 2nd paragraph) is a bit simplified.
To be clearer what I think is happening is:
1. handle_tx() runs busypoll with notification enabled (due to zerocopy
callback or something).
2. Guest produces a packet in vring.
3. handle_tx() busypoll detects the produced packet and get it.
4. While vhost is processing the packet, guest kicks vring because it
produced the packet. Vring notification is disabled automatically by
event_idx and tx work is queued.
5. After processing the packet vhost enters busyloop again, but detects
tx work and exits busyloop immediately. Then handle_tx() exits after
enabling the notification.
6. Because the queued work was tx, handle_tx() is called immediately
again. handle_tx() runs busypoll with notification enabled.
7. Repeat 2-6.
>
>> Anyway for consistency I fixed rx routine as well as tx.
>>
>> Performance numbers:
>>
>> - Bulk transfer from guest to external physical server.
>> [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]
>
> Just to confirm in this case since zerocopy is enabled, we are in fact
> use the generic XDP datapath?
For some reason zerocopy was not applied for most packets, so in most
cases driver XDP was used. I was going to dig into it but not yet.
>
>> - Set 10us busypoll.
>> - Guest disables checksum and TSO because of host XDP.
>> - Measured single flow Mbps by netperf, and kicks by perf kvm stat
>> (EPT_MISCONFIG event).
>>
>> Before After
>> Mbps kicks/s Mbps kicks/s
>> UDP_STREAM 1472byte 247758 27
>> Send 3645.37 6958.10
>> Recv 3588.56 6958.10
>> 1byte 9865 37
>> Send 4.34 5.43
>> Recv 4.17 5.26
>> TCP_STREAM 8801.03 45794 9592.77 2884
>
> Impressive numbers.
>
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@....ntt.co.jp>
>> ---
...
>> +static bool vhost_busy_poll_interrupted(struct vhost_dev *dev)
>> +{
>> + return unlikely(signal_pending(current)) || vhost_has_work(dev);
>> }
>> static void vhost_net_disable_vq(struct vhost_net *n,
>> @@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct
>> vhost_net *net,
>> if (r == vq->num && vq->busyloop_timeout) {
>> preempt_disable();
>> - endtime = busy_clock() + vq->busyloop_timeout;
>> - while (vhost_can_busy_poll(vq->dev, endtime) &&
>> - vhost_vq_avail_empty(vq->dev, vq))
>> + if (vq->busyloop_endtime) {
>> + endtime = vq->busyloop_endtime;
>> + vq->busyloop_endtime = 0;
>
> Looks like endtime may be before current time. So we skip busy loop in
> this case.
Sure, I'll add a condition.
>
>> + } else {
>> + endtime = busy_clock() + vq->busyloop_timeout;
>> + }
>> + while (vhost_can_busy_poll(endtime)) {
>> + if (vhost_busy_poll_interrupted(vq->dev)) {
>> + vq->busyloop_endtime = endtime;
>> + break;
>> + }
>> + if (!vhost_vq_avail_empty(vq->dev, vq))
>> + break;
>> cpu_relax();
>> + }
>> preempt_enable();
>> r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>> out_num, in_num, NULL, NULL);
...
>> @@ -642,39 +658,51 @@ static void vhost_rx_signal_used(struct
>> vhost_net_virtqueue *nvq)
>> static int vhost_net_rx_peek_head_len(struct vhost_net *net,
>> struct sock *sk)
>> {
>> - struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
>> - struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>> - struct vhost_virtqueue *vq = &nvq->vq;
>> + struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
>> + struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
>> + struct vhost_virtqueue *rvq = &rnvq->vq;
>> + struct vhost_virtqueue *tvq = &tnvq->vq;
>
> NIt: I admit the variable name is confusing. It would be better to do
> the renaming in a separate patch to ease review.
OK.
>
>> unsigned long uninitialized_var(endtime);
>> - int len = peek_head_len(rvq, sk);
>> + int len = peek_head_len(rnvq, sk);
>> - if (!len && vq->busyloop_timeout) {
>> + if (!len && tvq->busyloop_timeout) {
>> /* Flush batched heads first */
>> - vhost_rx_signal_used(rvq);
>> + vhost_rx_signal_used(rnvq);
>> /* Both tx vq and rx socket were polled here */
>> - mutex_lock_nested(&vq->mutex, 1);
>> - vhost_disable_notify(&net->dev, vq);
>> + mutex_lock_nested(&tvq->mutex, 1);
>> + vhost_disable_notify(&net->dev, tvq);
>> preempt_disable();
>> - endtime = busy_clock() + vq->busyloop_timeout;
>> + if (rvq->busyloop_endtime) {
>> + endtime = rvq->busyloop_endtime;
>> + rvq->busyloop_endtime = 0;
>> + } else {
>> + endtime = busy_clock() + tvq->busyloop_timeout;
>> + }
>> - while (vhost_can_busy_poll(&net->dev, endtime) &&
>> - !sk_has_rx_data(sk) &&
>> - vhost_vq_avail_empty(&net->dev, vq))
>> + while (vhost_can_busy_poll(endtime)) {
>> + if (vhost_busy_poll_interrupted(&net->dev)) {
>> + rvq->busyloop_endtime = endtime;
>> + break;
>> + }
>> + if (sk_has_rx_data(sk) ||
>> + !vhost_vq_avail_empty(&net->dev, tvq))
>> + break;
>> cpu_relax();
>
> Actually, I plan to poll guest rx refilling as well here to avoid rx
> kicks. Want to draft another patch to do it as well? It's as simple as
> add a vhost_vq_avail_empty for rvq above.
OK. will try it.
>
>> + }
>> preempt_enable();
>> - if (!vhost_vq_avail_empty(&net->dev, vq))
>> - vhost_poll_queue(&vq->poll);
>> - else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>> - vhost_disable_notify(&net->dev, vq);
>> - vhost_poll_queue(&vq->poll);
>> + if (!vhost_vq_avail_empty(&net->dev, tvq)) {
>> + vhost_poll_queue(&tvq->poll);
>> + } else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
>> + vhost_disable_notify(&net->dev, tvq);
>> + vhost_poll_queue(&tvq->poll);
>> }
>> - mutex_unlock(&vq->mutex);
>> + mutex_unlock(&tvq->mutex);
>> - len = peek_head_len(rvq, sk);
>> + len = peek_head_len(rnvq, sk);
>> }
>> return len;
...
>> @@ -889,7 +918,12 @@ static void handle_rx(struct vhost_net *net)
>> goto out;
>> }
>> }
>> - vhost_net_enable_vq(net, vq);
>> + if (unlikely(vq->busyloop_endtime)) {
>> + /* Busy poll is interrupted. */
>> + vhost_poll_queue(&vq->poll);
>> + } else {
>> + vhost_net_enable_vq(net, vq);
>> + }
>
> This is to reduce the rx wake ups. Better with another patch.
>
> So I suggest to split the patches into:
>
> 1 better naming of variable of vhost_net_rx_peek_head_len().
> 2 avoid tx kicks (most of what this patch did)
> 3 avoid rx wakeups (exactly the above 6 lines did)
> 4 avoid rx kicks
OK, I'll reorganize the patch.
Thank you for your feedback!
>
> Btw, tonghao is doing some refactoring of busy polling as well. Depends
> on the order of being merged, one of you may need rebasing.
Thanks for sharing. I'll take a look.
--
Toshiaki Makita
Powered by blists - more mailing lists