netdev - Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4a5b4456-0f3e-40b3-849c-930ebb97ef3c@lab.ntt.co.jp>
Date:   Mon, 2 Jul 2018 11:45:33 +0900
From:   Toshiaki Makita <makita.toshiaki@....ntt.co.jp>
To:     Jason Wang <jasowang@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>
Cc:     kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org,
        netdev@...r.kernel.org,
        Tonghao Zhang <zhangtonghao@...ichuxing.com>
Subject: Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll

Hi Jason,

On 2018/06/29 18:30, Jason Wang wrote:
> On 2018年06月29日 16:09, Toshiaki Makita wrote:
...
>> To fix this, poll the work instead of enabling notification when
>> busypoll is interrupted by something. IMHO signal_pending() and
>> vhost_has_work() are kind of interruptions rather than signals to
>> completely cancel the busypoll, so let's run busypoll after the
>> necessary work is done. In order to avoid too long busyloop due to
>> interruption, save the endtime in vq field and use it when reentering
>> the work function.
> 
> I think we don't care long busyloop unless e.g tx can starve rx?

I just want to keep it user-controllable. Unless memorizing it busypoll
can run unexpectedly long.

>> I observed this problem makes tx performance poor but does not rx. I
>> guess it is because rx notification from socket is not that heavy so
>> did not impact the performance even when we failed to mask the
>> notification.
> 
> I think the reason is:
> 
> For tx, we may only process used ring under handle_tx(). Busy polling
> code can not recognize this case.
> For rx, we call peek_head_len() after exiting busy loop, so if the work
> is for rx, it can be processed in handle_rx() immediately.

Sorry but I don't see the difference. Tx busypoll calls
vhost_get_vq_desc() immediately after busypoll exits in
vhost_net_tx_get_vq_desc().

The scenario I described above (in 2nd paragraph) is a bit simplified.
To be clearer what I think is happening is:

1. handle_tx() runs busypoll with notification enabled (due to zerocopy
   callback or something).
2. Guest produces a packet in vring.
3. handle_tx() busypoll detects the produced packet and get it.
4. While vhost is processing the packet, guest kicks vring because it
   produced the packet. Vring notification is disabled automatically by
   event_idx and tx work is queued.
5. After processing the packet vhost enters busyloop again, but detects
   tx work and exits busyloop immediately. Then handle_tx() exits after
   enabling the notification.
6. Because the queued work was tx, handle_tx() is called immediately
   again. handle_tx() runs busypoll with notification enabled.
7. Repeat 2-6.

> 
>> Anyway for consistency I fixed rx routine as well as tx.
>>
>> Performance numbers:
>>
>> - Bulk transfer from guest to external physical server.
>>      [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]
> 
> Just to confirm in this case since zerocopy is enabled, we are in fact
> use the generic XDP datapath?

For some reason zerocopy was not applied for most packets, so in most
cases driver XDP was used. I was going to dig into it but not yet.

> 
>> - Set 10us busypoll.
>> - Guest disables checksum and TSO because of host XDP.
>> - Measured single flow Mbps by netperf, and kicks by perf kvm stat
>>    (EPT_MISCONFIG event).
>>
>>                              Before              After
>>                            Mbps  kicks/s      Mbps  kicks/s
>> UDP_STREAM 1472byte              247758                 27
>>                  Send   3645.37            6958.10
>>                  Recv   3588.56            6958.10
>>                1byte                9865                 37
>>                  Send      4.34               5.43
>>                  Recv      4.17               5.26
>> TCP_STREAM             8801.03    45794   9592.77     2884
> 
> Impressive numbers.
> 
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@....ntt.co.jp>
>> ---
...
>> +static bool vhost_busy_poll_interrupted(struct vhost_dev *dev)
>> +{
>> +    return unlikely(signal_pending(current)) || vhost_has_work(dev);
>>   }
>>     static void vhost_net_disable_vq(struct vhost_net *n,
>> @@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct
>> vhost_net *net,
>>         if (r == vq->num && vq->busyloop_timeout) {
>>           preempt_disable();
>> -        endtime = busy_clock() + vq->busyloop_timeout;
>> -        while (vhost_can_busy_poll(vq->dev, endtime) &&
>> -               vhost_vq_avail_empty(vq->dev, vq))
>> +        if (vq->busyloop_endtime) {
>> +            endtime = vq->busyloop_endtime;
>> +            vq->busyloop_endtime = 0;
> 
> Looks like endtime may be before current time. So we skip busy loop in
> this case.

Sure, I'll add a condition.

> 
>> +        } else {
>> +            endtime = busy_clock() + vq->busyloop_timeout;
>> +        }
>> +        while (vhost_can_busy_poll(endtime)) {
>> +            if (vhost_busy_poll_interrupted(vq->dev)) {
>> +                vq->busyloop_endtime = endtime;
>> +                break;
>> +            }
>> +            if (!vhost_vq_avail_empty(vq->dev, vq))
>> +                break;
>>               cpu_relax();
>> +        }
>>           preempt_enable();
>>           r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>>                         out_num, in_num, NULL, NULL);
...
>> @@ -642,39 +658,51 @@ static void vhost_rx_signal_used(struct
>> vhost_net_virtqueue *nvq)
>>     static int vhost_net_rx_peek_head_len(struct vhost_net *net,
>> struct sock *sk)
>>   {
>> -    struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
>> -    struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>> -    struct vhost_virtqueue *vq = &nvq->vq;
>> +    struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
>> +    struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
>> +    struct vhost_virtqueue *rvq = &rnvq->vq;
>> +    struct vhost_virtqueue *tvq = &tnvq->vq;
> 
> NIt: I admit the variable name is confusing. It would be better to do
> the renaming in a separate patch to ease review.

OK.

> 
>>       unsigned long uninitialized_var(endtime);
>> -    int len = peek_head_len(rvq, sk);
>> +    int len = peek_head_len(rnvq, sk);
>>   -    if (!len && vq->busyloop_timeout) {
>> +    if (!len && tvq->busyloop_timeout) {
>>           /* Flush batched heads first */
>> -        vhost_rx_signal_used(rvq);
>> +        vhost_rx_signal_used(rnvq);
>>           /* Both tx vq and rx socket were polled here */
>> -        mutex_lock_nested(&vq->mutex, 1);
>> -        vhost_disable_notify(&net->dev, vq);
>> +        mutex_lock_nested(&tvq->mutex, 1);
>> +        vhost_disable_notify(&net->dev, tvq);
>>             preempt_disable();
>> -        endtime = busy_clock() + vq->busyloop_timeout;
>> +        if (rvq->busyloop_endtime) {
>> +            endtime = rvq->busyloop_endtime;
>> +            rvq->busyloop_endtime = 0;
>> +        } else {
>> +            endtime = busy_clock() + tvq->busyloop_timeout;
>> +        }
>>   -        while (vhost_can_busy_poll(&net->dev, endtime) &&
>> -               !sk_has_rx_data(sk) &&
>> -               vhost_vq_avail_empty(&net->dev, vq))
>> +        while (vhost_can_busy_poll(endtime)) {
>> +            if (vhost_busy_poll_interrupted(&net->dev)) {
>> +                rvq->busyloop_endtime = endtime;
>> +                break;
>> +            }
>> +            if (sk_has_rx_data(sk) ||
>> +                !vhost_vq_avail_empty(&net->dev, tvq))
>> +                break;
>>               cpu_relax();
> 
> Actually, I plan to poll guest rx refilling as well here to avoid rx
> kicks. Want to draft another patch to do it as well? It's as simple as
> add a vhost_vq_avail_empty for rvq above.

OK. will try it.

> 
>> +        }
>>             preempt_enable();
>>   -        if (!vhost_vq_avail_empty(&net->dev, vq))
>> -            vhost_poll_queue(&vq->poll);
>> -        else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>> -            vhost_disable_notify(&net->dev, vq);
>> -            vhost_poll_queue(&vq->poll);
>> +        if (!vhost_vq_avail_empty(&net->dev, tvq)) {
>> +            vhost_poll_queue(&tvq->poll);
>> +        } else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
>> +            vhost_disable_notify(&net->dev, tvq);
>> +            vhost_poll_queue(&tvq->poll);
>>           }
>>   -        mutex_unlock(&vq->mutex);
>> +        mutex_unlock(&tvq->mutex);
>>   -        len = peek_head_len(rvq, sk);
>> +        len = peek_head_len(rnvq, sk);
>>       }
>>         return len;
...
>> @@ -889,7 +918,12 @@ static void handle_rx(struct vhost_net *net)
>>               goto out;
>>           }
>>       }
>> -    vhost_net_enable_vq(net, vq);
>> +    if (unlikely(vq->busyloop_endtime)) {
>> +        /* Busy poll is interrupted. */
>> +        vhost_poll_queue(&vq->poll);
>> +    } else {
>> +        vhost_net_enable_vq(net, vq);
>> +    }
> 
> This is to reduce the rx wake ups. Better with another patch.
> 
> So I suggest to split the patches into:
> 
> 1 better naming of variable of vhost_net_rx_peek_head_len().
> 2 avoid tx kicks (most of what this patch did)
> 3 avoid rx wakeups (exactly the above 6 lines did)
> 4 avoid rx kicks

OK, I'll reorganize the patch.
Thank you for your feedback!

> 
> Btw, tonghao is doing some refactoring of busy polling as well. Depends
> on the order of being merged, one of you may need rebasing.

Thanks for sharing. I'll take a look.

-- 
Toshiaki Makita