Message-ID: <e71141c4-fb33-cdcb-635a-5ce40267702f@redhat.com>
Date:   Mon, 10 Sep 2018 13:59:22 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     "Jon Olson (Google Drive)" <jonolson@...gle.com>,
        "Michael S. Tsirkin" <mst@...hat.com>, caleb.raitto@...il.com,
        David Miller <davem@...emloft.net>,
        Network Development <netdev@...r.kernel.org>,
        Caleb Raitto <caraitto@...gle.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.



On 2018-09-10 07:07, Willem de Bruijn wrote:
> On Wed, Aug 29, 2018 at 9:01 AM Willem de Bruijn
> <willemdebruijn.kernel@...il.com> wrote:
>> On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasowang@...hat.com> wrote:
>>>
>>>
>>> On 2018-08-29 03:57, Willem de Bruijn wrote:
>>>> On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@...hat.com> wrote:
>>>>>
>>>>> On 2018-07-25 08:17, Jon Olson wrote:
>>>>>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>>>>>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>>>>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>>>>>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>>>>>>> From the above linked patch, I understand that there are yet
>>>>>>>>>> other special cases in production, such as a hard cap on #tx queues to
>>>>>>>>>> 32 regardless of number of vcpus.
>>>>>>>>> I don't think upstream kernels have this limit - we can
>>>>>>>>> now use vmalloc for a higher number of queues.
>>>>>>>> Yes, that patch* mentioned it as a Google Compute Engine imposed
>>>>>>>> limit. It is exactly such cloud provider imposed rules that I'm
>>>>>>>> concerned about working around in upstream drivers.
>>>>>>>>
>>>>>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>>>>>>> Yea. Why does GCE do it btw?
>>>>>> There are a few reasons for the limit, some historical, some current.
>>>>>>
>>>>>> Historically we did this because of a kernel limit on the number of
>>>>>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
>>>>>> the limit upstream at the time we did it was actually eight. We had
>>>>>> increased the limit from eight to 32 internally, and it appears
>>>>>> the limit has subsequently been raised to 256 upstream. We no longer
>>>>>> use TAP for networking, so that constraint no longer applies for us,
>>>>>> but when looking at removing/raising the limit we discovered no
>>>>>> workloads that clearly benefited from lifting it, and it also placed
>>>>>> more pressure on our virtual networking stack, particularly on the Tx
>>>>>> side. We left it as-is.
>>>>>>
>>>>>> In terms of current reasons there are really two. One is memory usage.
>>>>>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
>>>>>> that the guest will have an Rx queue for every Tx queue. We run our
>>>>>> individual virtqueues fairly deep (4096 entries) to give guests a wide
>>>>>> time window for re-posting Rx buffers and avoiding starvation on
>>>>>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
>>>>>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
>>>>>> be up to 512MB of memory posted for network buffers. Scaling this to
>>>>>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160), keeping
>>>>>> all of the Rx rings full would (in the large average Rx packet size
>>>>>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
>>>>>> of RAM available, but I don't believe we've observed a situation where
>>>>>> they would have benefited from having 2.5 gigs of buffers posted for
>>>>>> incoming network traffic :)
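>>>>>>
>>>>>> (Spelling out that arithmetic, assuming one max-sized 4096-byte
>>>>>> buffer per ring entry:
>>>>>>
>>>>>>      4096 entries * 4096 bytes  = 16 MiB per Rx vring
>>>>>>        32 vrings  * 16 MiB      = 512 MiB
>>>>>>       160 vrings  * 16 MiB      = 2560 MiB ~= 2.5 GiB)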
>>>>> We can work to have async txq and rxq instead of pairs if there's a
>>>>> strong requirement.
>>>>>
>>>>>> The second reason is interrupt related -- as I mentioned above, we
>>>>>> have found no workloads that clearly benefit from so many queues, but
>>>>>> we have found workloads that degrade. In particular workloads that do
>>>>>> a lot of small packet processing but which aren't extremely latency
>>>>>> sensitive can achieve higher PPS by taking fewer interrupts across
>>>>>> fewer VCPUs due to better batching (this also incurs higher latency,
>>>>>> but at the limit the "busy" cores end up suppressing most interrupts
>>>>>> and spending most of their cycles farming out work). Memcache is a
>>>>>> good example here, particularly if the latency targets for request
>>>>>> completion are in the ~milliseconds range (rather than the
>>>>>> microseconds we typically strive for with TCP_RR-style workloads).
>>>>>>
>>>>>> All of that said, we haven't been forthcoming with data (and
>>>>>> unfortunately I don't have it handy in a useful form, otherwise I'd
>>>>>> simply post it here), so I understand the hesitation to simply run
>>>>>> with napi_tx across the board. As Willem said, this patch seemed like
>>>>>> the least disruptive way to allow us to continue down the road of
>>>>>> "universal" NAPI Tx and to hopefully get data across enough workloads
>>>>>> (with VMs small, large, and absurdly large :) to present a compelling
>>>>>> argument in one direction or another. As far as I know there aren't
>>>>>> currently any NAPI-related ethtool commands (based on a quick perusal
>>>>>> of ethtool.h).
>>>>> As I suggested before, maybe we can (ab)use tx-frames-irq.
>>>> I forgot to respond to this originally, but I agree.
>>>>
>>>> How about something like the snippet below? It would be simpler to
>>>> reason about if we only allowed switching while the device is down, but
>>>> napi does not strictly require that.
>>>>
>>>> +static int virtnet_set_coalesce(struct net_device *dev,
>>>> +                               struct ethtool_coalesce *ec)
>>>> +{
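>>>> +       /* (ab)use bit 16 of tx-frames to select tx napi mode */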
>>>> +       const u32 tx_coalesce_napi_mask = (1 << 16);
>>>> +       const struct ethtool_coalesce ec_default = {
>>>> +               .cmd = ETHTOOL_SCOALESCE,
>>>> +               .rx_max_coalesced_frames = 1,
>>>> +               .tx_max_coalesced_frames = 1,
>>>> +       };
>>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>>> +       int napi_weight = 0;
>>>> +       bool running;
>>>> +       int i;
>>>> +
>>>> +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
>>>> +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
>>>> +               napi_weight = NAPI_POLL_WEIGHT;
>>>> +       }
>>>> +
>>>> +       /* disallow changes to fields not explicitly tested above */
>>>> +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
>>>> +               return -EINVAL;
>>>> +
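>>>> +       /* napi weight changed: toggle tx napi on every queue */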
>>>> +       if (napi_weight ^ vi->sq[0].napi.weight) {
>>>> +               running = netif_running(vi->dev);
>>>> +
>>>> +               for (i = 0; i < vi->max_queue_pairs; i++) {
>>>> +                       vi->sq[i].napi.weight = napi_weight;
>>>> +
>>>> +                       if (!running)
>>>> +                               continue;
>>>> +
>>>> +                       if (napi_weight)
>>>> +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
>>>> +                                                      &vi->sq[i].napi);
>>>> +                       else
>>>> +                               napi_disable(&vi->sq[i].napi);
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +static int virtnet_get_coalesce(struct net_device *dev,
>>>> +                               struct ethtool_coalesce *ec)
>>>> +{
>>>> +       const u32 tx_coalesce_napi_mask = (1 << 16);
>>>> +       const struct ethtool_coalesce ec_default = {
>>>> +               .cmd = ETHTOOL_GCOALESCE,
>>>> +               .rx_max_coalesced_frames = 1,
>>>> +               .tx_max_coalesced_frames = 1,
>>>> +       };
>>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>>> +
>>>> +       memcpy(ec, &ec_default, sizeof(ec_default));
>>>> +
>>>> +       if (vi->sq[0].napi.weight)
>>>> +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
>>>> +
>>>> +       return 0;
>>>> +}
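>>>>
>>>> Userspace would then toggle the mode through the standard coalescing
>>>> interface, something like this (with the 1 << 16 mask above,
>>>> 65537 == 1 | (1 << 16); eth0 as a placeholder device):
>>>>
>>>>     ethtool -C eth0 tx-frames 65537   # enable tx napi
>>>>     ethtool -C eth0 tx-frames 1       # disable tx napi (default)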
>>> Looks good. Just one nit: maybe it's better to simply check against zero?
>> I wanted to avoid making napi and interrupt moderation mutually
>> exclusive. If the virtio-net driver ever gets true moderation support,
>> it should be able to work alongside napi.
>>
>> But I can make no-napi be 0 and napi be 1. That is future-proof, in
>> the sense that napi is enabled if there is any interrupt moderation.
> It's not appearing on patchwork yet, but I just sent a patch.
>
> I implemented the above, but .tx_frames of 0 is technically incorrect
> and it would unnecessarily constrain interrupt moderation to one of two
> modes. I went back to using a high bit. That said, if you feel strongly
> I'll change it.

Rethinking this, how about something like:

- UINT_MAX: no tx interrupt
- other value: tx interrupt with possible interrupt moderation
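
i.e., roughly this in virtnet_set_coalesce() (just a sketch of the idea,
untested):

    if (ec->tx_max_coalesced_frames == UINT_MAX)
            napi_weight = 0;                /* no tx interrupt */
    else
            napi_weight = NAPI_POLL_WEIGHT; /* tx interrupt, moderation possible */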

>
> I also tried various ways of switching between napi and non-napi
> mode without bringing the device down. This is quite fragile. At the
> very least napi.weight has to be updated without any interrupt or
> napi callback happening in between. So most of the datapath needs
> to be quiesced.
>
> I did code up a variant that manually stops all the queues, masks the
> interrupt and waits for napi to complete if scheduled. But in a stress
> test it still managed to trigger a BUG in napi_enable in this state.
>
> Napi is not switched at runtime in other devices, nor is it really
> needed. So instead I made this change conditional on the device being
> down.

I agree to start with this, but I'll cook a patch on top. Please refer
to the thread of the formal patch.

Thanks
