Message-ID: <f7f8a848-6b63-a4c4-469e-9c019a4cfc91@redhat.com>
Date: Wed, 29 Aug 2018 15:56:32 +0800
From: Jason Wang <jasowang@...hat.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: "Jon Olson (Google Drive)" <jonolson@...gle.com>,
"Michael S. Tsirkin" <mst@...hat.com>, caleb.raitto@...il.com,
David Miller <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>,
Caleb Raitto <caraitto@...gle.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
On 2018-08-29 03:57, Willem de Bruijn wrote:
> On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@...hat.com> wrote:
>>
>>
>> On 2018-07-25 08:17, Jon Olson wrote:
>>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@...hat.com> wrote:
>>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>>>> From the above linked patch, I understand that there are yet
>>>>>>> other special cases in production, such as a hard cap on #tx queues to
>>>>>>> 32 regardless of number of vcpus.
>>>>>> I don't think upstream kernels have this limit - we can
>>>>>> now use vmalloc for higher number of queues.
>>>>> Yes, that patch* mentioned it as a Google Compute Engine imposed
>>>>> limit. It is exactly such cloud provider imposed rules that I'm
>>>>> concerned about working around in upstream drivers.
>>>>>
>>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>>>> Yea. Why does GCE do it btw?
>>> There are a few reasons for the limit, some historical, some current.
>>>
>>> Historically we did this because of a kernel limit on the number of
>>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
>>> the limit upstream at the time we did it was actually eight. We had
>>> increased the limit from eight to 32 internally, and it appears that
>>> upstream has subsequently increased it to 256. We no longer
>>> use TAP for networking, so that constraint no longer applies for us,
>>> but when looking at removing/raising the limit we discovered no
>>> workloads that clearly benefited from lifting it, and it also placed
>>> more pressure on our virtual networking stack particularly on the Tx
>>> side. We left it as-is.
>>>
>>> In terms of current reasons there are really two. One is memory usage.
>>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
>>> that the guest will have an Rx queue for every Tx queue. We run our
>>> individual virtqueues fairly deep (4096 entries) to give guests a wide
>>> time window for re-posting Rx buffers and avoiding starvation on
>>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
>>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
>>> be up to 512MB of memory posted for network buffers. Scaling this to
>>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
>>> all of the Rx rings full would (in the large average Rx packet size
>>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
>>> of RAM available, but I don't believe we've observed a situation where
>>> they would have benefited from having 2.5 gigs of buffers posted for
>>> incoming network traffic :)
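
The memory figures quoted above can be re-derived with a few lines of
arithmetic. This is only a sketch: the helper names rx_ring_bytes() and
rx_total_bytes() are made up for illustration and are not part of
virtio_net, and it assumes every ring entry holds one max-sized 4096-byte
mergeable buffer, as in the description above.

```c
#include <stdint.h>

/* Bytes posted when a single Rx ring is completely filled.
 * Hypothetical helper, not a virtio_net function. */
static uint64_t rx_ring_bytes(uint64_t entries, uint64_t buf_bytes)
{
	return entries * buf_bytes;
}

/* Bytes posted across all Rx queues when every ring is kept full. */
static uint64_t rx_total_bytes(uint64_t queues, uint64_t entries,
			       uint64_t buf_bytes)
{
	return queues * rx_ring_bytes(entries, buf_bytes);
}
```

With 4096 entries of 4096 bytes each, rx_ring_bytes() gives 16 MiB per
ring; rx_total_bytes() then gives 512 MiB for 32 queues and 2560 MiB
(2.5 GiB) for 160 queues, matching the numbers in the quoted text.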
>> We can work to have async txq and rxq instead of pairs if there's a
>> strong requirement.
>>
>>> The second reason is interrupt related -- as I mentioned above, we
>>> have found no workloads that clearly benefit from so many queues, but
>>> we have found workloads that degrade. In particular workloads that do
>>> a lot of small packet processing but which aren't extremely latency
>>> sensitive can achieve higher PPS by taking fewer interrupts across
>>> fewer VCPUs due to better batching (this also incurs higher latency,
>>> but at the limit the "busy" cores end up suppressing most interrupts
>>> and spending most of their cycles farming out work). Memcache is a
>>> good example here, particularly if the latency targets for request
>>> completion are in the ~milliseconds range (rather than the
>>> microseconds we typically strive for with TCP_RR-style workloads).
>>>
>>> All of that said, we haven't been forthcoming with data (and
>>> unfortunately I don't have it handy in a useful form, otherwise I'd
>>> simply post it here), so I understand the hesitation to simply run
>>> with napi_tx across the board. As Willem said, this patch seemed like
>>> the least disruptive way to allow us to continue down the road of
>>> "universal" NAPI Tx and to hopefully get data across enough workloads
>>> (with VMs small, large, and absurdly large :) to present a compelling
>>> argument in one direction or another. As far as I know there aren't
>>> currently any NAPI related ethtool commands (based on a quick perusal
>>> of ethtool.h)
>> As I suggested before, maybe we can (ab)use tx-frames-irq.
> I forgot to respond to this originally, but I agree.
>
> How about something like the snippet below? It would be simpler to
> reason about if we only allowed switching while the device is down,
> but napi does not strictly require that.
>
> +static int virtnet_set_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	const u32 tx_coalesce_napi_mask = (1 << 16);
> +	const struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_SCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int napi_weight = 0;
> +	bool running;
> +	int i;
> +
> +	if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> +		ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> +		napi_weight = NAPI_POLL_WEIGHT;
> +	}
> +
> +	/* disallow changes to fields not explicitly tested above */
> +	if (memcmp(ec, &ec_default, sizeof(ec_default)))
> +		return -EINVAL;
> +
> +	if (napi_weight ^ vi->sq[0].napi.weight) {
> +		running = netif_running(vi->dev);
> +
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			vi->sq[i].napi.weight = napi_weight;
> +
> +			if (!running)
> +				continue;
> +
> +			if (napi_weight)
> +				virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> +						       &vi->sq[i].napi);
> +			else
> +				napi_disable(&vi->sq[i].napi);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_get_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	const u32 tx_coalesce_napi_mask = (1 << 16);
> +	const struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_GCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	memcpy(ec, &ec_default, sizeof(ec_default));
> +
> +	if (vi->sq[0].napi.weight)
> +		ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> +
> +	return 0;
> +}
Looks good. Just one nit: maybe it's better to simply check against zero?
Thanks
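
For readers following the encoding in the quoted snippet, here is a
small userspace sketch of how bit 16 of tx_max_coalesced_frames would
carry the napi-tx flag while the low bits keep the frame count. The
macro and helper names below are hypothetical and chosen only for
illustration; only the bit position (1 << 16) is taken from the snippet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit as tx_coalesce_napi_mask in the quoted snippet. */
#define TX_COALESCE_NAPI_MASK (UINT32_C(1) << 16)

/* Combine a frame count with the napi-tx flag (hypothetical helper). */
static uint32_t tx_frames_encode(uint32_t frames, bool napi)
{
	return napi ? (frames | TX_COALESCE_NAPI_MASK) : frames;
}

/* True if the napi-tx flag bit is set in the encoded value. */
static bool tx_frames_napi(uint32_t value)
{
	return (value & TX_COALESCE_NAPI_MASK) != 0;
}

/* Frame count with the flag bit masked off. */
static uint32_t tx_frames_count(uint32_t value)
{
	return value & ~TX_COALESCE_NAPI_MASK;
}
```

Under this encoding, and assuming the snippet is applied as quoted, a
tx-frames value of 65537 (0x10001) would request napi-tx and a value of
1 would turn it off; any other frame count is rejected by the memcmp()
check against ec_default.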