Message-ID: <CAF=yD-L+psndKfTFS0S33ZvO7NP4wFO=-Gr5aYw22_oqkx+smA@mail.gmail.com>
Date: Wed, 29 Aug 2018 09:01:44 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Jason Wang <jasowang@...hat.com>
Cc: "Jon Olson (Google Drive)" <jonolson@...gle.com>,
"Michael S. Tsirkin" <mst@...hat.com>, caleb.raitto@...il.com,
David Miller <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>,
Caleb Raitto <caraitto@...gle.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasowang@...hat.com> wrote:
>
> On 2018-08-29 03:57, Willem de Bruijn wrote:
> > On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@...hat.com> wrote:
> >>
> >>
> >> On 2018-07-25 08:17, Jon Olson wrote:
> >>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> >>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> >>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> >>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >>>>>>> From the above linked patch, I understand that there are yet
> >>>>>>> other special cases in production, such as a hard cap on #tx queues to
> >>>>>>> 32 regardless of number of vcpus.
> >>>>>> I don't think upstream kernels have this limit - we can
> >>>>>> now use vmalloc for higher number of queues.
> >>>>> Yes. that patch* mentioned it as a google compute engine imposed
> >>>>> limit. It is exactly such cloud provider imposed rules that I'm
> >>>>> concerned about working around in upstream drivers.
> >>>>>
> >>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> >>>> Yea. Why does GCE do it btw?
> >>> There are a few reasons for the limit, some historical, some current.
> >>>
> >>> Historically we did this because of a kernel limit on the number of
> >>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> >>> the limit upstream at the time we did it was actually eight. We had
> >>> increased the limit from eight to 32 internally, and it appears
> >>> upstream has subsequently increased it to 256. We no longer
> >>> use TAP for networking, so that constraint no longer applies for us,
> >>> but when looking at removing/raising the limit we discovered no
> >>> workloads that clearly benefited from lifting it, and it also placed
> >>> more pressure on our virtual networking stack particularly on the Tx
> >>> side. We left it as-is.
> >>>
> >>> In terms of current reasons there are really two. One is memory usage.
> >>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> >>> that the guest will have an Rx queue for every Tx queue. We run our
> >>> individual virtqueues fairly deep (4096 entries) to give guests a wide
> >>> time window for re-posting Rx buffers and avoiding starvation on
> >>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> >>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> >>> be up to 512MB of memory posted for network buffers. Scaling this to
> >>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> >>> all of the Rx rings full would (in the large average Rx packet size
> >>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> >>> of RAM available, but I don't believe we've observed a situation where
> >>> they would have benefited from having 2.5 gigs of buffers posted for
> >>> incoming network traffic :)
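
To make the arithmetic above concrete, here is a back-of-the-envelope
sketch (the ring depth, buffer size, and queue counts are the figures
quoted in this mail, not constants read from the driver):

#include <stdio.h>

int main(void)
{
	/* figures from the paragraph above: 4096-entry vrings filled
	 * with 4096-byte max-sized mergeable rx buffers
	 */
	const unsigned long ring_entries = 4096;
	const unsigned long buf_bytes = 4096;
	const unsigned long per_ring = ring_entries * buf_bytes;

	printf("per rx ring: %lu MB\n", per_ring >> 20);         /* 16 MB   */
	printf("32 queues:   %lu MB\n", (32 * per_ring) >> 20);  /* 512 MB  */
	printf("160 queues:  %lu MB\n", (160 * per_ring) >> 20); /* 2560 MB */

	return 0;
}
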
> >> We can work to have async txq and rxq instead of pairs if there's a
> >> strong requirement.
> >>
> >>> The second reason is interrupt related -- as I mentioned above, we
> >>> have found no workloads that clearly benefit from so many queues, but
> >>> we have found workloads that degrade. In particular workloads that do
> >>> a lot of small packet processing but which aren't extremely latency
> >>> sensitive can achieve higher PPS by taking fewer interrupts across
> >>> fewer VCPUs due to better batching (this also incurs higher latency,
> >>> but at the limit the "busy" cores end up suppressing most interrupts
> >>> and spending most of their cycles farming out work). Memcache is a
> >>> good example here, particularly if the latency targets for request
> >>> completion are in the ~milliseconds range (rather than the
> >>> microseconds we typically strive for with TCP_RR-style workloads).
> >>>
> >>> All of that said, we haven't been forthcoming with data (and
> >>> unfortunately I don't have it handy in a useful form, otherwise I'd
> >>> simply post it here), so I understand the hesitation to simply run
> >>> with napi_tx across the board. As Willem said, this patch seemed like
> >>> the least disruptive way to allow us to continue down the road of
> >>> "universal" NAPI Tx and to hopefully get data across enough workloads
> >>> (with VMs small, large, and absurdly large :) to present a compelling
> >>> argument in one direction or another. As far as I know there aren't
> >>> currently any NAPI related ethtool commands (based on a quick perusal
> >>> of ethtool.h).
> >> As I suggest before, maybe we can (ab)use tx-frames-irq.
> > I forgot to respond to this originally, but I agree.
> >
> > How about something like the snippet below? It would be simpler to
> > reason about if we only allowed switching while the device is down,
> > but napi does not strictly require that.
> >
> > +static int virtnet_set_coalesce(struct net_device *dev,
> > +                                struct ethtool_coalesce *ec)
> > +{
> > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > +       const struct ethtool_coalesce ec_default = {
> > +               .cmd = ETHTOOL_SCOALESCE,
> > +               .rx_max_coalesced_frames = 1,
> > +               .tx_max_coalesced_frames = 1,
> > +       };
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +       int napi_weight = 0;
> > +       bool running;
> > +       int i;
> > +
> > +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> > +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> > +               napi_weight = NAPI_POLL_WEIGHT;
> > +       }
> > +
> > +       /* disallow changes to fields not explicitly tested above */
> > +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
> > +               return -EINVAL;
> > +
> > +       if (napi_weight ^ vi->sq[0].napi.weight) {
> > +               running = netif_running(vi->dev);
> > +
> > +               for (i = 0; i < vi->max_queue_pairs; i++) {
> > +                       vi->sq[i].napi.weight = napi_weight;
> > +
> > +                       if (!running)
> > +                               continue;
> > +
> > +                       if (napi_weight)
> > +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> > +                                                      &vi->sq[i].napi);
> > +                       else
> > +                               napi_disable(&vi->sq[i].napi);
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int virtnet_get_coalesce(struct net_device *dev,
> > +                               struct ethtool_coalesce *ec)
> > +{
> > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > +       const struct ethtool_coalesce ec_default = {
> > +               .cmd = ETHTOOL_GCOALESCE,
> > +               .rx_max_coalesced_frames = 1,
> > +               .tx_max_coalesced_frames = 1,
> > +       };
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +
> > +       memcpy(ec, &ec_default, sizeof(ec_default));
> > +
> > +       if (vi->sq[0].napi.weight)
> > +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> > +
> > +       return 0;
> > +}
>
> Looks good. Just one nit, maybe it's better simply check against zero?
I wanted to avoid making napi and interrupt moderation mutually
exclusive. If the virtio-net driver ever gets true moderation support,
it should be able to work alongside napi.
But I can make no-napi be 0 and napi be 1. That is future-proof, in
the sense that napi is enabled if there is any interrupt moderation.
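
For reference, a rough userspace sketch of how the proposed bit would
be toggled through the standard SIOCETHTOOL ioctl. The device name
"eth0" is a placeholder, and bit 16 is only the draft encoding from
the snippet above, not an accepted ABI:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ec;

	/* read the current coalescing settings */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		return 1;

	/* set bit 16 of tx-frames to request tx napi */
	ec.cmd = ETHTOOL_SCOALESCE;
	ec.tx_max_coalesced_frames |= 1 << 16;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		return 1;

	close(fd);
	return 0;
}

On the command line the equivalent would be something like
"ethtool -C eth0 tx-frames 65537" (0x10001: frames=1 with bit 16 set).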