[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF=yD-KPJukmNmNBugONfPZGHtauuDSEsOFCz1_AysZApKnCxw@mail.gmail.com>
Date: Sun, 9 Sep 2018 19:07:56 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Jason Wang <jasowang@...hat.com>
Cc: "Jon Olson (Google Drive)" <jonolson@...gle.com>,
"Michael S. Tsirkin" <mst@...hat.com>, caleb.raitto@...il.com,
David Miller <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>,
Caleb Raitto <caraitto@...gle.com>
Subject: Re: [PATCH net-next] virtio_net: force_napi_tx module param.
On Wed, Aug 29, 2018 at 9:01 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasowang@...hat.com> wrote:
> >
> >
> >
> > On 2018年08月29日 03:57, Willem de Bruijn wrote:
> > > On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@...hat.com> wrote:
> > >>
> > >>
> > >> On 2018年07月25日 08:17, Jon Olson wrote:
> > >>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> > >>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > >>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> > >>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > >>>>>>> >From the above linked patch, I understand that there are yet
> > >>>>>>> other special cases in production, such as a hard cap on #tx queues to
> > >>>>>>> 32 regardless of number of vcpus.
> > >>>>>> I don't think upstream kernels have this limit - we can
> > >>>>>> now use vmalloc for higher number of queues.
> > >>>>> Yes. that patch* mentioned it as a google compute engine imposed
> > >>>>> limit. It is exactly such cloud provider imposed rules that I'm
> > >>>>> concerned about working around in upstream drivers.
> > >>>>>
> > >>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> > >>>> Yea. Why does GCE do it btw?
> > >>> There are a few reasons for the limit, some historical, some current.
> > >>>
> > >>> Historically we did this because of a kernel limit on the number of
> > >>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > >>> the limit upstream at the time we did it was actually eight. We had
> > >>> increased the limit from eight to 32 internally, and it appears in
> > >>> upstream it has subsequently increased upstream to 256. We no longer
> > >>> use TAP for networking, so that constraint no longer applies for us,
> > >>> but when looking at removing/raising the limit we discovered no
> > >>> workloads that clearly benefited from lifting it, and it also placed
> > >>> more pressure on our virtual networking stack particularly on the Tx
> > >>> side. We left it as-is.
> > >>>
> > >>> In terms of current reasons there are really two. One is memory usage.
> > >>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> > >>> that the guest will have an Rx queue for every Tx queue. We run our
> > >>> individual virtqueues fairly deep (4096 entries) to give guests a wide
> > >>> time window for re-posting Rx buffers and avoiding starvation on
> > >>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> > >>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> > >>> be up to 512MB of memory posted for network buffers. Scaling this to
> > >>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> > >>> all of the Rx rings full would (in the large average Rx packet size
> > >>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> > >>> of RAM available, but I don't believe we've observed a situation where
> > >>> they would have benefited from having 2.5 gigs of buffers posted for
> > >>> incoming network traffic :)
> > >> We can work to have async txq and rxq instead of paris if there's a
> > >> strong requirement.
> > >>
> > >>> The second reason is interrupt related -- as I mentioned above, we
> > >>> have found no workloads that clearly benefit from so many queues, but
> > >>> we have found workloads that degrade. In particular workloads that do
> > >>> a lot of small packet processing but which aren't extremely latency
> > >>> sensitive can achieve higher PPS by taking fewer interrupt across
> > >>> fewer VCPUs due to better batching (this also incurs higher latency,
> > >>> but at the limit the "busy" cores end up suppressing most interrupts
> > >>> and spending most of their cycles farming out work). Memcache is a
> > >>> good example here, particularly if the latency targets for request
> > >>> completion are in the ~milliseconds range (rather than the
> > >>> microseconds we typically strive for with TCP_RR-style workloads).
> > >>>
> > >>> All of that said, we haven't been forthcoming with data (and
> > >>> unfortunately I don't have it handy in a useful form, otherwise I'd
> > >>> simply post it here), so I understand the hesitation to simply run
> > >>> with napi_tx across the board. As Willem said, this patch seemed like
> > >>> the least disruptive way to allow us to continue down the road of
> > >>> "universal" NAPI Tx and to hopefully get data across enough workloads
> > >>> (with VMs small, large, and absurdly large :) to present a compelling
> > >>> argument in one direction or another. As far as I know there aren't
> > >>> currently any NAPI related ethtool commands (based on a quick perusal
> > >>> of ethtool.h)
> > >> As I suggest before, maybe we can (ab)use tx-frames-irq.
> > > I forgot to respond to this originally, but I agree.
> > >
> > > How about something like the snippet below. It would be simpler to
> > > reason about if only allow switching while the device is down, but
> > > napi does not strictly require that.
> > >
> > > +static int virtnet_set_coalesce(struct net_device *dev,
> > > + struct ethtool_coalesce *ec)
> > > +{
> > > + const u32 tx_coalesce_napi_mask = (1 << 16);
> > > + const struct ethtool_coalesce ec_default = {
> > > + .cmd = ETHTOOL_SCOALESCE,
> > > + .rx_max_coalesced_frames = 1,
> > > + .tx_max_coalesced_frames = 1,
> > > + };
> > > + struct virtnet_info *vi = netdev_priv(dev);
> > > + int napi_weight = 0;
> > > + bool running;
> > > + int i;
> > > +
> > > + if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> > > + ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> > > + napi_weight = NAPI_POLL_WEIGHT;
> > > + }
> > > +
> > > + /* disallow changes to fields not explicitly tested above */
> > > + if (memcmp(ec, &ec_default, sizeof(ec_default)))
> > > + return -EINVAL;
> > > +
> > > + if (napi_weight ^ vi->sq[0].napi.weight) {
> > > + running = netif_running(vi->dev);
> > > +
> > > + for (i = 0; i < vi->max_queue_pairs; i++) {
> > > + vi->sq[i].napi.weight = napi_weight;
> > > +
> > > + if (!running)
> > > + continue;
> > > +
> > > + if (napi_weight)
> > > + virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> > > + &vi->sq[i].napi);
> > > + else
> > > + napi_disable(&vi->sq[i].napi);
> > > + }
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int virtnet_get_coalesce(struct net_device *dev,
> > > + struct ethtool_coalesce *ec)
> > > +{
> > > + const u32 tx_coalesce_napi_mask = (1 << 16);
> > > + const struct ethtool_coalesce ec_default = {
> > > + .cmd = ETHTOOL_GCOALESCE,
> > > + .rx_max_coalesced_frames = 1,
> > > + .tx_max_coalesced_frames = 1,
> > > + };
> > > + struct virtnet_info *vi = netdev_priv(dev);
> > > +
> > > + memcpy(ec, &ec_default, sizeof(ec_default));
> > > +
> > > + if (vi->sq[0].napi.weight)
> > > + ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> > > +
> > > + return 0;
> > > +}
> >
> > Looks good. Just one nit, maybe it's better simply check against zero?
>
> I wanted to avoid making napi and interrupt moderation mutually
> exclusive. If the virtio-net driver ever gets true moderation support,
> it should be able to work alongside napi.
>
> But I can make no-napi be 0 and napi be 1. That is future proof, in
> the sense that napi is enabled if there is any interrupt moderation.
It's not appearing on patchwork yet, but I just sent a patch.
I implemented the above, but .tx_frames of 0 is technically incorrect
and it would unnecessarily constrain interrupt moderation to one of two
modes. I went back to using a high bit. That said, if you feel strongly
I'll change it.
I also tried various ways of switching between napi and non napi
mode without bringing the device down. This is quite fragile. At the
very least napi.weight has to be updated without any interrupt or
napi callback happening in between. So most of the datapath needs
to be quiesced.
I did code up a variant that manually stops all the queues, masks the
interrupt and waits for napi to complete if scheduled. But in a stress
test it still managed to trigger a BUG in napi_enable on this state.
Napi is not switched at runtime in other devices, nor really needed. So
instead I made this change conditional on the device being down.
Powered by blists - more mailing lists