Message-ID: <20140826084455.28dd4058@redhat.com>
Date: Tue, 26 Aug 2014 08:44:55 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Alexander Duyck <alexander.h.duyck@...el.com>
Cc: Daniel Borkmann <dborkman@...hat.com>, davem@...emloft.net,
netdev@...r.kernel.org, brouer@...hat.com
Subject: Re: [RFC PATCH net-next 1/3] ixgbe: support
netdev_ops->ndo_xmit_flush()
On Mon, 25 Aug 2014 15:51:50 -0700
Alexander Duyck <alexander.h.duyck@...el.com> wrote:
> On 08/25/2014 05:07 AM, Jesper Dangaard Brouer wrote:
> > On Sun, 24 Aug 2014 15:42:16 +0200
> > Daniel Borkmann <dborkman@...hat.com> wrote:
> >
> >> This implements the deferred tail pointer flush API for the ixgbe
> >> driver. A similar version was also proposed some time ago by Alexander Duyck.
> >
> > I've run some benchmarks with this patch only, which actually shows a
> > performance regression.
> >
> > Using trafgen with QDISC_BYPASS and mmap mode, via cmdline:
> > trafgen --cpp --dev eth5 --conf udp_example01.trafgen -V --cpus 1
> >
> > BASELINE(no-patch): trafgen QDISC_BYPASS and mmap:
> > - tx:1562539 pps
> >
> > (This patch only): ixgbe use of .ndo_xmit_flush.
> > - tx:1532299 pps
> >
> > Regression: -30240 pps
> > * In nanosec: (1/1562539*10^9)-(1/1532299*10^9) = -12.63 ns
> >
> >
> > As DaveM points out, we might not need the mmiowb().
> > Result when not performing the mmiowb():
> > - tx:1548352 pps
> >
> > Still a small regression: -14187 pps
> > * In nanosec: (1/1562539*10^9)-(1/1548352*10^9) = -5.86 ns
> >
> >
> > I was not expecting this "slowdown" from such a simple use of the
> > new ndo_xmit_flush API. Can anyone explain why this is happening?
>
> One possibility is that we are now doing less work between the time we
> write the tail register and the time we grab the qdisc lock (locked
> transactions are stalled by pending MMIO writes), so we end up spending
> more time stuck waiting for the write to complete while doing nothing.
In this testcase we are bypassing the qdisc code path, but still taking
the HARD_TX_LOCK. I was only expecting a cost in the area of -2 ns, due
to the extra function call overhead.
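
For reference, my mental model of the per-packet cost is the core
wrapper from DaveM's RFC, roughly like this (sketch from memory; the
exact helper name and completion check may differ):

	static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb,
						    struct net_device *dev)
	{
		const struct net_device_ops *ops = dev->netdev_ops;
		u16 queue = skb_get_queue_mapping(skb); /* skb may be freed by xmit */
		netdev_tx_t rc;

		rc = ops->ndo_start_xmit(skb, dev);
		/* The deferred doorbell: an extra load, branch and indirect
		 * call on every packet, even when nothing is batched. */
		if (rc == NETDEV_TX_OK && ops->ndo_xmit_flush)
			ops->ndo_xmit_flush(dev, queue);

		return rc;
	}

That alone should account for the ~2 ns I expected, but not the ~6 ns
measured even without the mmiowb().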
But once we include the qdisc code path, the performance regression
gets even worse. I would like an explanation for that as well, see:
http://thread.gmane.org/gmane.linux.network/327254/focus=327431
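
On the driver side, the ixgbe hook under test should boil down to the
tail write plus the barrier, roughly (reconstructed sketch; the struct
and field names are assumed, not copied from the patch):

	static void ixgbe_xmit_flush(struct net_device *netdev, u16 queue)
	{
		struct ixgbe_adapter *adapter = netdev_priv(netdev);
		struct ixgbe_ring *tx_ring = adapter->tx_ring[queue];

		/* Publish the new tail; the NIC starts DMA from here. */
		writel(tx_ring->next_to_use, tx_ring->tail);
		/* Dropping this mmiowb() recovered ~6.8 ns of the
		 * ~12.6 ns regression in the numbers above. */
		mmiowb();
	}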
> Then of course there are always the funny oddball quirks such as the
> code changes might have changed the alignment of a loop resulting in Tx
> cleanup more expensive than it was before.
Yes, this is when it gets hairy!
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer