Message-ID: <20210429144910.27aebab2@carbon>
Date: Thu, 29 Apr 2021 14:49:10 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Magnus Karlsson <magnus.karlsson@...il.com>
Cc: Lorenzo Bianconi <lorenzo@...nel.org>, bpf <bpf@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>,
Lorenzo Bianconi <lorenzo.bianconi@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, shayagr@...zon.com,
sameehj@...zon.com, John Fastabend <john.fastabend@...il.com>,
David Ahern <dsahern@...nel.org>,
Eelco Chaudron <echaudro@...hat.com>,
Jason Wang <jasowang@...hat.com>,
Alexander Duyck <alexander.duyck@...il.com>,
Saeed Mahameed <saeed@...nel.org>,
"Fijalkowski, Maciej" <maciej.fijalkowski@...el.com>,
Tirthendu <tirthendu.sarkar@...el.com>, brouer@...hat.com
Subject: Re: [PATCH v8 bpf-next 00/14] mvneta: introduce XDP multi-buffer
support
On Wed, 28 Apr 2021 09:41:52 +0200
Magnus Karlsson <magnus.karlsson@...il.com> wrote:
> On Tue, Apr 27, 2021 at 8:28 PM Lorenzo Bianconi <lorenzo@...nel.org> wrote:
> >
> > [...]
> >
> > > I took your patches for a test run with the AF_XDP sample xdpsock on an
> > > i40e card, and the throughput degradation is between 2 and 6% depending
> > > on the setup and which microbenchmark within xdpsock is executed. And
> > > this is without sending any multi-frame packets, just single-frame ones.
> > > Tirtha made changes to the i40e driver to support this new interface,
> > > so that is included in the measurements.
> > >
> > > What performance do you see with the mvneta card? How much are we
> > > willing to pay for this feature when it is not being used, or can we
> > > in some way selectively turn it on only when needed?
> >
> > Hi Magnus,
> >
> > Today I carried out some comparison tests between bpf-next and bpf-next +
> > the xdp_multibuff series on mvneta, running the xdp_rxq_info sample. The
> > results are basically in line:
> >
> > bpf-next:
> > - xdp drop ~ 665Kpps
> > - xdp_tx ~ 291Kpps
> > - xdp_pass ~ 118Kpps
> >
> > bpf-next + xdp_multibuff:
> > - xdp drop ~ 672Kpps
> > - xdp_tx ~ 288Kpps
> > - xdp_pass ~ 118Kpps
> >
> > I am not sure whether the results are affected by the low-power CPU; I will
> > run some tests on an ixgbe card.
>
> Thanks Lorenzo. I made some new runs, this time with i40e driver
> changes as a new data point. Same baseline as before but with patches
> [1] and [2] applied. Note
> that if you use net or net-next and i40e, you need patch [3] too.
>
> The i40e multi-buffer support will be posted on the mailing list as a
> separate RFC patch so you can reproduce and review.
>
> Note, the calculations are performed on non-truncated numbers, so a
> reported 2 ns might correspond to 5 cycles on my 2.1 GHz machine, since
> 2.49 ns * 2.1 GHz = 5.229 cycles ~ 5 cycles. xdpsock is run in zero-copy
> mode, so it uses the zero-copy driver data path, in contrast to
> xdp_rxq_info, which uses the regular driver data path. I only ran the
> busy-poll 1-core case this time. Reported numbers are the average over
> 3 runs.
Yes, for i40e the xdpsock zero-copy test uses another code path; this
is something we need to keep in mind.
Also remember that we designed the central xdp_do_redirect() call to
delay creation of the xdp_frame. This is something that AF_XDP ZC takes
advantage of.
Thus, the cost of the xdp_buff to xdp_frame conversion is not covered
in the tests below, and I expect this patchset to increase that cost...
(UPDATE: the XDP_TX tests below actually do the xdp_frame conversion.)
> multi-buffer patches without any driver changes:
Thank you *SO* much, Magnus, for these superb tests. I absolutely love
how comprehensive your test results are. Thank you for catching the
performance regression in this patchset. (I for one know how
time-consuming these kinds of tests are; I appreciate your effort, a lot!)
> xdpsock rxdrop 1-core:
> i40e: -4.5% in throughput / +3 ns / +6 cycles
> ice: -1.5% / +1 ns / +2 cycles
>
> xdp_rxq_info -a XDP_DROP
> i40e: -2.5% / +2 ns / +3 cycles
> ice: +6% / -3 ns / -7 cycles
>
> xdp_rxq_info -a XDP_TX
> i40e: -10% / +15 ns / +32 cycles
> ice: -9% / +14 ns / +29 cycles
This is a clear performance regression.
Looking closer at the driver, i40e_xmit_xdp_tx_ring() actually performs
an xdp_frame conversion by calling xdp_convert_buff_to_frame(xdp).
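To make that cost concrete, here is a simplified userspace model (my own
sketch with made-up "mini_" structs, not the actual kernel or i40e code):
on the XDP_TX path the frame metadata gets written into the headroom of
the packet buffer itself, while an AF_XDP zero-copy redirect can hand
the xdp_buff along without that step.

/*
 * Simplified model, NOT the kernel structs: converting a "buff" to a
 * "frame" means writing the frame metadata into the buffer's own
 * headroom, i.e. a handful of stores per packet.
 */
#include <stdint.h>
#include <stdio.h>

struct mini_xdp_buff {
	void *data_hard_start;	/* start of headroom */
	void *data;		/* start of packet payload */
	void *data_end;		/* end of packet payload */
};

struct mini_xdp_frame {
	void *data;
	uint16_t len;
	uint16_t headroom;
};

/* XDP_TX-style path: materialize the frame inside the headroom. */
static struct mini_xdp_frame *
mini_convert_buff_to_frame(struct mini_xdp_buff *xdp)
{
	struct mini_xdp_frame *frame = xdp->data_hard_start;

	frame->data = xdp->data;
	frame->len = (uint16_t)((char *)xdp->data_end - (char *)xdp->data);
	frame->headroom = (uint16_t)((char *)xdp->data -
				     (char *)xdp->data_hard_start);
	return frame;
}

/* AF_XDP ZC-style path: no conversion, the buff/descriptor is used as-is. */
static void mini_zc_redirect(struct mini_xdp_buff *xdp)
{
	(void)xdp;
}

int main(void)
{
	char buffer[2048];
	struct mini_xdp_buff xdp = {
		.data_hard_start = buffer,
		.data = buffer + 256,		/* headroom */
		.data_end = buffer + 256 + 64,	/* 64 byte packet */
	};
	struct mini_xdp_frame *frame;

	mini_zc_redirect(&xdp);			  /* ZC: conversion skipped */
	frame = mini_convert_buff_to_frame(&xdp); /* XDP_TX: conversion paid */
	printf("frame len=%u headroom=%u\n",
	       (unsigned)frame->len, (unsigned)frame->headroom);
	return 0;
}

The real xdp_convert_buff_to_frame() does more than this, of course, but
the point stands: it is a per-packet cost that the ZC redirect path avoids.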
FYI: we have started an off-list thread, and a discussion on IRC with
Lorenzo, on finding the root cause. The current lead is that, as Alexei
so wisely pointed out on earlier patches, struct bit-field access is not
efficient...
As I expect we will soon need bits for HW RX checksum indication, and
for indicating whether the metadata contains a BTF-described area, I've
asked Lorenzo to consider this and look into introducing a flags member.
(Then we just have to figure out how to make the flags access efficient.)
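To illustrate what I mean (a hypothetical sketch; the field and flag
names are made up and not from this patchset): with separate single-bit
bit-fields the compiler usually emits a read-modify-write per field
update, while a single flags word lets us set several indications with
one OR, test them with one mask, and leaves room for future bits.

/*
 * Hypothetical sketch only; names are invented for illustration and do
 * not come from the multi-buffer patchset.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct hdr_with_bitfields {
	uint32_t mb:1;		/* multi-buffer indication */
	uint32_t csum_ok:1;	/* HW RX checksum indication */
	uint32_t frame_sz:30;
};

#define MINI_FLAG_MB		(1U << 0)
#define MINI_FLAG_CSUM_OK	(1U << 1)

struct hdr_with_flags {
	uint32_t flags;		/* one word, plain bitmask operations */
	uint32_t frame_sz;
};

int main(void)
{
	struct hdr_with_bitfields a = { .frame_sz = 2048 };
	struct hdr_with_flags b = { .frame_sz = 2048 };

	/* Bit-fields: each assignment is typically a load/mask/or/store. */
	a.mb = 1;
	a.csum_ok = 1;

	/* Flags word: several indications set with a single OR ... */
	b.flags |= MINI_FLAG_MB | MINI_FLAG_CSUM_OK;

	/* ... and tested with a single mask. */
	bool multi_buf = b.flags & MINI_FLAG_MB;

	printf("bitfield mb=%u, flags mb=%d\n", (unsigned)a.mb, multi_buf);
	return 0;
}

Whether the compiler can combine adjacent bit-field updates depends on
the surrounding code, so treat this as a sketch of the concern, not a
measurement.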
> multi-buffer patches + i40e driver changes from Tirtha:
>
> xdpsock rxdrop 1-core:
> i40e: -3% / +2 ns / +3 cycles
>
> xdp_rxq_info -a XDP_DROP
> i40e: -7.5% / +5 ns / +9 cycles
>
> xdp_rxq_info -a XDP_TX
> i40e: -10% / +15 ns / +32 cycles
>
> It would be great if someone could rerun a similar set of experiments
> on i40e or ice and then report back.
> [1] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210419/024106.html
> [2] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210426/024135.html
> [3] https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210426/024129.html
I'm very happy that you/we are all paying attention to keeping XDP
performance intact, as small 'paper cuts' like +32 cycles do affect
XDP in the long run. Happy performance testing, everybody :-)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer