Message-ID: <CAADnVQKjGykDYuCS=LQJ3g0brWACpMyaKjgQ9qm4szxHOLXV=A@mail.gmail.com>
Date: Wed, 5 Jan 2022 12:55:40 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Alexander Lobakin <alexandr.lobakin@...el.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
bpf <bpf@...r.kernel.org>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Network Development <netdev@...r.kernel.org>,
"Karlsson, Magnus" <magnus.karlsson@...el.com>,
Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH bpf-next v2 3/4] ice: xsk: improve AF_XDP ZC Tx and use
batching API
On Thu, Dec 30, 2021 at 8:09 AM Alexander Lobakin
<alexandr.lobakin@...el.com> wrote:
>
> From: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
> Date: Thu, 30 Dec 2021 14:13:10 +0100
>
> > On Wed, Dec 29, 2021 at 02:11:31PM +0100, Alexander Lobakin wrote:
> > > From: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
> > > Date: Thu, 16 Dec 2021 14:59:57 +0100
> > >
> > > > Mostly follow the logic from commit 9610bd988df9 ("ice: optimize
> > > > XDP_TX workloads"), which was done in order to address the massive
> > > > tx_busy statistic bump and to improve performance as well.
> > > >
> > > > Increase ICE_TX_THRESH to 64, as it seems to work out better for
> > > > both XDP and AF_XDP. Also, placing the stats structs on separate
> > > > cache lines seemed to improve the performance. The batching approach
> > > > is inspired by i40e's implementation, with adjustments to the
> > > > cleaning logic.
> > > >
> > > > One difference from 'xdpdrv' XDP_TX is that when the ring has fewer
> > > > than ICE_TX_THRESH free entries, the cleaning routine will not stop
> > > > after cleaning a single batch of ICE_TX_THRESH descriptors; instead,
> > > > it will advance the next_dd pointer, check the DD bit, and repeat
> > > > the cleaning while that bit is set. IOW, clean as long as there are
> > > > descriptors that can be cleaned.
> > > >
> > > > It takes three separate xdpsock instances in txonly mode to achieve the
> > > > line rate and this was not previously possible.
> > > >
> > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
> > > > ---
> > > > drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
> > > > drivers/net/ethernet/intel/ice/ice_txrx.h | 4 +-
> > > > drivers/net/ethernet/intel/ice/ice_xsk.c | 249 ++++++++++++++--------
> > > > drivers/net/ethernet/intel/ice/ice_xsk.h | 26 ++-
> > > > 4 files changed, 182 insertions(+), 99 deletions(-)
> > > >
> > >
> > > -- 8< --
> > >
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
> > > > index 4c7bd8e9dfc4..f2eb99063c1f 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
> > > > @@ -6,19 +6,36 @@
> > > > #include "ice_txrx.h"
> > > > #include "ice.h"
> > > >
> > > > +#define PKTS_PER_BATCH 8
> > > > +
> > > > +#ifdef __clang__
> > > > +#define loop_unrolled_for _Pragma("clang loop unroll_count(8)") for
> > > > +#elif __GNUC__ >= 4
> > > > +#define loop_unrolled_for _Pragma("GCC unroll 8") for
> > > > +#else
> > > > +#define loop_unrolled_for for
> > > > +#endif
> > >
> > > It's used in a bunch more places across the tree; what about
> > > defining it in linux/compiler{,_clang,_gcc}.h?
> > > Is it possible to pass '8' as an argument? Like
> >
> > Like where, besides i40e? I might currently suck at grepping; let's
> > blame the Christmas break for that.
>
> Ah okay, I confused it with a work around this pragma here: [0]
>
> >
> > If there are actually other callsites besides i40e then this is a good
> > idea to me, maybe as a follow-up?
>
> I think there are more potential call sites for that to come; I'd
> make linux/unroll.h in the future, I guess. But not as part of this
> series, right.
Please don't, since the loop unroll pragma is only a hint.
The compilers don't have to actually do the unroll.
Both gcc and clang try to do it when it looks ok-ish from the
compiler's perspective, but they don't have to.
Try large unroll values and check the generated code.
Ideally, add compiler debug flags so the compiler can tell you what
it's actually doing.
It's hard to figure out the loop unroll factor by looking at the
assembly.