Message-ID: <CAKgT0UeEL3W42eDqSt97xnn3tXDtWMf4sdPByAtvbx=Z7Sx7hQ@mail.gmail.com>
Date: Tue, 23 Apr 2019 13:12:58 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Saeed Mahameed <saeedm@...lanox.com>,
"davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"jakub.kicinski@...ronome.com" <jakub.kicinski@...ronome.com>,
Tariq Toukan <tariqt@...lanox.com>, "bsd@...com" <bsd@...com>
Subject: Re: [net-next 01/14] net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES
On Tue, Apr 23, 2019 at 11:25 AM Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
>
> On Tue, 23 Apr 2019 10:27:32 -0700
> Alexander Duyck <alexander.duyck@...il.com> wrote:
>
> > On Tue, Apr 23, 2019 at 9:42 AM Saeed Mahameed <saeedm@...lanox.com> wrote:
> > >
> > > On Tue, 2019-04-23 at 08:21 -0700, Alexander Duyck wrote:
> > > > On Tue, Apr 23, 2019 at 6:23 AM Jesper Dangaard Brouer
> > > > <brouer@...hat.com> wrote:
> > > > > On Mon, 22 Apr 2019 19:46:47 -0700
> > > > > Jakub Kicinski <jakub.kicinski@...ronome.com> wrote:
> > > > >
> > > > > > On Mon, 22 Apr 2019 15:32:53 -0700, Saeed Mahameed wrote:
> > > > > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > > > > > > index 51e109fdeec1..6147be23a9b9 100644
> > > > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > > > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > > > > > > @@ -50,6 +50,7 @@
> > > > > > >  #include <net/xdp.h>
> > > > > > >  #include <linux/net_dim.h>
> > > > > > >  #include <linux/bits.h>
> > > > > > > +#include <linux/prefetch.h>
> > > > > > >  #include "wq.h"
> > > > > > >  #include "mlx5_core.h"
> > > > > > >  #include "en_stats.h"
> > > > > > > @@ -986,6 +987,22 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
> > > > > > >  	mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, cq->wq.cc);
> > > > > > >  }
> > > > > > >
> > > > > > > +static inline void mlx5e_prefetch(void *p)
> > > > > > > +{
> > > > > > > +	prefetch(p);
> > > > > > > +#if L1_CACHE_BYTES < 128
> > > > > > > +	prefetch(p + L1_CACHE_BYTES);
> > > > > > > +#endif
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline void mlx5e_prefetchw(void *p)
> > > > > > > +{
> > > > > > > +	prefetchw(p);
> > > > > > > +#if L1_CACHE_BYTES < 128
> > > > > > > +	prefetchw(p + L1_CACHE_BYTES);
> > > > > > > +#endif
> > > > > > > +}
> > > > > >
> > > > > > All Intel drivers do the exact same thing, perhaps it's time to
> > > > > > add a helper for this?
> > > > > >
> > > > > > net_prefetch_headers()
> > > > > >
> > > > > > or some such?
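Something like the below is what I'd picture for a shared helper. Just
a sketch of what the Intel drivers and the hunk above both do; the
net_prefetch_headers() name is only the suggestion from above, it
doesn't exist in the tree today:

#include <linux/cache.h>
#include <linux/prefetch.h>

/* Pull the packet headers into L1. Header processing touches more
 * than 64 bytes, so on machines with cache lines smaller than 128
 * bytes prefetch a second line as well. */
static inline void net_prefetch_headers(void *p)
{
	prefetch(p);
#if L1_CACHE_BYTES < 128
	prefetch(p + L1_CACHE_BYTES);
#endif
}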
> > > > >
> > > > > I wonder if Tariq measured any effect from doing this?
> > > > >
> > > > > Because Intel CPUs will usually already prefetch the next cache-
> > > > > line,
> > > > > as described in [1], you can even read (and modify) this MSR 0x1A4
> > > > > e.g. via tools in [2]. Maybe Intel guys added it before this was
> > > > > done
> > > > > in HW, and never cleaned it up?
> > > > >
> > > > > [1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
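Side note for anyone who wants to check their own box: below is a
quick userspace sketch (mine, not from the thread) that dumps MSR
0x1A4 for CPU 0 through the msr driver ("modprobe msr" first, run as
root). The bit meanings are my reading of the article in [1]; a set
bit means that prefetcher is disabled.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	/* The msr driver uses the file offset as the MSR address. */
	if (fd < 0 || pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
		perror("reading MSR 0x1a4");
		return 1;
	}
	printf("MSR 0x1a4 = 0x%04llx\n", (unsigned long long)val);
	printf("  L2 HW prefetcher:            %s\n", val & 0x1 ? "off" : "on");
	printf("  L2 adjacent line prefetcher: %s\n", val & 0x2 ? "off" : "on");
	printf("  DCU prefetcher:              %s\n", val & 0x4 ? "off" : "on");
	printf("  DCU IP prefetcher:           %s\n", val & 0x8 ? "off" : "on");
	close(fd);
	return 0;
}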
> > > >
> > > > The issue is that the adjacent cache line prefetcher can be on or
> > > > off, and a network driver shouldn't really be going through and
> > > > twiddling those sorts of bits. In some cases having it on can
> > > > result in more memory being consumed than is needed. The reason why
> > > > I enabled the additional cacheline prefetch for the Intel NICs is
> > > > that most TCP packets are at a minimum 68 bytes for just the
> > > > headers, so there was an advantage for TCP traffic in making
> > > > certain we prefetched at least enough to process the headers.
> > > >
> > >
> > > So if L2 adjacent cache line prefetcher bit is enabled then this
>
> Nitpick: is it the DCU prefetcher bit that "Fetches the next cache
> line into L1-D cache" in the link [1]?
I didn't think the DCU prefetcher kicked in unless you were
sequentially accessing data in a recently fetched cache line. So it
doesn't apply in this case. The definition of it from the x86 software
optimization manual says as much:
  Data cache unit (DCU) prefetcher. This prefetcher, also known as the
  streaming prefetcher, is triggered by an ascending access to very
  recently loaded data. The processor assumes that this access is part
  of a streaming algorithm and automatically fetches the next line.
So basically, unless we are doing sequential accesses the DCU
prefetcher won't kick in, and we aren't really sequential since we are
accessing headers, so things get a bit out of order. Based on all this
I would guess the two prefetches are needed on x86 if you want a full
TCP or IPv6 header pulled into the L1 cache, for instance.
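To put rough numbers on that (my arithmetic with the userspace header
structs, not anything from the patch), the common header stacks
already spill past one 64-byte line:

#include <stdio.h>
#include <net/ethernet.h>
#include <netinet/ip.h>
#include <netinet/ip6.h>
#include <netinet/tcp.h>

int main(void)
{
	/* Option-less headers; real TCP traffic usually adds 12 more
	 * bytes of options (timestamps) on top of these. */
	size_t v4 = sizeof(struct ether_header) + sizeof(struct ip) +
		    sizeof(struct tcphdr);	/* 14 + 20 + 20 = 54 */
	size_t v6 = sizeof(struct ether_header) + sizeof(struct ip6_hdr) +
		    sizeof(struct tcphdr);	/* 14 + 40 + 20 = 74 */

	printf("eth + ipv4 + tcp: %zu bytes\n", v4);
	printf("eth + ipv6 + tcp: %zu bytes\n", v6);
	return 0;
}

So with 64-byte cache lines even IPv4 + TCP with timestamps
(54 + 12 = 66) crosses into a second line, and IPv6 does so before
any options.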