[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 30 Nov 2016 07:44:21 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Saeed Mahameed <saeedm@....mellanox.co.il>
Cc:     Jesper Dangaard Brouer <brouer@...hat.com>,
        Rick Jones <rick.jones2@....com>,
        Linux Netdev List <netdev@...r.kernel.org>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Tariq Toukan <tariqt@...lanox.com>
Subject: Re: [WIP] net+mlx4: auto doorbell

On Wed, 2016-11-30 at 15:50 +0200, Saeed Mahameed wrote:
> On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> > On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
> >
> >
> >> Not sure if this has been tried before, but the doorbell avoidance could
> >> be done by the driver itself, because it knows a TX completion will come
> >> shortly (well... if softirqs are not delayed too much !)
> >>
> >> Doorbell would be forced only if :
> >>
> >> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
> >> OR
> >> ( too many [1] packets were put in TX ring buffer, no point deferring
> >> more)
> >>
> >> Start the pump, but once it is started, let the doorbells be done by
> >> TX completion.
> >>
> >> ndo_start_xmit and TX completion handler would have to maintain a shared
> >> state describing if packets were ready but doorbell deferred.
> >>
> >>
> >> Note that TX completion means "if at least one packet was drained",
> >> otherwise busy polling, constantly calling napi->poll() would force a
> >> doorbell too soon for devices sharing a NAPI for both RX and TX.
> >>
> >> But then, maybe busy poll would like to force a doorbell...
> >>
> >> I could try these ideas on mlx4 shortly.
> >>
> >>
> >> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
> >
> > I have a WIP that increases pktgen rate by 75% on mlx4 when bulking is
> > not used.
> 
> Hi Eric, nice idea indeed, and we need something like this;
> today we almost don't exploit TX bulking at all.
> 
> But please see below: I am not sure different contexts should share
> the doorbell ringing; it is really risky.
> 
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
> >  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
> >  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
> >  include/linux/netdevice.h                    |    1
> >  net/core/net-sysfs.c                         |   18 +++
> >  5 files changed, 83 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 6562f78b07f4..fbea83218fc0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >
> >         if (polled) {
> >                 if (doorbell_pending)
> > -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> > +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
> >
> >                 mlx4_cq_set_ci(&cq->mcq);
> >                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > index 4b597dca5c52..affebb435679 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> > @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
> >         ring->size = size;
> >         ring->size_mask = size - 1;
> >         ring->sp_stride = stride;
> > -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> > +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
> >
> >         tmp = size * sizeof(struct mlx4_en_tx_info);
> >         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> > @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
> >         ring->sp_cqn = cq;
> >         ring->prod = 0;
> >         ring->cons = 0xffffffff;
> > +       ring->ncons = 0;
> >         ring->last_nr_txbb = 1;
> >         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
> >         memset(ring->buf, 0, ring->buf_size);
> > @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
> >                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
> >  }
> >
> > -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> > +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
> >  {
> > -       return ring->prod - ring->cons > ring->full_size;
> > +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
> >  }
> >
> >  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> > @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >
> >         /* Skip last polled descriptor */
> >         ring->cons += ring->last_nr_txbb;
> > +       ring->ncons += ring->last_nr_txbb;
> >         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
> >                  ring->cons, ring->prod);
> >
> > @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >                                                 !!(ring->cons & ring->size), 0,
> >                                                 0 /* Non-NAPI caller */);
> >                 ring->cons += ring->last_nr_txbb;
> > +               ring->ncons += ring->last_nr_txbb;
> >                 cnt++;
> >         }
> >
> > @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
> >         return cnt;
> >  }
> >
> > +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> > +                          struct mlx4_en_tx_ring *ring)
> > +{
> > +
> > +       if (dev->doorbell_opt & 1) {
> > +               u32 oval = READ_ONCE(ring->prod_bell);
> > +               u32 nval = READ_ONCE(ring->prod);
> > +
> > +               if (oval == nval)
> > +                       return;
> > +
> > +               /* I can not tell yet if a cmpxchg() is needed or not */
> > +               if (dev->doorbell_opt & 2)
> > +                       WRITE_ONCE(ring->prod_bell, nval);
> > +               else
> > +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> > +                               return;
> > +       }
> > +       /* Since there is no iowrite*_native() that writes the
> > +        * value as is, without byteswapping - using the one
> > +        * the doesn't do byteswapping in the relevant arch
> > +        * endianness.
> > +        */
> > +#if defined(__LITTLE_ENDIAN)
> > +       iowrite32(
> > +#else
> > +       iowrite32be(
> > +#endif
> > +                 ring->doorbell_qpn,
> > +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > +}
> > +
> >  static bool mlx4_en_process_tx_cq(struct net_device *dev,
> >                                   struct mlx4_en_cq *cq, int napi_budget)
> >  {
> > @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> >         wmb();
> >
> >         /* we want to dirty this cache line once */
> > -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> > -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> > +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> > +       ring_cons += txbbs_skipped;
> > +       WRITE_ONCE(ring->cons, ring_cons);
> > +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> > +
> > +       if (dev->doorbell_opt)
> > +               mlx4_en_xmit_doorbell(dev, ring);
> >
> >         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
> >                 return done < budget;
> > @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
> >         __iowrite64_copy(dst, src, bytecnt / 8);
> >  }
> >
> > -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> > -{
> > -       wmb();
> 
> you missed/removed this "wmb()" in the new function. Why? Where did
> you compensate for it?

I removed it because the cmpxchg() there provides the barrier, if one is
needed.

My patch is a WIP; you can set bit 2 to replace the cmpxchg() with a
simple write, only for performance testing/comparisons.


> 
> > -       /* Since there is no iowrite*_native() that writes the
> > -        * value as is, without byteswapping - using the one
> > -        * the doesn't do byteswapping in the relevant arch
> > -        * endianness.
> > -        */
> > -#if defined(__LITTLE_ENDIAN)
> > -       iowrite32(
> > -#else
> > -       iowrite32be(
> > -#endif
> > -                 ring->doorbell_qpn,
> > -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> > -}
> >
> >  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >                                   struct mlx4_en_tx_desc *tx_desc,
> >                                   union mlx4_wqe_qpn_vlan qpn_vlan,
> >                                   int desc_size, int bf_index,
> >                                   __be32 op_own, bool bf_ok,
> > -                                 bool send_doorbell)
> > +                                 bool send_doorbell,
> > +                                 const struct net_device *dev, int nr_txbb)
> >  {
> >         tx_desc->ctrl.qpn_vlan = qpn_vlan;
> >
> > @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >
> >                 wmb();
> >
> > +               ring->prod += nr_txbb;
> >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
> >                              desc_size);
> >
> > @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
> >                  */
> >                 dma_wmb();
> >                 tx_desc->ctrl.owner_opcode = op_own;
> > +               ring->prod += nr_txbb;
> >                 if (send_doorbell)
> > -                       mlx4_en_xmit_doorbell(ring);
> > +                       mlx4_en_xmit_doorbell(dev, ring);
> >                 else
> >                         ring->xmit_more++;
> >         }
> > @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
> >         }
> >
> > -       ring->prod += nr_txbb;
> > -
> >         /* If we used a bounce buffer then copy descriptor back into place */
> >         if (unlikely(bounce))
> >                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> > @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
> >         }
> >         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> >
> > +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> > +        * will happen shortly.
> > +        */
> > +       if (send_doorbell &&
> > +           dev->doorbell_opt &&
> > +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> 
> Alexei already expressed his worries about synchronization, and I
> think the problem sits here (in this exact line):
> what if, exactly at this point, the TX completion handler has just
> finished and rung the last doorbell?
> You haven't written the new TX descriptor yet (mlx4_en_tx_write_desc),
> so the last doorbell from the CQ handler missed it.
> Even if you wrote the TX descriptor before the doorbell decision here,
> you would need a memory barrier to make sure the HW sees
> the new packet, which is typically done before ringing the doorbell.


My patch is a WIP, meaning it is not completed ;)

Surely we can find a non-racy way to handle this.


> All in all, this is risky business :). The right way to go is to
> force the upper layer to use xmit_more and delay doorbells/use bulking,
> but from the same context
> (the xmit routine). For example, see Achiad's suggestion (attached in
> Jesper's response): he used stop-queue to force the stack to queue up
> packets (TX bulking),
> which would set xmit_more and use the next completion to release
> the "stopped" ring TXQ rather than hit the doorbell on its behalf.



Well, you depend on having a higher-level queue like a qdisc.

Some users do not use a qdisc.
If you stop the queue, they can no longer send anything -> drops.


