Message-ID: <CALzJLG-OcTAKkRQQHXV=8N-m+yATaX76UpGE4eWgK_8N52YePg@mail.gmail.com>
Date: Sun, 26 Feb 2017 18:32:16 +0200
From: Saeed Mahameed <saeedm@....mellanox.co.il>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Tariq Toukan <tariqt@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>
Subject: Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> While playing with hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but it hurts RPC
> workloads.
>
> I could track this down to the cx-3 (mlx4 driver) card that was
> apparently sending an IRQ before we would arm it.
>
Hi Eric,

This is highly unlikely; the hardware should not do that, and if this
is really the case, we need to hunt down the root cause rather than
work around it.
> A NAPI driver normally arms the IRQ after napi_complete_done(), i.e.
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can
> grab it.
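[ For readers following along: the generic pattern Eric describes looks
roughly like the sketch below. This is not the mlx4 code; my_poll(),
my_process_rx() and my_arm_irq() are made-up placeholder names. ]

static int my_poll(struct napi_struct *napi, int budget)
{
	int done = my_process_rx(napi, budget);	/* consume up to budget */

	/* Completing NAPI clears NAPI_STATE_SCHED; only after that is it
	 * safe to re-arm the device IRQ, so that the hard irq handler
	 * can grab NAPI_STATE_SCHED again for the next event.
	 */
	if (done < budget && napi_complete_done(napi, done))
		my_arm_irq(napi);

	return done;
}

static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct napi_struct *napi = data;

	napi_schedule_irqoff(napi);	/* tries to grab NAPI_STATE_SCHED */
	return IRQ_HANDLED;
}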
>
> This patch adds a new rx_irq_missed field that is incremented every
> time the hard IRQ handler could not grab NAPI_STATE_SCHED.
>
> Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq_missed was
> incremented, and attempts to read more packets if it can re-acquire
> NAPI_STATE_SCHED.
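[ Condensed, the retry logic the patch adds amounts to roughly the
following; this is a simplified paraphrase of the diff below (it drops
the budget-exhausted/affinity handling and the post-completion
recheck), not separate code. ]

	int done = 0;
	u32 rx_irq_missed;

again:
	rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
	done += mlx4_en_process_rx_cq(dev, cq, budget - done);

	if (done < budget) {
		/* The hard IRQ handler ran while we were polling and could
		 * not grab NAPI_STATE_SCHED: poll once more rather than
		 * lose the completion event.
		 */
		if (READ_ONCE(cq->rx_irq_missed) != rx_irq_missed)
			goto again;

		if (napi_complete_done(napi, done))
			mlx4_en_arm_cq(priv, cq);
	}
	return done;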
Are you sure this is not some kind of race condition with the busy
polling mechanism introduced in ("net/mlx4_en: use
napi_complete_done() return value")? Maybe when the busy-polling
thread detects that it wants to yield, it arms the CQ too early (while
NAPI is not ready)?
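Something like the following ordering, purely hypothetical until we
confirm it:

/*
 *   busy-poll / yield path                  IRQ path
 *   ----------------------                  --------
 *   mlx4_en_arm_cq(priv, cq);
 *     (CQ armed while NAPI_STATE_SCHED
 *      is still set)
 *                                           completion arrives,
 *                                           mlx4_en_rx_irq() runs:
 *                                           napi_schedule_irqoff() cannot
 *                                           grab NAPI_STATE_SCHED, the
 *                                           event is consumed but nothing
 *                                           polls the ring -> stale packet
 */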
Anyway, Tariq and I would like to further investigate this IRQ firing
while the CQ is not armed. It smells like a bug at the driver/NAPI
level; it is not expected HW behavior.

Any pointers on how to reproduce? How often would the "rx_irq_missed"
counter advance under a line-rate RX load?
>
> Note that this workaround would probably not work if the IRQ is spread
> over many cpus, since it assumes the hard irq and softirq are handled by
> the same cpu. This kind of setup is buggy anyway because of reordering
> issues.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Tariq Toukan <tariqt@...lanox.com>
> Cc: Saeed Mahameed <saeedm@...lanox.com>
> ---
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 32 +++++++++++++----
> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 1
> 2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
> struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
> struct mlx4_en_priv *priv = netdev_priv(cq->dev);
>
> - if (likely(priv->port_up))
> - napi_schedule_irqoff(&cq->napi);
> - else
> + if (likely(priv->port_up)) {
> + if (napi_schedule_prep(&cq->napi))
> + __napi_schedule_irqoff(&cq->napi);
> + else
> + cq->rx_irq_missed++;
> + } else {
> mlx4_en_arm_cq(priv, cq);
> + }
> }
>
> /* Rx CQ polling - called by NAPI */
> @@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
> struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
> struct net_device *dev = cq->dev;
> struct mlx4_en_priv *priv = netdev_priv(dev);
> - int done;
> + int done = 0;
> + u32 rx_irq_missed;
>
> - done = mlx4_en_process_rx_cq(dev, cq, budget);
> +again:
> + rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
> + done += mlx4_en_process_rx_cq(dev, cq, budget - done);
>
> /* If we used up all the quota - we're probably not done yet... */
> if (done == budget) {
> @@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
> */
> if (done)
> done--;
> + if (napi_complete_done(napi, done))
> + mlx4_en_arm_cq(priv, cq);
> + return done;
> }
> - /* Done for now */
> - if (napi_complete_done(napi, done))
> + if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
> + goto again;
> +
> + if (napi_complete_done(napi, done)) {
> mlx4_en_arm_cq(priv, cq);
> +
> + /* We might have received an interrupt too soon */
> + if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
> + napi_reschedule(napi))
> + goto again;
> + }
> return done;
> }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -369,6 +369,7 @@ struct mlx4_en_cq {
> int ring;
> struct net_device *dev;
> struct napi_struct napi;
> + u32 rx_irq_missed;
> int size;
> int buf_size;
> int vector;
>
>