Message-ID: <CALzJLG-OcTAKkRQQHXV=8N-m+yATaX76UpGE4eWgK_8N52YePg@mail.gmail.com>
Date: Sun, 26 Feb 2017 18:32:16 +0200
From: Saeed Mahameed <saeedm@....mellanox.co.il>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Tariq Toukan <tariqt@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>
Subject: Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> While playing with hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but it hurts RPC
> workloads.
>
> I could track this down to the cx-3 (mlx4 driver) card that was
> apparently sending an IRQ before we would arm it.
>
Hi Eric,

This is highly unlikely; the hardware should not do that, and if this
is really the case, we need to hunt down the root cause rather than
work around it.
> A NAPI driver normally arms the IRQ after napi_complete_done(), i.e.
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can
> grab it.
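[ For readers following along: the generic pattern Eric describes looks
roughly like the sketch below. This is not the mlx4 code; my_poll(),
my_process_rx() and my_arm_irq() are made-up placeholder names. ]

static int my_poll(struct napi_struct *napi, int budget)
{
	int done = my_process_rx(napi, budget);	/* consume up to budget */

	/* Completing NAPI clears NAPI_STATE_SCHED; only after that is it
	 * safe to re-arm the device IRQ, so that the hard irq handler
	 * can grab NAPI_STATE_SCHED again for the next event.
	 */
	if (done < budget && napi_complete_done(napi, done))
		my_arm_irq(napi);

	return done;
}

static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct napi_struct *napi = data;

	napi_schedule_irqoff(napi);	/* tries to grab NAPI_STATE_SCHED */
	return IRQ_HANDLED;
}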
>
> This patch adds a new rx_irq_missed field that is incremented every
> time the hard IRQ handler could not grab NAPI_STATE_SCHED.
>
> Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq_missed was
> incremented, and attempts to read more packets if it can re-acquire
> NAPI_STATE_SCHED.
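[ Condensed, the retry logic the patch adds amounts to roughly the
following; this is a simplified paraphrase of the diff below (it drops
the budget-exhausted/affinity handling and the post-completion
recheck), not separate code. ]

	int done = 0;
	u32 rx_irq_missed;

again:
	rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
	done += mlx4_en_process_rx_cq(dev, cq, budget - done);

	if (done < budget) {
		/* The hard IRQ handler ran while we were polling and could
		 * not grab NAPI_STATE_SCHED: poll once more rather than
		 * lose the completion event.
		 */
		if (READ_ONCE(cq->rx_irq_missed) != rx_irq_missed)
			goto again;

		if (napi_complete_done(napi, done))
			mlx4_en_arm_cq(priv, cq);
	}
	return done;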
Are you sure this is not some kind of race condition with the busy
polling mechanism introduced in ("net/mlx4_en: use
napi_complete_done() return value")? Maybe when the busy-polling
thread detects that it wants to yield, it arms the CQ too early (while
NAPI is not ready)?
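Something like the following ordering, purely hypothetical until we
confirm it:

/*
 *   busy-poll / yield path                  IRQ path
 *   ----------------------                  --------
 *   mlx4_en_arm_cq(priv, cq);
 *     (CQ armed while NAPI_STATE_SCHED
 *      is still set)
 *                                           completion arrives,
 *                                           mlx4_en_rx_irq() runs:
 *                                           napi_schedule_irqoff() cannot
 *                                           grab NAPI_STATE_SCHED, the
 *                                           event is consumed but nothing
 *                                           polls the ring -> stale packet
 */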
Anyway, Tariq and I would like to further investigate this IRQ firing
while the CQ is not armed. It smells like a bug at the driver/NAPI
level; it is not expected HW behavior.

Any pointers on how to reproduce? How often would the "rx_irq_missed"
counter advance under a line-rate RX load?
>
> Note that this workaround would probably not work if the IRQ is spread
> over many cpus, since it assumes the hard irq and softirq are handled by
> the same cpu. This kind of setup is buggy anyway because of reordering
> issues.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Tariq Toukan <tariqt@...lanox.com>
> Cc: Saeed Mahameed <saeedm@...lanox.com>
> ---
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 32 +++++++++++++----
> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 1
> 2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
> struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
> struct mlx4_en_priv *priv = netdev_priv(cq->dev);
>
> - if (likely(priv->port_up))
> - napi_schedule_irqoff(&cq->napi);
> - else
> + if (likely(priv->port_up)) {
> + if (napi_schedule_prep(&cq->napi))
> + __napi_schedule_irqoff(&cq->napi);
> + else
> + cq->rx_irq_missed++;
> + } else {
> mlx4_en_arm_cq(priv, cq);
> + }
> }
>
> /* Rx CQ polling - called by NAPI */
> @@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
> struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
> struct net_device *dev = cq->dev;
> struct mlx4_en_priv *priv = netdev_priv(dev);
> - int done;
> + int done = 0;
> + u32 rx_irq_missed;
>
> - done = mlx4_en_process_rx_cq(dev, cq, budget);
> +again:
> + rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
> + done += mlx4_en_process_rx_cq(dev, cq, budget - done);
>
> /* If we used up all the quota - we're probably not done yet... */
> if (done == budget) {
> @@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
> */
> if (done)
> done--;
> + if (napi_complete_done(napi, done))
> + mlx4_en_arm_cq(priv, cq);
> + return done;
> }
> - /* Done for now */
> - if (napi_complete_done(napi, done))
> + if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
> + goto again;
> +
> + if (napi_complete_done(napi, done)) {
> mlx4_en_arm_cq(priv, cq);
> +
> + /* We might have received an interrupt too soon */
> + if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
> + napi_reschedule(napi))
> + goto again;
> + }
> return done;
> }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -369,6 +369,7 @@ struct mlx4_en_cq {
> int ring;
> struct net_device *dev;
> struct napi_struct napi;
> + u32 rx_irq_missed;
> int size;
> int buf_size;
> int vector;
>
>