[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZJVPMyO/gBkk1OT0@corigine.com>
Date: Fri, 23 Jun 2023 09:52:19 +0200
From: Simon Horman <simon.horman@...igine.com>
To: Nick Child <nnac123@...ux.ibm.com>
Cc: netdev@...r.kernel.org, haren@...ux.ibm.com, ricklind@...ibm.com
Subject: Re: [PATCH net] ibmvnic: Do not reset dql stats on NON_FATAL err
+ maintainers and blamed authors
On Thu, Jun 22, 2023 at 02:03:32PM -0500, Nick Child wrote:
> All ibmvnic resets, make a call to netdev_tx_reset_queue() when
> re-opening the device. netdev_tx_reset_queue() resets the num_queued
> and num_completed byte counters. These stats are used in Byte Queue
> Limit (BQL) algorithms. The difference between these two stats tracks
> the number of bytes currently sitting on the physical NIC. ibmvnic
> increases the number of queued bytes though calls to
> netdev_tx_sent_queue() in the drivers xmit function. When, VIOS reports
> that it is done transmitting bytes, the ibmvnic device increases the
> number of completed bytes through calls to netdev_tx_completed_queue().
> It is important to note that the driver batches its transmit calls and
> num_queued is increased every time that an skb is added to the next
> batch, not necessarily when the batch is sent to VIOS for transmission.
>
> Unlike other reset types, a NON FATAL reset will not flush the sub crq
> tx buffers. Therefore, it is possible for the batched skb array to be
> partially full. So if there is call to netdev_tx_reset_queue() when
> re-opening the device, the value of num_queued (0) would not account
> for the skb's that are currently batched. Eventually, when the batch
> is sent to VIOS, the call to netdev_tx_completed_queue() would increase
> num_completed to a value greater than the num_queued. This causes a
> BUG_ON crash:
>
> ibmvnic 30000002: Firmware reports error, cause: adapter problem.
> Starting recovery...
> ibmvnic 30000002: tx error 600
> ibmvnic 30000002: tx error 600
> ibmvnic 30000002: tx error 600
> ibmvnic 30000002: tx error 600
> ------------[ cut here ]------------
> kernel BUG at lib/dynamic_queue_limits.c:27!
> Oops: Exception in kernel mode, sig: 5
> [....]
> NIP dql_completed+0x28/0x1c0
> LR ibmvnic_complete_tx.isra.0+0x23c/0x420 [ibmvnic]
> Call Trace:
> ibmvnic_complete_tx.isra.0+0x3f8/0x420 [ibmvnic] (unreliable)
> ibmvnic_interrupt_tx+0x40/0x70 [ibmvnic]
> __handle_irq_event_percpu+0x98/0x270
> ---[ end trace ]---
>
> Therefore, do not reset the dql stats when performing a NON_FATAL reset.
> Simply clearing the queues off bit is sufficient.
>
> Fixes: 0d973388185d ("ibmvnic: Introduce xmit_more support using batched subCRQ hcalls")
> Signed-off-by: Nick Child <nnac123@...ux.ibm.com>
> ---
> drivers/net/ethernet/ibm/ibmvnic.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index c63d3ec9d328..5523ab52ff2b 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1816,7 +1816,18 @@ static int __ibmvnic_open(struct net_device *netdev)
> if (prev_state == VNIC_CLOSED)
> enable_irq(adapter->tx_scrq[i]->irq);
> enable_scrq_irq(adapter, adapter->tx_scrq[i]);
> - netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
> + /* netdev_tx_reset_queue will reset dql stats and set the stacks
> + * flag for queue status. During NON_FATAL resets, just
> + * clear the bit, don't reset the stats because there could
> + * be batched skb's waiting to be sent. If we reset dql stats,
> + * we risk num_completed being greater than num_queued.
> + * This will cause a BUG_ON in dql_completed().
> + */
> + if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
> + clear_bit(__QUEUE_STATE_STACK_XOFF,
> + &netdev_get_tx_queue(netdev, i)->state);
> + else
> + netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
> }
>
> rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
> --
> 2.31.1
>
>
Powered by blists - more mailing lists