[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZJWS8coDN71C/oeE@corigine.com>
Date: Fri, 23 Jun 2023 14:41:21 +0200
From: Simon Horman <simon.horman@...igine.com>
To: Nick Child <nnac123@...ux.ibm.com>
Cc: netdev@...r.kernel.org, haren@...ux.ibm.com, ricklind@...ibm.com,
mpe@...erman.id.au, kuba@...nel.org, tlfalcon@...ux.ibm.com,
danymadden@...ibm.com, npiggin@...il.com,
christophe.leroy@...roup.eu, davem@...emloft.net, pabeni@...hat.com,
edumazet@...gle.com, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH net] ibmvnic: Do not reset dql stats on NON_FATAL err
On Fri, Jun 23, 2023 at 09:52:26AM +0200, Simon Horman wrote:
> + maintainers and blamed authors
A second time, because something went wrong with the first attempt.
> On Thu, Jun 22, 2023 at 02:03:32PM -0500, Nick Child wrote:
> > All ibmvnic resets, make a call to netdev_tx_reset_queue() when
> > re-opening the device. netdev_tx_reset_queue() resets the num_queued
> > and num_completed byte counters. These stats are used in Byte Queue
> > Limit (BQL) algorithms. The difference between these two stats tracks
> > the number of bytes currently sitting on the physical NIC. ibmvnic
> > increases the number of queued bytes though calls to
> > netdev_tx_sent_queue() in the drivers xmit function. When, VIOS reports
> > that it is done transmitting bytes, the ibmvnic device increases the
> > number of completed bytes through calls to netdev_tx_completed_queue().
> > It is important to note that the driver batches its transmit calls and
> > num_queued is increased every time that an skb is added to the next
> > batch, not necessarily when the batch is sent to VIOS for transmission.
> >
> > Unlike other reset types, a NON FATAL reset will not flush the sub crq
> > tx buffers. Therefore, it is possible for the batched skb array to be
> > partially full. So if there is call to netdev_tx_reset_queue() when
> > re-opening the device, the value of num_queued (0) would not account
> > for the skb's that are currently batched. Eventually, when the batch
> > is sent to VIOS, the call to netdev_tx_completed_queue() would increase
> > num_completed to a value greater than the num_queued. This causes a
> > BUG_ON crash:
> >
> > ibmvnic 30000002: Firmware reports error, cause: adapter problem.
> > Starting recovery...
> > ibmvnic 30000002: tx error 600
> > ibmvnic 30000002: tx error 600
> > ibmvnic 30000002: tx error 600
> > ibmvnic 30000002: tx error 600
> > ------------[ cut here ]------------
> > kernel BUG at lib/dynamic_queue_limits.c:27!
> > Oops: Exception in kernel mode, sig: 5
> > [....]
> > NIP dql_completed+0x28/0x1c0
> > LR ibmvnic_complete_tx.isra.0+0x23c/0x420 [ibmvnic]
> > Call Trace:
> > ibmvnic_complete_tx.isra.0+0x3f8/0x420 [ibmvnic] (unreliable)
> > ibmvnic_interrupt_tx+0x40/0x70 [ibmvnic]
> > __handle_irq_event_percpu+0x98/0x270
> > ---[ end trace ]---
> >
> > Therefore, do not reset the dql stats when performing a NON_FATAL reset.
> > Simply clearing the queues off bit is sufficient.
> >
> > Fixes: 0d973388185d ("ibmvnic: Introduce xmit_more support using batched subCRQ hcalls")
> > Signed-off-by: Nick Child <nnac123@...ux.ibm.com>
> > ---
> > drivers/net/ethernet/ibm/ibmvnic.c | 13 ++++++++++++-
> > 1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> > index c63d3ec9d328..5523ab52ff2b 100644
> > --- a/drivers/net/ethernet/ibm/ibmvnic.c
> > +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> > @@ -1816,7 +1816,18 @@ static int __ibmvnic_open(struct net_device *netdev)
> > if (prev_state == VNIC_CLOSED)
> > enable_irq(adapter->tx_scrq[i]->irq);
> > enable_scrq_irq(adapter, adapter->tx_scrq[i]);
> > - netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
> > + /* netdev_tx_reset_queue will reset dql stats and set the stacks
> > + * flag for queue status. During NON_FATAL resets, just
> > + * clear the bit, don't reset the stats because there could
> > + * be batched skb's waiting to be sent. If we reset dql stats,
> > + * we risk num_completed being greater than num_queued.
> > + * This will cause a BUG_ON in dql_completed().
> > + */
> > + if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
> > + clear_bit(__QUEUE_STATE_STACK_XOFF,
> > + &netdev_get_tx_queue(netdev, i)->state);
> > + else
> > + netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
> > }
> >
> > rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
> > --
> > 2.31.1
> >
> >
Powered by blists - more mailing lists