netdev - RE: [PATCH net] ibmvnic: Do not reset dql stats on NON

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <277ca0b3-32df-532f-e82f-a7d42dce54ab@linux.vnet.ibm.com>
Date: Mon, 26 Jun 2023 12:06:10 -0700
From: Rick Lindsley <ricklind@...ux.vnet.ibm.com>
To: Jakub Kicinski <kuba@...nel.org>, Nick Child <nnac123@...ux.ibm.com>
Cc: netdev@...r.kernel.org, haren@...ux.ibm.com, ricklind@...ibm.com
Subject: RE: [PATCH net] ibmvnic: Do not reset dql stats on NON_FATAL err

On 6/24/23 15:19, Jakub Kicinski wrote:

> Why are you trying to clear this bit?
> 
> If the completions will still come the bit will be cleared (or not)
> during completion handling (netdev_tx_completed_queue() et al.)
> 
> Drivers shouldn't be poking into queue state bits directly.

+		if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
+			clear_bit(__QUEUE_STATE_STACK_XOFF,
+				  &netdev_get_tx_queue(netdev, i)->state);
+		else
+			netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));

The problem is the call to dql_reset() in netdev_tx_reset_queue().  If we reset dql stats,   we risk num_completed being greater than num_queued, which will cause a BUG_ON to fire in dql_completed().

     static inline void netdev_tx_reset_queue(struct netdev_queue *q)
     {
     #ifdef CONFIG_BQL
         clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state);
         dql_reset(&q->dql);
     #endif
     }

Existing code calls  netdev_tx_reset_queue() unconditionally.  When the error is non-fatal, though, we were tripping over the BUG_ON on those occasions when a few skbs were already submitted.  The patch here is more about NOT calling dql_reset() than it is about clearing __QUEUE_STATE_STACK_XOFF.  That was only included because it-was-the-other-thing-the-function-did.  So ... maybe we don't care about __QUEUE_STATE_STACK_XOFF?

Nick, do we *need* to have __QUEUE_STATE_STACK_XOFF cleared?  If not, then do we need to call or emulate netdev_tx_reset_queue() at all on a non-fatal error?