netdev - Re: [PATCH net] ibmvnic: Do not reset dql stats on NON

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <73a55db3-1298-9db7-3a81-f34f86587bfe@linux.ibm.com>
Date: Mon, 26 Jun 2023 14:23:49 -0500
From: Nick Child <nnac123@...ux.ibm.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, haren@...ux.ibm.com, ricklind@...ibm.com
Subject: Re: [PATCH net] ibmvnic: Do not reset dql stats on NON_FATAL err

On 6/26/23 12:33, Jakub Kicinski wrote:
> 
> If packets can get stuck in the driver that needs a dedicated fix.
> 
> You should only be doing the batching if xmit_more is set, really.
> And xmit_more will not be set if core is about to set
> __QUEUE_STATE_STACK_XOFF.
> 

I am not saying that packets can get stuck in the driver, if xmit_more 
is false or if the driver is hard resetting then the batch gets sent.

During nonfatal reset, we are simply reestablishing communications with 
the NIC and it is okay to keep the batch around. It is possible that 
there are batched skb's because xmit_more and the non fatal reset work 
independently of each-other and are not mutually exclusive. The upper 
level functions have no way of knowing when an issue from the NIC will 
occur.

> With a correctly written driver STACK_XOFF can only be set if we're
> expecting a completion.
> 
> AFAIU you shouldn't have to clear this flag. We need to reset DQL when
> hard resetting the queue because we may lose completions. But if no
> completions are lost, STACK_XOFF should be left alone.
> 
> Just clearing that flag is not valid either because qdiscs will not be
> rescheduled, so until some socket tries to xmit next packet the data
> already queued will not move.

This addresses my concern. So if STACK_XOFF gets set, then xmit_more 
will be false and our batch will get sent. This will eventually lead to 
a completion which will clear STACK_XOFF. Meaning the reset should not 
have to worry about clearing STACK_XOFF.

My worry was that STACK_XOFF would get set when the batch was only 
partially full and xmit_more was true. If a non fatal reset occurred 
here then there would be no expected completions and a partially filled 
batch. So I thought we would have to clear STACK_XOFF in order to get 
the queue to be usable again. Sounds like this is not possible due to 
xmit_more being false if STACK_XOFF gets set.

On 6/26/23 14:06, Rick Lindsley wrote:
 >
 > Nick, do we *need* to have __QUEUE_STATE_STACK_XOFF cleared?  If not,
 > then do we need to call or emulate netdev_tx_reset_queue() at all on a
 > non-fatal error?

After reading Jakubs review. I believe we should remove the part where 
we clear STACK_XOFF.

If there are no objections, I will test these changes and send a v2.

Thanks