lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Thu, 27 Jun 2024 18:33:51 +0100
From: Simon Horman <horms@...nel.org>
To: Aleksandr Loktionov <aleksandr.loktionov@...el.com>
Cc: intel-wired-lan@...ts.osuosl.org, anthony.l.nguyen@...el.com,
	netdev@...r.kernel.org, Kelvin Kang <kelvin.kang@...el.com>,
	Arkadiusz Kubalewski <arkadiusz.kubalewski@...el.com>
Subject: Re: [PATCH iwl-net v5] i40e: fix: remove needless retries of NVM
 update

On Tue, Jun 25, 2024 at 08:49:53PM +0200, Aleksandr Loktionov wrote:
> Remove wrong EIO to EGAIN conversion and pass all errors as is.
> 
> After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
> replace F/W specific error codes with Linux kernel generic, all EIO errors
> suddenly started to be converted into EAGAIN which leads nvmupdate to retry
> until it timeouts and sometimes fails after more than 20 minutes in the
> middle of NVM update, so NVM becomes corrupted.
> 
> The bug affects users only at the time when they try to update NVM, and
> only F/W versions that generate errors while nvmupdate. For example, X710DA2
> with 0x8000ECB7 F/W is affected, but there are probably more...
> 
> Command for reproduction is just NVM update:
>  ./nvmupdate64
> 
> In the log instead of:
>  i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM)
> appears:
>  i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
>  i40e: eeprom check failed (-5), Tx/Rx traffic disabled
> 
> The problematic code did silently convert EIO into EAGAIN which forced
> nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
> That's why NVM update takes 20+ minutes to finish with the fail in the end.
> 
> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> Co-developed-by: Kelvin Kang <kelvin.kang@...el.com>
> Signed-off-by: Kelvin Kang <kelvin.kang@...el.com>
> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@...el.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@...el.com>

Hi Aleksandr,

Maybe I'm reading things wrong, I have concerns :(

Amongst other things, the cited commit:
1. Maps a number of different I40E_ERR_* values to -EIO; and
2. Maps checks on different I40E_ERR_* values to -EIO

My concern is that the code may now incorrectly match against -EIO
for cases where it would not have previously matched when more
specific error codes.

In the case at hand:
1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_ERROR
2. i40e_aq_rc_to_posix checks for -EIO in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT

As you point out, we are now in a bad place.
Which your patch addresses.

But what about a different case where:
1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT
2. i40e_aq_rc_to_posix checks for -EIO in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT

In this scenario the, the code without your patch is correct,
and with your patch it seems incorrect.

Perhaps only the scenario you are fixing occurs.
If so, all good. But it's not obvious to me that is the case.

I'm likewise concerned by other conditions on -EIO
introduced by the cited commit.

> ---
> v4->v5 commit message update
> https://lore.kernel.org/netdev/20240618132111.3193963-1-aleksandr.loktionov@intel.com/T/#u
> v3->v4 commit message update
> v2->v3 commit messege typos
> v1->v2 commit message update
> ---
>  drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> index ee86d2c..55b5bb8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>  		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>  	};
>  
> -	/* aq_rc is invalid if AQ timed out */
> -	if (aq_ret == -EIO)
> -		return -EAGAIN;
> -

Perhaps it has already been covered, but with this change the aq_ret
argument of this function is longer used.  It could be removed as a
follow-up for iwl-next.

>  	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>  		return -ERANGE;
>  
> -- 
> 2.25.1
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ