Message-ID: <20240709074230.GC346094@kernel.org>
Date: Tue, 9 Jul 2024 08:42:30 +0100
From: Simon Horman <horms@...nel.org>
To: "Loktionov, Aleksandr" <aleksandr.loktionov@...el.com>
Cc: "Nguyen, Anthony L" <anthony.l.nguyen@...el.com>,
	"Kang, Kelvin" <kelvin.kang@...el.com>,
	"Kubalewski, Arkadiusz" <arkadiusz.kubalewski@...el.com>,
	"intel-wired-lan@...ts.osuosl.org" <intel-wired-lan@...ts.osuosl.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove needless
 retries of NVM update

On Mon, Jul 08, 2024 at 03:38:11PM +0000, Loktionov, Aleksandr wrote:
> 
> 
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces@...osl.org> On Behalf
> > Of Simon Horman
> > Sent: Thursday, June 27, 2024 7:34 PM
> > To: Loktionov, Aleksandr <aleksandr.loktionov@...el.com>
> > Cc: Nguyen, Anthony L <anthony.l.nguyen@...el.com>; Kang, Kelvin
> > <kelvin.kang@...el.com>; Kubalewski, Arkadiusz
> > <arkadiusz.kubalewski@...el.com>; intel-wired-lan@...ts.osuosl.org;
> > netdev@...r.kernel.org
> > Subject: Re: [Intel-wired-lan] [PATCH iwl-net v5] i40e: fix: remove
> > needless retries of NVM update
> > 
> > On Tue, Jun 25, 2024 at 08:49:53PM +0200, Aleksandr Loktionov wrote:
> > > Remove wrong EIO to EAGAIN conversion and pass all errors as is.
> > >
> > > After commit 230f3d53a547 ("i40e: remove i40e_status"), which should
> > > only replace F/W specific error codes with Linux kernel generic, all
> > > EIO errors suddenly started to be converted into EAGAIN which leads
> > > nvmupdate to retry until it times out and sometimes fails after more
> > > than 20 minutes in the middle of an NVM update, so the NVM becomes
> > > corrupted.
> > >
> > > The bug affects users only when they try to update the NVM, and only
> > > with F/W versions that generate errors during nvmupdate. For
> > > example, X710DA2 with 0x8000ECB7 F/W is affected, but there are
> > > probably more...
> > >
> > > Command for reproduction is just NVM update:
> > >  ./nvmupdate64
> > >
> > > In the log instead of:
> > >  i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM
> > > appears:
> > >  i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
> > >  i40e: eeprom check failed (-5), Tx/Rx traffic disabled
> > >
> > > The problematic code silently converted EIO into EAGAIN, which forced
> > > nvmupdate to ignore the EAGAIN error and retry the same operation
> > > until timeout. That's why the NVM update takes 20+ minutes to finish,
> > > with a failure at the end.
> > >
> > > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> > > Co-developed-by: Kelvin Kang <kelvin.kang@...el.com>
> > > Signed-off-by: Kelvin Kang <kelvin.kang@...el.com>
> > > Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@...el.com>
> > > Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@...el.com>
> > 
> > Hi Aleksandr,
> > 
> > Maybe I'm reading things wrong, but I have concerns :(
> > 
> > Amongst other things, the cited commit:
> > 1. Maps a number of different I40E_ERR_* values to -EIO; and
> > 2. Maps checks on different I40E_ERR_* values to -EIO
> > 
> > My concern is that the code may now incorrectly match against -EIO for
> > cases where it would not have previously matched when more specific
> > error codes were used.
> > 
> > In the case at hand:
> > 1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_ERROR
> > 2. i40e_aq_rc_to_posix checks for -EIO in place of
> >    I40E_ERR_ADMIN_QUEUE_TIMEOUT
> > 
> > As you point out, we are now in a bad place.
> > Which your patch addresses.
> > 
> > But what about a different case where:
> > 1. -EIO is returned in place of I40E_ERR_ADMIN_QUEUE_TIMEOUT
> > 2. i40e_aq_rc_to_posix checks for -EIO in place of
> >    I40E_ERR_ADMIN_QUEUE_TIMEOUT
> > 
> > In this scenario, the code without your patch is correct, and with
> > your patch it seems incorrect.
> > 
> > Perhaps only the scenario you are fixing occurs.
> > If so, all good. But it's not obvious to me that is the case.
> > 
> > I'm likewise concerned by other conditions on -EIO introduced by the
> > cited commit.
> 
> This commit does not introduce -EIO errors.
> Before 230f3d53a547 ("i40e: remove i40e_status") some specific F/W error codes were
> converted into -EAGAIN by i40e_aq_rc_to_posix(), but now all error codes are already
> Linux kernel codes, so there is no way to distinguish special F/W codes and convert
> them into -EAGAIN.

Right, this last part is the nub of my concern.
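
To make that concrete, here is a minimal userspace sketch, not the driver
code. The enum values and the helper names old_to_posix(),
collapse_to_errno() and new_to_posix() are hypothetical and heavily
simplified; only the -EIO to -EAGAIN check mirrors the snippet quoted
further down in this thread.

```c
#include <errno.h>

/* Hypothetical stand-ins for the driver-internal codes that existed
 * before 230f3d53a547; the numeric values are arbitrary. */
enum old_status {
	OLD_ERR_ADMIN_QUEUE_ERROR   = -1001,
	OLD_ERR_ADMIN_QUEUE_TIMEOUT = -1002,
};

/* Before (simplified): only the timeout case became -EAGAIN. */
static int old_to_posix(int aq_ret)
{
	if (aq_ret == OLD_ERR_ADMIN_QUEUE_TIMEOUT)
		return -EAGAIN;
	return -EIO;
}

/* After: both internal codes are reported as -EIO at their source. */
static int collapse_to_errno(enum old_status s)
{
	(void)s;
	return -EIO;
}

/* After (simplified): the check can no longer tell the cases apart,
 * so a plain admin queue error is also turned into -EAGAIN. */
static int new_to_posix(int aq_ret)
{
	if (aq_ret == -EIO)
		return -EAGAIN;
	return aq_ret;
}
```

With the old codes, only the timeout maps to -EAGAIN. Once both are
collapsed to -EIO, either input yields -EAGAIN, which is the retry
behaviour the patch description reports.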

> Our validation team has regression-tested the current patch and signed off on it.
> 
> Do you propose changing
> 	if (aq_ret == -EIO)
> 		return -EAGAIN;
> into
> 
> 	if (aq_ret == -EIO)
> 		return -EIO;
> ?
> 
> It will require additional testing...

If the problem I described is indeed a problem then I suspect a more
invasive change is required, to differentiate between the different
cases previously covered by internal error codes.
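
Purely as an illustration of what "differentiate" could mean, and not
as a proposal for this patch: the sketch below keeps the timeout case
distinguishable by giving it its own errno. The struct aq_result type,
the helper names, and the choice of -ETIMEDOUT are all assumptions of
this sketch; none of them come from the driver or from this thread.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the admin queue layer observed. */
struct aq_result {
	bool timed_out;	/* firmware did not answer in time */
	bool fw_error;	/* firmware answered with an error */
};

/* Keep the two failure modes apart instead of folding both into -EIO.
 * Using -ETIMEDOUT for the timeout is an assumption of this sketch. */
static int aq_result_to_errno(struct aq_result r)
{
	if (r.timed_out)
		return -ETIMEDOUT;
	if (r.fw_error)
		return -EIO;
	return 0;
}

/* Only a genuine timeout asks the caller to retry; every other
 * value, including -EIO, is passed through unchanged. */
static int to_posix(int aq_ret)
{
	if (aq_ret == -ETIMEDOUT)
		return -EAGAIN;
	return aq_ret;
}
```

Under these assumptions a firmware error stays -EIO and stops the
retry loop, while a timeout still results in -EAGAIN.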

However, that is speculation on my part,
while your patch has been tested.

So I suggest, contrary to my previous email, that this patch move forward.

IOW, I am not blocking progress of this patch (anymore).

...
