lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 10 Jun 2024 14:25:39 +0200
From: Paul Menzel <pmenzel@...gen.mpg.de>
To: Aleksandr Loktionov <aleksandr.loktionov@...el.com>,
 Kelvin Kang <kelvin.kang@...el.com>
Cc: Jan Sokolowski <jan.sokolowski@...el.com>, netdev@...r.kernel.org,
 Arkadiusz Kubalewski <arkadiusz.kubalewski@...el.com>,
 Anthony L Nguyen <anthony.l.nguyen@...el.com>,
 intel-wired-lan@...ts.osuosl.org, Leon Romanovsky <leonro@...dia.com>
Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot issue NVM content
 is corrupted after nvmupdate

Dear Aleksandr,


Thank you for your response.


Am 10.06.24 um 12:20 schrieb Loktionov, Aleksandr:
> 
> 
>> -----Original Message-----
>> From: Intel-wired-lan <intel-wired-lan-bounces@...osl.org> On Behalf Of Loktionov, Aleksandr
>> Sent: Monday, June 10, 2024 12:16 PM

>>> -----Original Message-----
>>> From: Loktionov, Aleksandr
>>> Sent: Monday, June 10, 2024 12:14 PM

>>>> -----Original Message-----
>>>> From: Paul Menzel <pmenzel@...gen.mpg.de>
>>>> Sent: Monday, June 10, 2024 11:45 AM

>>>> Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
>>>>> After 230f3d53a547 patch all I/O errors are being converted into
>>>>> EAGAIN which leads to retries until timeout so nvmupdate sometimes
>>>>> fails after more than 20 minutes!
>>>>>
>>>>> Remove misleading EIO to EGAIN conversion and pass all errors as
>>>> is.
>>>>>
>>>>> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
>>>>
>>>> This commit is present since v6.6-rc1, released September last year
>>>> (2023). So until now, nobody noticed this?
>>>>
>>> Really, really. The regression affects users only when they update
>>> F/W, and not all F/W are affected, only that generate I/O errors while
>>> update.
>> Not all the cards are affected, but the consequences are serous as
>> in subj.
>>
>>>>> Co-developed-by: Kelvin Kang <kelvin.kang@...el.com>
>>>>> Signed-off-by: Kelvin Kang <kelvin.kang@...el.com>
>>>>> Reviewed-by: Arkadiusz Kubalewski
>>>> <arkadiusz.kubalewski@...el.com>
>>>>> Signed-off-by: Aleksandr Loktionov
>>>> <aleksandr.loktionov@...el.com>
>>>>
>>>> Please give more details about your test setup. For me it’s also not
>>>> clear, how the NVM content gets corrupted as stated in the
>>>> summary/title. Could you please elaborate that in the commit message.

> For example X710DA2 with 0x8000ECB7 is affected, but there are
> probably more...

Please amend the commit message with this information, and for ease also 
the commands you executed.

> The corruption is already described - because of
> timeout nvmupdate timeouts failing to update NVM.

Only because something times out, does not mean it causes corruption.

Please amend the commit message. It looks like you also missed more of 
my questions below.

>>>>> ---
>>>>>    drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>>>>>    1 file changed, 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> index ee86d2c..55b5bb8 100644
>>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>>>>>    		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>>>>>    	};
>>>>>
>>>>> -	/* aq_rc is invalid if AQ timed out */
>>>>> -	if (aq_ret == -EIO)
>>>>> -		return -EAGAIN;
>>>>> -
>>>>>    	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>>>>>    		return -ERANGE;
>>>>
>>>> The referenced commit 230f3d53a547 does:
>>>>
>>>> ```
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> index ee394aacef4d..267f2e0a21ce 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> @@ -5,7 +5,6 @@
>>>>    #define _I40E_ADMINQ_H_
>>>>
>>>>    #include "i40e_osdep.h"
>>>> -  #include "i40e_status.h"
>>>>    #include "i40e_adminq_cmd.h"
>>>>
>>>>    #define I40E_ADMINQ_DESC(R, i)   \
>>>> @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>>>>           };
>>>>
>>>>           /* aq_rc is invalid if AQ timed out */
>>>> -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
>>>> +       if (aq_ret == -EIO)
>>>>                   return -EAGAIN;
>>>>
>>>>           if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>>>> ```
>>>>
>>>> So I do not see yet, why removing the whole hunk is the solution.

Kind regards,

Paul


PS: Would it be possible, that you use another email program. The cited 
parts are wrapped awkwardly, and it takes some time to correct it in my 
response.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ