netdev - Re: [PATCH 0/4][RFC] Disable e1000e power management if hardware error is detected

Message-ID: <8412242.T7Z3S40VBb@natalenko.name>
Date: Sun, 14 Jul 2024 21:36:49 +0200
From: Oleksandr Natalenko <oleksandr@...alenko.name>
To: intel-wired-lan@...ts.osuosl.org, Chen Yu <yu.c.chen@...el.com>
Cc: "Neftin, Sasha" <sasha.neftin@...el.com>, Len Brown <len.brown@...el.com>,
 "Rafael J. Wysocki" <rjw@...ysocki.net>,
 "Brandt, Todd E" <todd.e.brandt@...el.com>, Zhang Rui <rui.zhang@...el.com>,
 Tony Nguyen <anthony.l.nguyen@...el.com>,
 Jesse Brandeburg <jesse.brandeburg@...el.com>, linux-kernel@...r.kernel.org,
 Chen Yu <yu.c.chen@...el.com>, "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org
Subject: Re: [PATCH 0/4][RFC] Disable e1000e power management if hardware error is detected

Hello Yu.

On Wednesday, 11 November 2020 6:50:35 CEST, Chen Yu wrote:
> This is a trial patchset that aims to cope with an intermittently
> triggered hardware error during system resume.
> 
> On some platforms a NIC hardware error was detected during
> resume from S3, causing the NIC to not fully initialize
> and to remain in an unstable state afterwards. As a consequence,
> the system fails to suspend due to the incorrect NIC status.
> 
> In theory, if the NIC could not be initialized after resume,
> it should not take part in system/runtime suspend/resume afterwards.
> There are two proposals to deal with this situation:
> 
> Either:
> 1. Each time before the NIC suspends, check the status
>    of the NIC by querying the corresponding registers, and bypass the
>    suspend callback on this NIC if it is unstable.
> 
> Or:
> 2. During NIC resume, if a hardware error is detected, remove
>    the NIC from the power management list entirely.
> 
> Proposal 2 was chosen in this patch set because:
> 1. Proposal 1 requires that the e1000e driver query the status
>    of the NIC. However, there seem to be no specific registers
>    for e1000e to query the result of NIC initialization.
> 2. Proposal 1 just bypasses the suspend process, but the power management
>    framework is still aware of this NIC, which might lead to
>    race conditions.
> 3. Proposal 2 is a clean, platform-independent solution:
>    not only e1000e but also other drivers could leverage
>    this generic mechanism in the future.
> 
> Comments appreciated.
> 
> Chen Yu (4):
>   e1000e: save the return value of e1000e_reset()
>   PM: sleep: export device_pm_remove() for driver use
>   e1000e: Introduce workqueue to disable the power management
>   e1000e: Disable the power management if hardware error detected during
>     resume
> 
>  drivers/base/power/main.c                  |  1 +
>  drivers/base/power/power.h                 |  8 -------
>  drivers/net/ethernet/intel/e1000e/e1000.h  |  1 +
>  drivers/net/ethernet/intel/e1000e/netdev.c | 27 ++++++++++++++++++----
>  include/linux/pm.h                         | 12 ++++++++++
>  5 files changed, 37 insertions(+), 12 deletions(-)
> 
> 

It seems this submission got stuck at the RFC stage, and I'm not sure you received any feedback on it. Sorry for necrobumping this thread, but what is its current state?

I can confirm that v6.8 is still affected (I discovered this on a T490s). As a workaround, I unload the e1000e module before entering S3 and load it again after resume.
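
For anyone wanting to automate that workaround, here is a minimal sketch, assuming a systemd-based system where executables placed in /usr/lib/systemd/system-sleep/ are run with "pre" or "post" as the first argument. The script layout and the helper name command_for() are hypothetical and only illustrate the idea; this is not a tested fix and not part of the patch set.

```python
#!/usr/bin/env python3
"""Hypothetical systemd system-sleep hook: unload e1000e before sleep, reload after."""
import subprocess
import sys

MODULE = "e1000e"


def command_for(phase):
    """Return the modprobe command for a sleep phase, or None to do nothing."""
    if phase == "pre":
        # Before the system goes to sleep: remove the driver.
        return ["modprobe", "-r", MODULE]
    if phase == "post":
        # After resume: load the driver again.
        return ["modprobe", MODULE]
    return None


def main(argv):
    # Assumption: the first argument is the phase ("pre" or "post").
    # The second argument (the sleep type) is ignored in this sketch.
    cmd = command_for(argv[1])
    if cmd is None:
        return 0
    return subprocess.run(cmd, check=False).returncode


# Only act when a phase argument was actually passed.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The hook would need to be executable and run as root. It acts on every sleep type, so it would need an extra check on the second argument to limit it to S3 only.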

For LKML reference, the linked kernel BZ is: https://bugzilla.kernel.org/show_bug.cgi?id=205015

Thank you.

-- 
Oleksandr Natalenko (post-factum)