[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20251224165301.2794-1-noam.d.eliyahu@gmail.com>
Date: Wed, 24 Dec 2025 18:53:01 +0200
From: "Noam D. Eliyahu" <noam.d.eliyahu@...il.com>
To: pavan.chebbi@...adcom.com
Cc: mchan@...adcom.com,
netdev@...r.kernel.org
Subject: Re: [DISCUSS] tg3 reboot handling on Dell T440 (BCM5720)
Thanks for the quick reply!
On Mon, Dec 22, 2025 at 11:23 AM Pavan Chebbi <pavan.chebbi@...adcom.com> wrote:
>
> On Sun, Dec 21, 2025 at 11:20 PM Noam D. Eliyahu
> <noam.d.eliyahu@...il.com> wrote:
> >
> > ### Relevant driver evolution
> >
> > * **9931c9d04f4d – tg3: power down device only on SYSTEM_POWER_OFF**
>
> I think this commit id is wrong. Anyway, I know the commit.
My apologies, I likely copied a local hash; the upstream commit ID is 9fc3bc764334.
> This is going to be a problem, please follow the discussion here:
> https://lore.kernel.org/netdev/CALs4sv1-6mgQ2JfF9MYiRADxumJD7m7OGWhCB5aWj1tGP0OPJg@mail.gmail.com/
> where regression risk is flagged and it came true in
> https://lore.kernel.org/netdev/CALs4sv2_JZd5K-ZgBkjL=QpXVEXnoJrjuqwwKg0+jo2-4taHJw@mail.gmail.com/
Thank you for the links.
I understand the need to prevent regressions, especially in an area where it happened before.
That said, I still think the design of the first and second fixes is problematic and needs adjusting.
The original bug (Fixed in: 9fc3bc764334) was triggered by SNP initialization on specific models (R650xs with BCM5720).
The fix, the conditional tg3_power_down call, *was applied globally regardless of models*.
The second bug I mentioned (Fixed in: e0efe83ed3252) was triggered mainly due to the conditional tg3_power_down call.
Look again at the changes made in the commit referenced by e0efe83ed3252 (2ca1c94ce0b6 as the original fix):
```
+ tg3_reset_task_cancel(tp);
+
rtnl_lock();
+
netif_device_detach(dev);
if (netif_running(dev))
dev_close(dev);
- if (system_state == SYSTEM_POWER_OFF)
- tg3_power_down(tp);
+ tg3_power_down(tp); /* NOTE: the conditional system state based tg3_power_down call was problematic */
rtnl_unlock();
+
+ pci_disable_device(pdev);
}
```
The changes in 2ca1c94ce0b6 caused the regression which later led to the AER disablement in e0efe83ed3252.
The problem is that it was decided to apply the change to a specific set of models, even though it originated from 9fc3bc764334 which was applied globally.
If we apply the conditional tg3_power_down specifically for the R650xs, we can guarantee no regression, as the logic for the models with the first bug stays the same, just now limited to their set of models.
By applying the conditional tg3_power_down this way, we won't need the AER disablement at all.
> >
> > 2. **Flip the conditioning**
> > Keep the DMI list, but use it to guard the conditional tg3_power_down instead (only for models where the original issue was observed, e.g. R650xs). Drop the AER handling entirely. This limits risk to known systems while simplifying the flow.
> But I am not sure how systems affected in the commit e0efe83ed3252 will react. Can't tell 100pc without testing.
Regarding your concern about systems affected in e0efe83ed3252: my hardware (Dell PowerEdge T440) is one of the models affected in e0efe83ed3252 but not listed in the initial DMI match list.
I tested all of the solutions I suggested, others have reported the same regarding my first suggestion (to remove both the conditional tg3_power_down and the AER disablement) online, and most importantly, the e0efe83ed3252 commit itself references the original fix (2ca1c94ce0b6) which didn't include a specific set of models and was considered a viable fix.
If we restrict the system_state check (from 9fc3bc764334) to a DMI table for the R650xs, all other systems would revert to the 'unconditional' tg3_power_down which was the standard for years. This would naturally prevent the AER errors (as I and others had seen on our machines) without needing to touch the AER registers at all.
For reference:
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e0efe83ed3252
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=2ca1c94ce0b6
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9fc3bc764334
Best regards,
Noam D. Eliyahu
Powered by blists - more mailing lists