Message-ID: <aBG_jm62ngj0Mqq-@0ec9f3ddc3bf>
Date: Wed, 30 Apr 2025 09:13:34 +0300
From: Ian Ray <ian.ray@...ealthcare.com>
To: Simon Horman <horms@...nel.org>
Cc: Tony Nguyen <anthony.l.nguyen@...el.com>,
Przemek Kitszel <przemyslaw.kitszel@...el.com>,
Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
brian.ruley@...ealthcare.com, intel-wired-lan@...ts.osuosl.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
Toke Høiland-Jørgensen <toke@...hat.com>,
ian.ray@...ealthcare.com
Subject: Re: [PATCH] igb: Fix watchdog_task race with shutdown
On Tue, Apr 29, 2025 at 04:20:21PM +0100, Simon Horman wrote:
> + Toke
>
> On Mon, Apr 28, 2025 at 02:54:49PM +0300, Ian Ray wrote:
> > A rare [1] race condition is observed between the igb_watchdog_task and
> > shutdown on a dual-core i.MX6 based system with two I210 controllers.
> >
> > Using printk, the igb_watchdog_task is hung in igb_read_phy_reg because
> > __igb_shutdown has already called __igb_close.
> >
> > Fix this by locking in igb_watchdog_task (in the same way as is done in
> > igb_reset_task).
> >
> > reboot             kworker
> >
> > __igb_shutdown
> >   rtnl_lock
> >   __igb_close
> >   :                igb_watchdog_task
> >   :                :
> >   :                igb_read_phy_reg (hung)
> >   rtnl_unlock
> >
> > [1] Note that this is easier to reproduce with 'initcall_debug' logging
> > and additional printk logging in igb_main.
> >
> > Signed-off-by: Ian Ray <ian.ray@...ealthcare.com>
>
> Hi Ian,
>
> Thanks for your patch.
>
> While I think that the simplicity of this approach may well be appropriate
> as a fix for the problem described I do have a concern.
>
> I am worried that taking RTNL each time the watchdog task runs will create
> unnecessary lock contention. That may manifest in weird and wonderful ways
> in future. Maybe this patch doesn't make things materially worse in that
> regard. But it would be nice to have a plan to move away from using RTNL,
> as is happening elsewhere.
>
> ...
Hi Simon,

Many thanks for the review. I've been reflecting on the patch (and
discussing internally) and I think it would be better to model the
behaviour on igb_remove instead of igb_reset_task. Meaning that the
timer should be deleted, and the work cancelled, immediately after
setting bit __IGB_DOWN. This would mirror igb_up. (And has the
advantage of not using the RTNL.)

(As you can probably tell) I am not very familiar with this subsystem,
but the modified proposal, below, works well in my testing. I will
happily send a V2 if you think this is a better direction.
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 291348505868..d4b905469cc2 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2173,10 +2173,14 @@ void igb_down(struct igb_adapter *adapter)
 	u32 tctl, rctl;
 	int i;

-	/* signal that we're down so the interrupt handler does not
-	 * reschedule our watchdog timer
+	/* The watchdog timer may be rescheduled, so explicitly
+	 * disable watchdog from being rescheduled.
 	 */
 	set_bit(__IGB_DOWN, &adapter->state);
+	del_timer_sync(&adapter->watchdog_timer);
+	del_timer_sync(&adapter->phy_info_timer);
+
+	cancel_work_sync(&adapter->watchdog_task);

 	/* disable receives in the hardware */
 	rctl = rd32(E1000_RCTL);
@@ -2207,11 +2211,6 @@ void igb_down(struct igb_adapter *adapter)
 		}
 	}

-	del_timer_sync(&adapter->watchdog_timer);
-	del_timer_sync(&adapter->phy_info_timer);
-
-	cancel_work_sync(&adapter->watchdog_task);
-
 	/* record the stats before reset*/
 	spin_lock(&adapter->stats64_lock);
 	igb_update_stats(adapter);
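
For illustration only: the ordering described above (set __IGB_DOWN first, then delete the timers and cancel the work) can be shown with a toy, single-threaded model in plain C. This is a sketch, not driver code. struct model and every function below are hypothetical names made up for the example, and the model assumes (as a simplification) that both the interrupt path and the work item re-arm the watchdog timer only while the down flag is clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter state; NOT the driver's struct. */
struct model {
	bool down;		/* plays the role of __IGB_DOWN */
	bool timer_pending;	/* watchdog timer is armed */
	bool work_pending;	/* watchdog work is queued */
};

/* Assumed interrupt behaviour: re-arm the timer unless we are down. */
static void link_irq(struct model *m)
{
	if (!m->down)
		m->timer_pending = true;
}

/* Timer expiry: queue the work item. */
static void timer_fire(struct model *m)
{
	if (m->timer_pending) {
		m->timer_pending = false;
		m->work_pending = true;
	}
}

/* Work item: re-arm the timer unless we are down. */
static void work_run(struct model *m)
{
	if (m->work_pending) {
		m->work_pending = false;
		if (!m->down)
			m->timer_pending = true;
	}
}

/* Analogue of del_timer_sync() followed by cancel_work_sync(). */
static void cancel_all(struct model *m)
{
	m->timer_pending = false;
	m->work_pending = false;
}

/* Events that may still arrive while teardown is in progress. */
static void late_events(struct model *m)
{
	link_irq(m);
	timer_fire(m);
	work_run(m);
}

/* Flag first, then cancel. Returns true if anything is still pending. */
bool teardown_flag_first(struct model *m)
{
	m->down = true;
	cancel_all(m);
	late_events(m);
	return m->timer_pending || m->work_pending;
}

/* Cancel first, flag afterwards: late events can re-arm the timer. */
bool teardown_cancel_first(struct model *m)
{
	cancel_all(m);
	late_events(m);
	m->down = true;
	return m->timer_pending || m->work_pending;
}
```

In this model teardown_flag_first() leaves nothing pending, while teardown_cancel_first() can leave the timer armed, which is why the cancellation has to follow the flag rather than precede it. The real driver runs these paths concurrently, so the model only captures the ordering argument, not the actual race.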