[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CO1PR11MB508906CCB79542A1531B506AD662A@CO1PR11MB5089.namprd11.prod.outlook.com>
Date: Mon, 2 Jun 2025 21:32:48 +0000
From: "Keller, Jacob E" <jacob.e.keller@...el.com>
To: "Damato, Joe" <jdamato@...tly.com>, Jakub Kicinski <kuba@...nel.org>
CC: Stanislav Fomichev <stfomichev@...il.com>, "netdev@...r.kernel.org"
<netdev@...r.kernel.org>, "john.cs.hey@...il.com" <john.cs.hey@...il.com>,
"syzbot+846bb38dc67fe62cc733@...kaller.appspotmail.com"
<syzbot+846bb38dc67fe62cc733@...kaller.appspotmail.com>, "Nguyen, Anthony L"
<anthony.l.nguyen@...el.com>, "Kitszel, Przemyslaw"
<przemyslaw.kitszel@...el.com>, Andrew Lunn <andrew+netdev@...n.ch>, "David
S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, "Paolo
Abeni" <pabeni@...hat.com>, "moderated list:INTEL ETHERNET DRIVERS"
<intel-wired-lan@...ts.osuosl.org>, open list <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock
> -----Original Message-----
> From: Joe Damato <jdamato@...tly.com>
> Sent: Monday, June 2, 2025 1:32 PM
> To: Jakub Kicinski <kuba@...nel.org>
> Cc: Stanislav Fomichev <stfomichev@...il.com>; netdev@...r.kernel.org;
> john.cs.hey@...il.com; Keller, Jacob E <jacob.e.keller@...el.com>;
> syzbot+846bb38dc67fe62cc733@...kaller.appspotmail.com; Nguyen, Anthony L
> <anthony.l.nguyen@...el.com>; Kitszel, Przemyslaw
> <przemyslaw.kitszel@...el.com>; Andrew Lunn <andrew+netdev@...n.ch>; David
> S. Miller <davem@...emloft.net>; Eric Dumazet <edumazet@...gle.com>;
> Paolo Abeni <pabeni@...hat.com>; moderated list:INTEL ETHERNET DRIVERS
> <intel-wired-lan@...ts.osuosl.org>; open list <linux-kernel@...r.kernel.org>
> Subject: Re: [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock
>
> On Fri, May 30, 2025 at 06:31:40PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 May 2025 12:45:13 -0700 Joe Damato wrote:
> > > > nit: as Jakub mentioned in another thread, it seems more about the
> > > > flush_work waiting for the reset_task to complete rather than
> > > > wq mutexes (which are fake)?
> > >
> > > Hm, I probably misunderstood something. Also, not sure what you
> > > meant by the wq mutexes being fake?
> > >
> > > My understanding (which is prob wrong) from the syzbot and user
> > > report was that the order of wq mutex and rtnl are inverted in the
> > > two paths, which can cause a deadlock if both paths run.
> >
> > Take a look at touch_work_lockdep_map(), theres nosaj thing as wq mutex.
> > It's just a lockdep "annotation" that helps lockdep connect the dots
> > between waiting thread and the work item, not a real mutex. So the
> > commit msg may be better phrased like this (modulo the lines in front):
> >
> > CPU 0:
> > , - RTNL is held
> > / - e1000_close
> > | - e1000_down
> > +- - cancel_work_sync (cancel / wait for e1000_reset_task())
> > |
> > | CPU 1:
> > | - process_one_work
> > \ - e1000_reset_task
> > `- take RTNL
>
> OK, I'll resubmit shortly with the following commit message:
>
> e1000: Move cancel_work_sync to avoid deadlock
>
> Previously, e1000_down called cancel_work_sync for the e1000 reset task
> (via e1000_down_and_stop), which takes RTNL.
>
> As reported by users and syzbot, a deadlock is possible in the following
> scenario:
>
> CPU 0:
> - RTNL is held
> - e1000_close
> - e1000_down
> - cancel_work_sync (cancel / wait for e1000_reset_task())
>
> CPU 1:
> - process_one_work
> - e1000_reset_task
> - take RTNL
>
> To remedy this, avoid calling cancel_work_sync from e1000_down
> (e1000_reset_task does nothing if the device is down anyway). Instead,
> call cancel_work_sync for e1000_reset_task when the device is being
> removed.
Acked-by: Jacob Keller <jacob.e.keller@...el.com>
Powered by blists - more mailing lists