Date:   Thu, 6 Aug 2020 20:39:22 -0700
From:   Stephen Hemminger <stephen@...workplumber.org>
To:     Nikolay Aleksandrov <nikolay@...ulusnetworks.com>
Cc:     Rasmus Villemoes <rasmus.villemoes@...vas.dk>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: rtnl_trylock() versus SCHED_FIFO lockup

On Thu, 6 Aug 2020 12:46:43 +0300
Nikolay Aleksandrov <nikolay@...ulusnetworks.com> wrote:

> On 06/08/2020 12:17, Rasmus Villemoes wrote:
> > On 06/08/2020 01.34, Stephen Hemminger wrote:  
> >> On Wed, 5 Aug 2020 16:25:23 +0200
> >> Rasmus Villemoes <rasmus.villemoes@...vas.dk> wrote:
> >>  
> >>> Hi,
> >>>
> >>> We're seeing occasional lockups on an embedded board (running an -rt
> >>> kernel), which I believe I've tracked down to the
> >>>
> >>>             if (!rtnl_trylock())
> >>>                     return restart_syscall();
> >>>
> >>> in net/bridge/br_sysfs_br.c. The problem is that some SCHED_FIFO task
> >>> writes a "1" to the /sys/class/net/foo/bridge/flush file, while some
> >>> lower-priority SCHED_FIFO task happens to hold rtnl_lock(). When that
> >>> happens, the higher-priority task is stuck in an eternal ERESTARTNOINTR
> >>> loop, and the lower-priority task never gets runtime and thus cannot
> >>> release the lock.
> >>>
> >>> I've written a script that rather quickly reproduces this both on our
> >>> target and my desktop machine (pinning everything on one CPU to emulate
> >>> the uni-processor board), see below. Also, with this hacky patch  
> >>
> >> There is a reason for the trylock, it works around a priority inversion.  
> > 
> > Can you elaborate? It seems to me that it _causes_ a priority inversion
> > since priority inheritance doesn't have a chance to kick in.
> >   
> >> The real problem is expecting a SCHED_FIFO task to be safe with this
> >> kind of network operation.  
> > 
> > Maybe. But ignoring the SCHED_FIFO/rt-prio stuff, it also seems a bit
> > odd to do what is essentially a busy-loop - yes, the restart_syscall()
> > allows signals to be delivered (including allowing the process to get
> > killed), but in the absence of any signals, the pattern essentially
> > boils down to
> > 
> >   while (!rtnl_trylock())
> >     ;
> > 
> > So even for regular tasks, this seems to needlessly hog the cpu.
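> > 
> > (For reference, restart_syscall() does no waiting at all; it is essentially just
> > 
> >   static inline int restart_syscall(void)
> >   {
> >           set_tsk_thread_flag(current, TIF_SIGPENDING);
> >           return -ERESTARTNOINTR;
> >   }
> > 
> > so the syscall is re-entered right away and we bounce between the syscall entry
> > point and the failing trylock.)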
> > 
> > I tried this
> > 
> > diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> > index 0318a69888d4..e40e264f9b16 100644
> > --- a/net/bridge/br_sysfs_br.c
> > +++ b/net/bridge/br_sysfs_br.c
> > @@ -44,8 +44,8 @@ static ssize_t store_bridge_parm(struct device *d,
> >         if (endp == buf)
> >                 return -EINVAL;
> > 
> > -       if (!rtnl_trylock())
> > -               return restart_syscall();
> > +       if (rtnl_lock_interruptible())
> > +               return -ERESTARTNOINTR;
> > 
> >         err = (*set)(br, val);
> >         if (!err)
> > 
> > with the obvious definition of rtnl_lock_interruptible(), and it makes
> > the problem go away. Isn't it better to sleep waiting for the lock (and
> > with -rt, giving proper priority boost) or a signal to arrive rather
> > than busy-looping back and forth between syscall entry point and the
> > trylock()?
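> > 
> > (The "obvious definition" I have in mind is only a sketch along these lines --
> > there is no rtnl_lock_interruptible() in the tree today, and rtnl_mutex is private
> > to net/core/rtnetlink.c, so that is where it would have to live:
> > 
> >   int rtnl_lock_interruptible(void)
> >   {
> >           /* sleep on the mutex, but return -EINTR if a signal arrives */
> >           return mutex_lock_interruptible(&rtnl_mutex);
> >   }
> > 
> > and the caller above turns that non-zero return into -ERESTARTNOINTR.)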
> > 
> > I see quite a lot of
> > 
> >     if (mutex_lock_interruptible(...))
> >             return -ERESTARTSYS;
> > 
> > but for the rtnl_mutex, I see the trylock...restart_syscall pattern
> > being used in a couple of places. So there must be something special
> > about the rtnl_mutex?
> > 
> > Thanks,
> > Rasmus
> >   
> 
> Hi Rasmus,
> I haven't tested anything, but git history (and some grepping) points to deadlocks when
> sysfs entries are changed under rtnl.
> For example, check af38f2989572704a846a5577b5ab3b1e2885cbfb and 336ca57c3b4e2b58ea3273e6d978ab3dfa387b4c.
> This is a common usage pattern throughout net/; the bridge is not the only case, and there are
> more commits which talk about deadlocks.
> Again, I haven't verified anything, but it seems that on device delete (with rtnl held) the
> sysfs delete would wait for current readers, while those readers might be stuck waiting on
> rtnl, and we can deadlock.
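> 
> To illustrate how I read those commits (again, not something I've reproduced), an
> unconditional rtnl_lock() in the sysfs store path could deadlock roughly like this:
> 
>   task A: rtnl_lock()
>   task B: write() to a bridge sysfs file, enters the store callback
>   task A: unregister_netdevice() -> removes the device's sysfs files and
>           waits for their current users (task B) to finish
>   task B: rtnl_lock() -> blocks forever, since A holds rtnl while waiting for B
> 
> The trylock avoids this by backing out of the sysfs callback instead of blocking.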
> 

I was referring to AB-BA lock inversion problems.

Yes, the trylock goes back to:

commit af38f2989572704a846a5577b5ab3b1e2885cbfb
Author: Eric W. Biederman <ebiederm@...ssion.com>
Date:   Wed May 13 17:00:41 2009 +0000

    net: Fix bridgeing sysfs handling of rtnl_lock
    
    Holding rtnl_lock when we are unregistering the sysfs files can
    deadlock if we unconditionally take rtnl_lock in a sysfs file.  So fix
    it with the now familiar patter of: rtnl_trylock and syscall_restart()
    
    Signed-off-by: Eric W. Biederman <ebiederm@...stanetworks.com>
    Signed-off-by: David S. Miller <davem@...emloft.net>


The problem is that the unregister of a netdevice happens under rtnl, and this
unregister path has to remove sysfs and other objects.
So those objects have to use conditional locking.
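
Concretely, the conditional locking is the trylock pattern the diff above would
remove: if rtnl is contended, the sysfs store callback backs out (and the write()
is transparently restarted) rather than sleeping while it is still an active user
of the sysfs file that the unregister path is trying to drain:

	if (!rtnl_trylock())
		return restart_syscall();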




