Message-ID: <20200805163425.6c13ef11@hermes.lan>
Date: Wed, 5 Aug 2020 16:34:25 -0700
From: Stephen Hemminger <stephen@...workplumber.org>
To: Rasmus Villemoes <rasmus.villemoes@...vas.dk>
Cc: Network Development <netdev@...r.kernel.org>
Subject: Re: rtnl_trylock() versus SCHED_FIFO lockup

On Wed, 5 Aug 2020 16:25:23 +0200
Rasmus Villemoes <rasmus.villemoes@...vas.dk> wrote:
> Hi,
>
> We're seeing occasional lockups on an embedded board (running an -rt
> kernel), which I believe I've tracked down to the
>
>     if (!rtnl_trylock())
>         return restart_syscall();
>
> in net/bridge/br_sysfs_br.c. The problem is that some SCHED_FIFO task
> writes a "1" to the /sys/class/net/foo/bridge/flush file, while some
> lower-priority SCHED_FIFO task happens to hold rtnl_lock(). When that
> happens, the higher-priority task is stuck in an eternal ERESTARTNOINTR
> loop, and the lower-priority task never gets runtime and thus cannot
> release the lock.
>
> I've written a script that rather quickly reproduces this both on our
> target and my desktop machine (pinning everything on one CPU to emulate
> the uni-processor board), see below. Also, with this hacky patch

There is a reason for the trylock: it works around a priority inversion.
The real problem is expecting a SCHED_FIFO task to be safe with this
kind of network operation.
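
To make the failure mode described in the quoted report concrete, here is
a minimal user-space sketch. It is not kernel code and not a proposed fix.
It models one CPU with a strict fixed-priority scheduler (as SCHED_FIFO
behaves on a uniprocessor) and two tasks: a low-priority one that already
holds the lock and a high-priority one that wants it. All names here
(simulate, enum policy, struct sim_result, the tick counts) are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: how the high-priority task reacts to a held lock. */
enum policy {
    TRYLOCK_RETRY,  /* fail, restart the syscall, stay runnable */
    BLOCKING_LOCK   /* sleep until the holder unlocks */
};

struct sim_result {
    bool high_done;   /* did the high-priority task ever get the lock? */
    int  ticks_used;  /* scheduler ticks consumed before returning */
};

/*
 * hold_ticks: CPU ticks the low-priority holder still needs before it
 *             can release the lock (must be > 0).
 * max_ticks:  give up after this many ticks.
 */
static struct sim_result simulate(enum policy p, int hold_ticks, int max_ticks)
{
    bool lock_held = true;      /* low-priority task starts as the holder */
    bool high_blocked = false;  /* high-priority task starts runnable */
    int low_remaining = hold_ticks;
    struct sim_result r = { false, 0 };

    assert(hold_ticks > 0);

    for (int t = 0; t < max_ticks; t++) {
        r.ticks_used = t + 1;

        if (!high_blocked) {
            /* Strict priority: a runnable high task always gets the CPU. */
            if (!lock_held) {
                r.high_done = true;
                return r;
            }
            if (p == BLOCKING_LOCK)
                high_blocked = true;  /* go to sleep, freeing the CPU */
            /* TRYLOCK_RETRY: nothing changes, so the next tick repeats. */
            continue;
        }

        /* Only when the high task sleeps does the low task run. */
        if (--low_remaining <= 0) {
            lock_held = false;
            high_blocked = false;     /* unlock wakes the waiter */
        }
    }
    return r;
}
```

With these assumptions, simulate(TRYLOCK_RETRY, 3, 1000) uses all 1000
ticks and returns with high_done false, because the holder never gets a
tick in which to unlock. simulate(BLOCKING_LOCK, 3, 1000) completes in 5
ticks. The blocking variant is there only for contrast: the trylock in the
bridge sysfs code exists for a reason, so simply replacing it is not what
this sketch suggests.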