[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2311b5965adb4ccea83b6072115efc6c@huawei.com>
Date: Wed, 6 Nov 2019 11:19:24 +0000
From: Salil Mehta <salil.mehta@...wei.com>
To: Marc Zyngier <maz@...nel.org>,
"edumazet@...gle.com" <edumazet@...gle.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"lipeng (Y)" <lipeng321@...wei.com>,
"Zhuangyuzeng (Yisen)" <yisen.zhuang@...wei.com>,
"David S . Miller" <davem@...emloft.net>,
Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH] net: hns: Ensure that interface teardown cannot race with
TX interrupt
> From: Marc Zyngier [mailto:maz@...nel.org]
> Sent: Wednesday, November 6, 2019 8:18 AM
> To: Salil Mehta <salil.mehta@...wei.com>
Hi Marc,
> On Tue, 5 Nov 2019 18:41:11 +0000
> Salil Mehta <salil.mehta@...wei.com> wrote:
>
> Hi Salil,
>
> > Hi Marc,
> > I tested with the patch on D05 with the lockdep enabled kernel with below options
> > and I could not reproduce the deadlock. I do not argue the issue being mentioned
> > as this looks to be a clear bug which should hit while TX data-path is running
> > and we try to disable the interface.
>
> Lockdep screaming at you doesn't mean the deadly scenario happens in
> practice, and I've never seen the machine hanging in these conditions.
> But I've also never tried to trigger it in anger.
>
> > Could you please help me know the exact set of steps you used to get into this
> > problem. Also, are you able to re-create it easily/frequently?
>
> I just need to issue "reboot" (which calls "ip link ... down") for this
> to trigger. Here's a full splat[1], as well as my full config[2]. It is
> 100% repeatable.
>
> > # Kernel Config options:
> > CONFIG_LOCKDEP_SUPPORT=y
> > CONFIG_LOCKDEP=y
>
> You'll need at least
>
> CONFIG_PROVE_LOCKING=y
> CONFIG_NET_POLL_CONTROLLER=y
Few points:
1. To me netpoll causing spinlock deadlock with IRQ leg of TX and ip util is
highly unlikely since netpoll runs with both RX/TX interrupts disabled.
It runs in polling mode to facilitate parallel path to features like
Netconsole, netdump etc. hence, deadlock because of the netpoll should
be highly unlikely. Therefore, smells of some other problem here...
2. Also, I remember patch[s1][s2] from Eric Dumazet to disable netpoll on many
NICs way back in 4.19 kernel on the basis of Song Liu's findings.
Problem:
Aah, I see the problem now, it is because of the stray code related to the
NET_POLL_CONTROLLER in hns driver which actually should have got remove within
the patch[s1], and that also explains why it does not get hit while NET POLL
is disabled.
/* netif_tx_lock will turn down the performance, set only when necessary */
#ifdef CONFIG_NET_POLL_CONTROLLER
#define NETIF_TX_LOCK(ring) spin_lock(&(ring)->lock)
#define NETIF_TX_UNLOCK(ring) spin_unlock(&(ring)->lock)
#else
#define NETIF_TX_LOCK(ring)
#define NETIF_TX_UNLOCK(ring)
#endif
Once you define CONFIG_NET_POLL_CONTROLLER in the latest code these macros
Kick-in even for the normal NAPI path. Which can cause deadlock and that perhaps
is what you are seeing?
Now, the question is do we require these locks in normal NAPI poll? I do not
see that we need them anymore as Tasklets are serialized to themselves and
configuration path like "ip <intf> down" cannot conflict with NAPI poll path
as the later is always disabled prior performing interface down operation.
Hence, no conflict there.
As a side analysis, I could figure out some contentions in the configuration
path not related to this though. :)
Suggested Solution:
Since we do not have support of NET_POLL_CONTROLLER macros NETIF_TX_[UN]LOCK
We should remove these NET_POLL_CONTROLLER macros altogether for now.
Though, I still have not looked comprehensively how other are able to use
Debugging utils like netconsole etc without having NET_POLL_CONTROLLER.
Maybe @Eric Dumazet might give us some insight on this?
If you agree with this then I can send a patch to remove these from hns
driver. This should solve your problem as well?
[S1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4bd2c03be7
[S2] https://lkml.org/lkml/2018/10/4/32
>
> in order to hit it.
>
> Thanks,
>
> M.
>
> [1] https://paste.debian.net/1114451/
> [2] https://paste.debian.net/1114472/
> --
> Jazz is not dead. It just smells funny...
Powered by blists - more mailing lists