lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 6 Nov 2019 11:19:24 +0000
From:   Salil Mehta <salil.mehta@...wei.com>
To:     Marc Zyngier <maz@...nel.org>,
        "edumazet@...gle.com" <edumazet@...gle.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "lipeng (Y)" <lipeng321@...wei.com>,
        "Zhuangyuzeng (Yisen)" <yisen.zhuang@...wei.com>,
        "David S . Miller" <davem@...emloft.net>,
        Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH] net: hns: Ensure that interface teardown cannot race with
 TX interrupt

> From: Marc Zyngier [mailto:maz@...nel.org]
> Sent: Wednesday, November 6, 2019 8:18 AM
> To: Salil Mehta <salil.mehta@...wei.com>

Hi Marc,

> On Tue, 5 Nov 2019 18:41:11 +0000
> Salil Mehta <salil.mehta@...wei.com> wrote:
> 
> Hi Salil,
> 
> > Hi Marc,
> > I tested with the patch on D05 with the lockdep enabled kernel with below options
> > and I could not reproduce the deadlock. I do not argue the issue being mentioned
> > as this looks to be a clear bug which should hit while TX data-path is running
> > and we try to disable the interface.
> 
> Lockdep screaming at you doesn't mean the deadly scenario happens in
> practice, and I've never seen the machine hanging in these conditions.
> But I've also never tried to trigger it in anger.
> 
> > Could you please help me know the exact set of steps you used to get into this
> > problem. Also, are you able to re-create it easily/frequently?
> 
> I just need to issue "reboot" (which calls "ip link ... down") for this
> to trigger. Here's a full splat[1], as well as my full config[2]. It is
> 100% repeatable.
> 
> > # Kernel Config options:
> > CONFIG_LOCKDEP_SUPPORT=y
> > CONFIG_LOCKDEP=y
> 
> You'll need at least
> 
> CONFIG_PROVE_LOCKING=y
> CONFIG_NET_POLL_CONTROLLER=y


Few points:
1. To me netpoll causing spinlock deadlock with IRQ leg of TX and ip util is
    highly unlikely since netpoll runs with both RX/TX interrupts disabled.
    It runs in polling mode to facilitate parallel path to features like
    Netconsole, netdump etc. hence, deadlock because of the netpoll should
    be highly unlikely. Therefore, smells of some other problem here...
2. Also, I remember patch[s1][s2] from Eric Dumazet to disable netpoll on many
    NICs way back in 4.19 kernel on the basis of Song Liu's findings.

Problem: 
Aah, I see the problem now, it is because of the stray code related to the
NET_POLL_CONTROLLER in hns driver which actually should have got remove within
the patch[s1], and that also explains why it does not get hit while NET POLL
is disabled.


/* netif_tx_lock will turn down the performance, set only when necessary */
#ifdef CONFIG_NET_POLL_CONTROLLER
#define NETIF_TX_LOCK(ring) spin_lock(&(ring)->lock)
#define NETIF_TX_UNLOCK(ring) spin_unlock(&(ring)->lock)
#else
#define NETIF_TX_LOCK(ring)
#define NETIF_TX_UNLOCK(ring)
#endif


Once you define CONFIG_NET_POLL_CONTROLLER in the latest code these macros
Kick-in even for the normal NAPI path. Which can cause deadlock and that perhaps
is what you are seeing?

Now, the question is do we require these locks in normal NAPI poll? I do not
see that we need them anymore as Tasklets are serialized to themselves and
configuration path like "ip <intf> down" cannot conflict with NAPI poll path
as the later is always disabled prior performing interface down operation.
Hence, no conflict there.

As  a side analysis, I could figure out some contentions in the configuration
path not related to this though. :) 


Suggested Solution:
Since we do not have support of NET_POLL_CONTROLLER macros NETIF_TX_[UN]LOCK
We should remove these NET_POLL_CONTROLLER macros altogether for now.

Though, I still have not looked comprehensively how other are able to use
Debugging utils like netconsole etc without having NET_POLL_CONTROLLER.
Maybe @Eric Dumazet might give us some insight on this?


If you agree with this then I can send a patch to remove these from hns
driver. This should solve your problem as well?



[S1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4bd2c03be7
[S2] https://lkml.org/lkml/2018/10/4/32

> 
> in order to hit it.
> 
> Thanks,
> 
> 	M.
> 
> [1] https://paste.debian.net/1114451/
> [2] https://paste.debian.net/1114472/
> --
> Jazz is not dead. It just smells funny...

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ