lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b60a6cd0c3934d52aec14b47b2218edf@huawei.com>
Date:   Wed, 6 Nov 2019 19:05:33 +0000
From:   Salil Mehta <salil.mehta@...wei.com>
To:     Marc Zyngier <maz@...nel.org>
CC:     "edumazet@...gle.com" <edumazet@...gle.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "lipeng (Y)" <lipeng321@...wei.com>,
        "Zhuangyuzeng (Yisen)" <yisen.zhuang@...wei.com>,
        "David S . Miller" <davem@...emloft.net>,
        Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH] net: hns: Ensure that interface teardown cannot race
  with TX interrupt

> From: Marc Zyngier [mailto:maz@...nel.org]
> Sent: Wednesday, November 6, 2019 12:17 PM
> To: Salil Mehta <salil.mehta@...wei.com>
> 
> On 2019-11-06 12:28, Salil Mehta wrote:
> > Hi Marc,
> >
> >> On Tue, 5 Nov 2019 18:41:11 +0000
> >> Salil Mehta <salil.mehta@...wei.com> wrote:
> >>
> >> Hi Salil,
> >>
> >> > Hi Marc,
> >> > I tested with the patch on D05 with the lockdep enabled kernel
> >> with below options
> >> > and I could not reproduce the deadlock. I do not argue the issue
> >> being mentioned
> >> > as this looks to be a clear bug which should hit while TX
> >> data-path is running
> >> > and we try to disable the interface.
> >>
> >> Lockdep screaming at you doesn't mean the deadly scenario happens in
> >> practice, and I've never seen the machine hanging in these
> >> conditions.
> >> But I've also never tried to trigger it in anger.
> >>
> >> > Could you please help me know the exact set of steps you used to
> >> get into this
> >> > problem. Also, are you able to re-create it easily/frequently?
> >>
> >> I just need to issue "reboot" (which calls "ip link ... down") for
> >> this
> >> to trigger. Here's a full splat[1], as well as my full config[2]. It
> >> is
> >> 100% repeatable.
> >>
> >> > # Kernel Config options:
> >> > CONFIG_LOCKDEP_SUPPORT=y
> >> > CONFIG_LOCKDEP=y
> >>
> >> You'll need at least
> >>
> >> CONFIG_PROVE_LOCKING=y
> >> CONFIG_NET_POLL_CONTROLLER=y
> >
> >
> > Few points:
> > 1. To me netpoll causing spinlock deadlock with IRQ leg of TX and ip
> > util is
> >     highly unlikely since netpoll runs with both RX/TX interrupts
> > disabled.
> >     It runs in polling mode to facilitate parallel path to features
> > like
> >     Netconsole, netdump etc. hence, deadlock because of the netpoll
> > should
> >     be highly unlikely. Therefore, smells of some other problem
> > here...
> > 2. Also, I remember patch[s1][s2] from Eric Dumazet to disable
> > netpoll on many
> >     NICs way back in 4.19 kernel on the basis of Song Liu's findings.
> >
> > Problem:
> > Aah, I see the problem now, it is because of the stray code related
> > to the
> > NET_POLL_CONTROLLER in hns driver which actually should have got
> > remove within
> > the patch[s1], and that also explains why it does not get hit while
> > NET POLL
> > is disabled.
> >
> >
> > /* netif_tx_lock will turn down the performance, set only when
> > necessary */
> > #ifdef CONFIG_NET_POLL_CONTROLLER
> > #define NETIF_TX_LOCK(ring) spin_lock(&(ring)->lock)
> > #define NETIF_TX_UNLOCK(ring) spin_unlock(&(ring)->lock)
> > #else
> > #define NETIF_TX_LOCK(ring)
> > #define NETIF_TX_UNLOCK(ring)
> > #endif
> >
> >
> > Once you define CONFIG_NET_POLL_CONTROLLER in the latest code these
> > macros
> > Kick-in even for the normal NAPI path. Which can cause deadlock and
> > that perhaps
> > is what you are seeing?
> 
> Yes, that's the problem.


Sure, thanks.


> > Now, the question is do we require these locks in normal NAPI poll? I
> > do not
> > see that we need them anymore as Tasklets are serialized to
> > themselves and
> > configuration path like "ip <intf> down" cannot conflict with NAPI
> > poll path
> > as the later is always disabled prior performing interface down
> > operation.
> > Hence, no conflict there.
> 
> My preference would indeed be to drop these per-queue locks if they
> aren't
> required. I couldn't figure out from a cursory look at the code whether
> two CPUs could serve the same TX queue. If that cannot happen by
> construction,
> then these locks are perfectly useless and should be removed.


Sure.


> > As  a side analysis, I could figure out some contentions in the
> > configuration
> > path not related to this though. :)
> >
> >
> > Suggested Solution:
> > Since we do not have support of NET_POLL_CONTROLLER macros
> > NETIF_TX_[UN]LOCK
> > We should remove these NET_POLL_CONTROLLER macros altogether for now.
> >
> > Though, I still have not looked comprehensively how other are able to
> > use
> > Debugging utils like netconsole etc without having
> > NET_POLL_CONTROLLER.
> > Maybe @Eric Dumazet might give us some insight on this?
> >
> >
> > If you agree with this then I can send a patch to remove these from
> > hns
> > driver. This should solve your problem as well?
> 
> Sure, as long as you can guarantee that these locks are never used for
> anything
> useful.


Hi Marc,
I have floated the patch. Could you please confirm if this solves your issue
and if possible provide your Tested-by? :)

[PATCH net] net: hns: Fix the stray netpoll locks causing deadlock in NAPI path

Thanks
Salil

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ