lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 06 Nov 2019 13:25:56 +0109
From:   Marc Zyngier <maz@...nel.org>
To:     Salil Mehta <salil.mehta@...wei.com>
Cc:     <edumazet@...gle.com>, <netdev@...r.kernel.org>,
        "lipeng (Y)" <lipeng321@...wei.com>,
        "Zhuangyuzeng (Yisen)" <yisen.zhuang@...wei.com>,
        "David S . Miller" <davem@...emloft.net>,
        Linuxarm <linuxarm@...wei.com>
Subject: RE: [PATCH] net: hns: Ensure that interface teardown cannot race  with TX interrupt

On 2019-11-06 12:28, Salil Mehta wrote:
> Hi Marc,
>
>> On Tue, 5 Nov 2019 18:41:11 +0000
>> Salil Mehta <salil.mehta@...wei.com> wrote:
>>
>> Hi Salil,
>>
>> > Hi Marc,
>> > I tested with the patch on D05 with the lockdep enabled kernel 
>> with below options
>> > and I could not reproduce the deadlock. I do not argue the issue 
>> being mentioned
>> > as this looks to be a clear bug which should hit while TX 
>> data-path is running
>> > and we try to disable the interface.
>>
>> Lockdep screaming at you doesn't mean the deadly scenario happens in
>> practice, and I've never seen the machine hanging in these 
>> conditions.
>> But I've also never tried to trigger it in anger.
>>
>> > Could you please help me know the exact set of steps you used to 
>> get into this
>> > problem. Also, are you able to re-create it easily/frequently?
>>
>> I just need to issue "reboot" (which calls "ip link ... down") for 
>> this
>> to trigger. Here's a full splat[1], as well as my full config[2]. It 
>> is
>> 100% repeatable.
>>
>> > # Kernel Config options:
>> > CONFIG_LOCKDEP_SUPPORT=y
>> > CONFIG_LOCKDEP=y
>>
>> You'll need at least
>>
>> CONFIG_PROVE_LOCKING=y
>> CONFIG_NET_POLL_CONTROLLER=y
>
>
> Few points:
> 1. To me netpoll causing spinlock deadlock with IRQ leg of TX and ip 
> util is
>     highly unlikely since netpoll runs with both RX/TX interrupts 
> disabled.
>     It runs in polling mode to facilitate parallel path to features 
> like
>     Netconsole, netdump etc. hence, deadlock because of the netpoll 
> should
>     be highly unlikely. Therefore, smells of some other problem 
> here...
> 2. Also, I remember patch[s1][s2] from Eric Dumazet to disable
> netpoll on many
>     NICs way back in 4.19 kernel on the basis of Song Liu's findings.
>
> Problem:
> Aah, I see the problem now, it is because of the stray code related 
> to the
> NET_POLL_CONTROLLER in hns driver which actually should have got
> remove within
> the patch[s1], and that also explains why it does not get hit while 
> NET POLL
> is disabled.
>
>
> /* netif_tx_lock will turn down the performance, set only when 
> necessary */
> #ifdef CONFIG_NET_POLL_CONTROLLER
> #define NETIF_TX_LOCK(ring) spin_lock(&(ring)->lock)
> #define NETIF_TX_UNLOCK(ring) spin_unlock(&(ring)->lock)
> #else
> #define NETIF_TX_LOCK(ring)
> #define NETIF_TX_UNLOCK(ring)
> #endif
>
>
> Once you define CONFIG_NET_POLL_CONTROLLER in the latest code these 
> macros
> Kick-in even for the normal NAPI path. Which can cause deadlock and
> that perhaps
> is what you are seeing?

Yes, that's the problem.

> Now, the question is do we require these locks in normal NAPI poll? I 
> do not
> see that we need them anymore as Tasklets are serialized to 
> themselves and
> configuration path like "ip <intf> down" cannot conflict with NAPI 
> poll path
> as the later is always disabled prior performing interface down 
> operation.
> Hence, no conflict there.

My preference would indeed be to drop these per-queue locks if they 
aren't
required. I couldn't figure out from a cursory look at the code whether
two CPUs could serve the same TX queue. If that cannot happen by 
construction,
then these locks are perfectly useless and should be removed.

>
> As  a side analysis, I could figure out some contentions in the 
> configuration
> path not related to this though. :)
>
>
> Suggested Solution:
> Since we do not have support of NET_POLL_CONTROLLER macros 
> NETIF_TX_[UN]LOCK
> We should remove these NET_POLL_CONTROLLER macros altogether for now.
>
> Though, I still have not looked comprehensively how other are able to 
> use
> Debugging utils like netconsole etc without having 
> NET_POLL_CONTROLLER.
> Maybe @Eric Dumazet might give us some insight on this?
>
>
> If you agree with this then I can send a patch to remove these from 
> hns
> driver. This should solve your problem as well?

Sure, as long as you can guarantee that these locks are never used for 
anything
useful.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ