Message-ID: <20200430105551.GA4068275@splinter>
Date:   Thu, 30 Apr 2020 13:55:51 +0300
From:   Ido Schimmel <idosch@...sch.org>
To:     Stefan Priebe - Profihost AG <s.priebe@...fihost.ag>
Cc:     roopa@...ulusnetworks.com, nikolay@...ulusnetworks.com,
        davem@...emloft.net,
        "bridge@...ts.linux-foundation.org" 
        <bridge@...ts.linux-foundation.org>, netdev@...r.kernel.org
Subject: Re: BUG: soft lockup while deleting tap interface from vlan aware
 bridge

On Wed, Apr 29, 2020 at 10:52:35PM +0200, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> while running a stable vanilla 4.19.115 kernel I can reproducibly
> trigger this:
> 
> watchdog: BUG: soft lockup - CPU#38 stuck for 22s! [bridge:3570653]
> 
> ...
> 
> Call Trace:
>  nbp_vlan_delete+0x59/0xa0
>  br_vlan_info+0x66/0xd0
>  br_afspec+0x18c/0x1d0
>  br_dellink+0x74/0xd0
>  rtnl_bridge_dellink+0x110/0x220
>  rtnetlink_rcv_msg+0x283/0x360

Nik, Stefan,

My theory is that 4K VLANs are deleted in a batch and preemption is
disabled (please confirm). For each VLAN the kernel needs to go over the
entire FDB and delete affected entries. If the FDB is very large or the
FDB lock is contended this can cause the kernel to loop for more than 20
seconds without calling schedule().
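
For a sense of scale, here is a back-of-the-envelope sketch (user-space C, not kernel code) of how little time each VLAN needs to take before the whole batch passes the 20 second mark. The helper name is hypothetical, and it assumes "4K" means VIDs 1-4094 and that the per-VLAN cost is roughly uniform.

```c
/* Hypothetical helper: average time budget per VLAN (in microseconds)
 * before a batch of n_vlans deletions runs longer than threshold_s
 * seconds without reaching a scheduling point. */
static long per_vlan_budget_us(long threshold_s, long n_vlans)
{
	return threshold_s * 1000000L / n_vlans;
}
```

per_vlan_budget_us(20, 4094) comes out at 4885, so an average of just under 5 ms per VLAN spent walking the FDB or waiting on its lock would be enough.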

To reproduce I added mdelay(100) in br_fdb_delete_by_port() and ran
this:

ip link add name br10 up type bridge vlan_filtering 1
ip link add name dummy10 up type dummy
ip link set dev dummy10 master br10
bridge vlan add vid 1-4094 dev dummy10 master
bridge vlan del vid 1-4094 dev dummy10 master

Got a similar trace to Stefan's. Seems to be fixed by attached:

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index a774e19c41bb..240e260e3461 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -615,6 +615,7 @@ int br_process_vlan_info(struct net_bridge *br,
                                               v - 1, rtm_cmd);
                                v_change_start = 0;
                        }
+                       cond_resched();
                }
                /* v_change_start is set only if the last/whole range changed */
                if (v_change_start)

WDYT?
