lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Tue, 02 Feb 2021 12:02:09 -0800
From:   Jay Vosburgh <jay.vosburgh@...onical.com>
To:     Aichun Li <liaichun@...wei.com>
cc:     davem@...emloft.net, kuba@...nel.org, vfalico@...il.com,
        andy@...yhouse.net, netdev@...r.kernel.org, rose.chen@...wei.com,
        moyufeng <moyufeng@...wei.com>
Subject: Re: [PATCH net v2]bonding: check port and aggregator when select

Aichun Li <liaichun@...wei.com> wrote:

>When the network service is repeatedly restarted in 802.3ad, there is a low
> probability that oops occurs.
>Test commands:systemctl restart network
>
>1.crash: __enable_port():port->slave is NULL
[...]
>     PC: ffff000000e2fcd0  [ad_agg_selection_logic+328]
[...]
>2.I also have another call stack, same as in another person's post:
>https://lore.kernel.org/netdev/52630cba-cc60-a024-8dd0-8319e5245044@huawei.com/

	What hardware platform are you using here?

	moyufeng <moyufeng@...wei.com> appears to be using the same
platform, and I've not had any success so far with the provided script
to reproduce the issue.  I'm using an x86_64 system, however, so I
wonder if perhaps this platform needs a barrier somewhere that x86 does
not, or there's something different in the timing of the device driver
close logic.

>Signed-off-by: Aichun Li <liaichun@...wei.com>
>---
> drivers/net/bonding/bond_3ad.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index aa001b16765a..9c8894631bdd 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -183,7 +183,7 @@ static inline void __enable_port(struct port *port)
> {
> 	struct slave *slave = port->slave;
> 
>-	if ((slave->link == BOND_LINK_UP) && bond_slave_is_up(slave))
>+	if (slave && (slave->link == BOND_LINK_UP) && bond_slave_is_up(slave))
> 		bond_set_slave_active_flags(slave, BOND_SLAVE_NOTIFY_LATER);
> }

	This change seems like a band aid to cover the real problem.
The caller of __enable_port is ad_agg_selection_logic, and it shouldn't
be possible for port->slave to be NULL when assigned to an aggregator.

>@@ -1516,6 +1516,7 @@ static void ad_port_selection_logic(struct port *port, bool *update_slave_arr)
> 				  port->actor_port_number,
> 				  port->aggregator->aggregator_identifier);
> 		} else {
>+			port->aggregator = &(SLAVE_AD_INFO(slave)->aggregator);
> 			slave_err(bond->dev, port->slave->dev,
> 				  "Port %d did not find a suitable aggregator\n",
> 				  port->actor_port_number);

	This change isn't correct; it's assigning the port to a more or
less random aggregator.  This would eliminate the panic, but isn't doing
the right thing.  At this point in the code, the selection logic has
failed to find an aggregator that matches, and also failed to find a
free aggregator.

	I do need to fix up the failure handling here when it hits the
"did not find a suitable agg" case; the code here is simply wrong, and
has been wrong since the beginning.  I'll hack the driver to induce this
situation rather than reproducing whatever problem is making it unable
to find a suitable aggregator.

	-J

---
	-Jay Vosburgh, jay.vosburgh@...onical.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ