[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1638.1495570412@famine>
Date: Tue, 23 May 2017 13:13:32 -0700
From: Jay Vosburgh <jay.vosburgh@...onical.com>
To: Nithin Sujir <nsujir@...tri.com>
cc: netdev@...r.kernel.org
Subject: Re: bond link state mismatch, rtnl_trylock() vs rtnl_lock()
Nithin Sujir <nsujir@...tri.com> wrote:
>Hi,
>We're encountering a problem in 4.4 LTS where, rarely, the bond link state
>is not updated when the slave link changes.
>
>I've traced the issue to the arp monitor unable to get the rtnl lock. The
>sequence resulting in failure is as below.
>
>bond_loadbalance_arp_mon() periodically called, if slave link is _down_,
>it checks if the slave is sending/receiving packets. If it is, it sets
>flags to be processed later down the function for bond link
>update. However, it sets the slave->link right away.
>
> if (slave->link != BOND_LINK_UP) {
> if (bond_time_in_interval(bond, trans_start, 1) &&
> bond_time_in_interval(bond, slave->last_rx,
>1)) {
>
> slave->link = BOND_LINK_UP;
> slave_state_changed = 1;
>
>
>Later down the function, it tries to get the rtnl_lock. If it doesn't get
>it, it rearms and returns.
>
> if (do_failover || slave_state_changed) {
> if (!rtnl_trylock())
> goto re_arm; <-- returns here
>
> if (slave_state_changed) {
> bond_slave_state_change(bond);
>
>This is the problem. The next time this function is called, the
>slave->link is already marked UP. And we will never update the bond link
>state to UP.
This looks like an ARP monitor version of
commit de77ecd4ef02ca783f7762e04e92b3d0964be66b
Author: Mahesh Bandewar <maheshb@...gle.com>
Date: Mon Mar 27 11:37:33 2017 -0700
bonding: improve link-status update in mii-monitoring
and probably needs a similar fix (possibly for both the
loadbalance and active-backup ARP monitor cases).
>Changing the rtnl_trylock() -> rtnl_lock() _does_ fix the issue.
>
>Is this the right way to fix it? If it is, I can submit this formally.
It's not the right way, unfortunately.
The reason for the rtnl_trylock is that there's a possible race
against bond_close() -> bond_work_cancel_all() trying to cancel the
arp_work workqueue item while it's running. bond_close is called with
RTNL held, so if it has RTNL and is waiting for the work function to
complete, an rtnl_lock call here will deadlock. Some of the trylock
calls in bonding are commented to this effect, but not this one.
-J
>What are the guidelines around using rtnl_lock() vs rtnl_trylock()? Some
>places are using rtnl_lock() and other rtnl_trylock(). Sorry, I couldn't
>find much via a google search or in Documentation/.
>
>Thanks,
>Nithin.
>
>--------------------
>
>diff --git a/drivers/net/bonding/bond_main.c
>b/drivers/net/bonding/bond_main.c
>index 5dca77e..1f60503 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2614,8 +2614,7 @@ static void bond_loadbalance_arp_mon(struct
>work_struct *work)
> rcu_read_unlock();
>
> if (do_failover || slave_state_changed) {
>- if (!rtnl_trylock())
>- goto re_arm;
>+ rtnl_lock();
>
> if (slave_state_changed) {
> bond_slave_state_change(bond);
>
>
---
-Jay Vosburgh, jay.vosburgh@...onical.com
Powered by blists - more mailing lists