netdev - Re: [Bonding-devel] [PATCH] bonding: cancel_delayed_work() -> cancel_delayed_work

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <29926.1261078662@death.nxdomain.ibm.com>
Date:	Thu, 17 Dec 2009 11:37:42 -0800
From:	Jay Vosburgh <fubar@...ibm.com>
To:	Laurent Chavey <chavey@...gle.com>
cc:	Jarek Poplawski <jarkao2@...il.com>,
	Johannes Berg <johannes@...solutions.net>,
	Mikhail Markine <markine@...gle.com>, netdev@...r.kernel.org,
	bonding-devel@...ts.sourceforge.net,
	Petri Gynther <pgynther@...gle.com>,
	David Miller <davem@...emloft.net>
Subject: Re: [Bonding-devel] [PATCH] bonding: cancel_delayed_work() -> cancel_delayed_work_sync()

Laurent Chavey <chavey@...gle.com> wrote:

>one instance that could be a problem
>
>__exit bonding_exit(void)
>    bond_free_all()
>      	bond_work_cancel_all(bond);
>        unregister_netdevice(bond_dev)
>
>could the above result in an invalid pointer when trying
>to use bond-> in one of the timer CB ?

	The bonding teardown logic was reworked in October, and there is
no longer a bond_free_all in the current mainline.  What kernel are you
looking at?

	The bond_close function will stop the various work items, and
the ndo_uninit (bond_uninit) will call bond_work_cancel_all as well.

	Actually, on looking at it (it being current mainline),
bond_uninit might need some kind of logic to wait and insure that all
timers have completed before returning.  It comes from unregister, so
the next thing that happens after it returns is that the memory will be
freed (via netdev_run_todo, during rtnl_unlock, if I'm following it
correctly).

	The bond_uninit function is called under RTNL, though, so the
timer functions (bond_mii_monitor, et al) may need additional checks for
kill_timers to insure they don't attempt to acquire RTNL if a cancel is
pending.

	That's kind of tricky itself, since the lock ordering requires
RTNL to be acquired first, so there's no way for bond_mii_monitor (et
al) to check for kill_timers prior to already having RTNL (because the
function acquires RTNL conditionally, only if needed; to do that, it
unlocks the bond lock, then acquires RTNL, then re-locks the bond lock).

	So, the lock dance to acquire RTNL in bond_mii_monitor (et al)
would need some trickery, perhaps a rtnl_trylock loop, that checks
kill_timers each time the trylock fails, e.g.,

	if (bond_miimon_inspect(bond)) {
		read_unlock(&bond->lock);
		while (!rtnl_trylock) {
			read_lock(&bond->lock);
			if (bond->kill_timers)
				goto out;
			read_unlock(&bond->lock);
			/* msleep ? */
		}

		bond_miimon_commit(bond);
		[...]

	So, with the above (and similar changes to the other delayed
work functions, and a big honkin' comment somewhere to explain this), I
suspect that bond_work_cancel_all could use the _sync variant to cancel
the work, as long as kill_timers is set before the cancel_sync is
called.

	Am I missing anything?  Does this seem rational?

>On Thu, Dec 17, 2009 at 10:40 AM, Jarek Poplawski <jarkao2@...il.com> wrote:
>> On Thu, Dec 17, 2009 at 08:12:53AM -0800, Jay Vosburgh wrote:
>>>       There's already logic in the monitors (bond_mii_monitor, et al)
>>> to check a sentinel (kill_timers) and do nothing (not acquire rtnl) and
>>> return.
>>
>> Btw, this check should be repeated if bond->lock is given back and
>> re-acquired. I can't see these kill_timers used in bond_sysfs.c though.

	Yes, this is true, and I think that doing this in the above
manner may eliminate the problem.

>>>       What exactly is the nature of the race that doing cancel..sync
>>> is fixing?  The bond_close function sets kill_timers prior to calling
>>> the cancel functions, so the monitor function might run once, but it
>>> should do nothing.
>>
>> I guess there is a problem with destructions, but I hope Mikhail will
>> give more details.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html