netdev - Re: [Bugme-new] [Bug 7974] New: BUG: scheduling while atomic: swapper/0x10000100/0


Message-Id: <200702140211.l1E2BF22031298@death.nxdomain.ibm.com>
Date:	Tue, 13 Feb 2007 18:11:15 -0800
From:	Jay Vosburgh <fubar@...ibm.com>
To:	Andy Gospodarek <andy@...yhouse.net>
cc:	David Miller <davem@...emloft.net>, akpm@...ux-foundation.org,
	netdev@...r.kernel.org, shemminger@...ux-foundation.org,
	lpiccilli@...re.com.br, bugme-daemon@...zilla.kernel.org
Subject: Re: [Bugme-new] [Bug 7974] New: BUG: scheduling while atomic: swapper/0x10000100/0 

Andy Gospodarek <andy@...yhouse.net> wrote:

>This is exactly the problem I've got, Jay.  I'd love to come up with
>something that will be a smaller patch to solve this in the near term
>and then focus on a larger set of changes down the road but it doesn't
>seem likely.

	That's more or less the conclusion I've reached as well.  The
locking model needs to be redone from scratch; there are too many
players and the conditions have changed since things were originally
done (in terms of side effects, particularly the places that may sleep
today that didn't in the past).  It's tricky to redo the bulk of the
innards in small modular steps.

>> 	Andy, one thought: do you think it would work better to simplify
>> the locking that is there first, i.e., convert the timers to work
>> queues, have a single dispatcher that handles everything (and can be
>> suspended for mutexing purposes), as in the patch I sent you?  The
>> problem isn't just rtnl; there also has to be a release of the bonding
>> locks themselves (to handle the might sleep issues), and that's tricky
>> to do with so many entities operating concurrently.  Reducing the number
>> of involved parties should make the problem simpler.
>> 
>
>I really don't feel like there are that many operations happening
>concurrently, but having a workqueue that managed and dispatched the
>operations and detected current link status would probably be helpful
>for long term maintenance.  It would probably be wise to have individual
>workqueues that managed any mode-specific operations, so their processing
>doesn't interfere with any link-checking operations.

	I like having the mode-specific monitors simply be special case
monitors in a unified "monitor system." It resolves the link-check
vs. mode-specific conflict, and allows all of the periodic things to be
paused as a set for mutexing purposes.  I'm happy to be convinced that
I'm just blowing hooey here, but that's how it seems to me.
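
	To make the "unified monitor system" idea concrete, here is a
minimal userspace sketch in C.  It is not the bonding driver's code and
not the patch referred to above: every name in it (monitor_system,
monitor_register, monitor_suspend, and so on) is hypothetical, and a
plain tick counter stands in for a kernel work queue.  It only shows one
dispatcher running every periodic monitor, mode-specific ones included,
with a single suspend/resume pair that pauses them as a set.

```c
#include <stddef.h>

#define MAX_MONITORS 8

/* Hypothetical names throughout; none of this is from the bonding driver. */
struct monitor {
	const char *name;
	unsigned int interval;		/* run every 'interval' ticks */
	void (*run)(void *ctx);
	void *ctx;
};

struct monitor_system {
	struct monitor mons[MAX_MONITORS];
	size_t count;
	unsigned int suspend_depth;	/* > 0 means all monitors are paused */
	unsigned long tick;
};

/* Add a periodic monitor; the link check and the alb monitor are both
 * just entries in the same table.  Returns 0 on success, -1 on error. */
static int monitor_register(struct monitor_system *ms, const char *name,
			    unsigned int interval,
			    void (*run)(void *), void *ctx)
{
	if (ms->count >= MAX_MONITORS || interval == 0 || !run)
		return -1;
	ms->mons[ms->count].name = name;
	ms->mons[ms->count].interval = interval;
	ms->mons[ms->count].run = run;
	ms->mons[ms->count].ctx = ctx;
	ms->count++;
	return 0;
}

/* Pause and unpause every monitor at once (calls may nest). */
static void monitor_suspend(struct monitor_system *ms)
{
	ms->suspend_depth++;
}

static void monitor_resume(struct monitor_system *ms)
{
	if (ms->suspend_depth)
		ms->suspend_depth--;
}

/* One dispatcher pass: advance the tick and run whatever is due.
 * Returns the number of monitors that ran (0 while suspended). */
static int monitor_dispatch(struct monitor_system *ms)
{
	int ran = 0;
	size_t i;

	ms->tick++;
	if (ms->suspend_depth)
		return 0;
	for (i = 0; i < ms->count; i++) {
		if (ms->tick % ms->mons[i].interval == 0) {
			ms->mons[i].run(ms->mons[i].ctx);
			ran++;
		}
	}
	return ran;
}

/* Example monitor body: counts how often it was invoked. */
static void count_call(void *ctx)
{
	(*(int *)ctx)++;
}
```

	In this sketch a user-initiated operation would call
monitor_suspend() before touching shared state and monitor_resume()
afterwards, so it only has to exclude one dispatcher rather than several
independent timers.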

	For balance-alb (the biggest offender in terms of locking
violations), the link monitor, alb mode monitor, transmit activity, and
user-initiated stuff (add or remove slave, for example) all need to be
mutexed against one another.  The user-initiated stuff comes in with
rtnl and the link monitor needs rtnl if it has to fail over.  All of
them need the regular bonding lock, but the user-initiated stuff and the
link monitor both need to do things (change mac addresses, call
dev_open) with that lock released.
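
	The "do it with the lock released" requirement can be illustrated
with a toy C sketch.  Everything here is a hypothetical stand-in: toy_lock
is a plain flag rather than a kernel lock, toy_set_mac represents any
call that may sleep, and the generation counter is just one assumed way
to notice that state changed while the lock was dropped.  The sketch
shows the shape of the sequence (decide under the lock, drop it, do the
sleeping work, retake it and revalidate), not how the driver does it.

```c
#include <stdbool.h>

/* Toy stand-ins, all hypothetical: not the kernel's locking API. */
struct toy_lock {
	bool held;
};

struct toy_bond {
	struct toy_lock lock;
	int active_slave;
	int generation;		/* bumped on every committed change */
	int atomic_violations;	/* sleeping calls made with the lock held */
	int mac_changes;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

/* Stand-in for an operation that may sleep, such as a MAC change.
 * It records a violation if it is called with the lock held. */
static void toy_set_mac(struct toy_bond *b, int slave)
{
	(void)slave;
	if (b->lock.held)
		b->atomic_violations++;
	b->mac_changes++;
}

/* Fail over to new_slave.  Returns true if the change was committed. */
static bool toy_failover(struct toy_bond *b, int new_slave)
{
	int gen;

	/* Step 1: inspect state and decide what to do, under the lock. */
	toy_lock_acquire(&b->lock);
	if (b->active_slave == new_slave) {
		toy_lock_release(&b->lock);
		return false;
	}
	gen = b->generation;
	toy_lock_release(&b->lock);

	/* Step 2: do the part that may sleep with the lock released. */
	toy_set_mac(b, new_slave);

	/* Step 3: retake the lock, check nothing changed, then commit. */
	toy_lock_acquire(&b->lock);
	if (b->generation != gen) {
		toy_lock_release(&b->lock);
		return false;
	}
	b->active_slave = new_slave;
	b->generation++;
	toy_lock_release(&b->lock);
	return true;
}
```

	The more entities that can run during step 2, the more ways the
revalidation in step 3 can fail, which is why fewer concurrent parties
makes this easier to get right.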

	Do you have any work-in-progress patches or descendants of the
big rework thing I sent you?  I need to get back to this, and whatever
we do it's probably better if we're at least a little bit coordinated.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com