Date:	Sat, 29 Jun 2013 15:23:48 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Dave Jones <davej@...hat.com>, Dave Chinner <david@...morbit.com>,
	Oleg Nesterov <oleg@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones <davej@...hat.com> wrote:
>
> So with that patch, those two boxes have now been fuzzing away for
> over 24hrs without seeing that specific sync related bug.

Ok, so at least that confirms that yes, the problem is the excessive
contention on inode_sb_list_lock.

Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
is it scary, Andi pointed out that it's actively buggy and will miss
inodes that need writeback due to moving things to private lists.

So I suspect we'll have to do 3.10 with this starvation issue in
place, and mark for stable backporting whatever eventual fix we find.

> I did see the trace below, but I think that's a different problem..
> Not sure who to point at for that one though. Linus?

Hmm.

> [ 1583.293952] RIP: 0010:[<ffffffff810dd856>]  [<ffffffff810dd856>] stop_machine_cpu_stop+0x86/0x110

I'm not sure how sane the watchdog is over stop_machine situations. I
think we disable the watchdog for suspend/resume exactly because
stop-machine can take almost arbitrarily long. I'm assuming you're
stress-testing (perhaps unintentionally) the cpu offlining/onlining
and/or memory migration, which are just fundamentally big, expensive
operations.

Does the machine recover? Because if it does, I'd be inclined to just
ignore it. Although it would be interesting to hear what triggers this:
normal users (and I'm assuming you're still running trinity as
non-root) generally should not be able to trigger stop-machine
events.

                  Linus