linux-kernel - Re: frequent softlockups with 3.10rc6.

Message-ID: <CA+55aFz8rz=c8DsNHpBJW1K=FH-ztuPAMNOoGfM6HQyZByQ9mQ@mail.gmail.com>
Date:	Sat, 29 Jun 2013 15:23:48 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Dave Jones <davej@...hat.com>, Dave Chinner <david@...morbit.com>,
	Oleg Nesterov <oleg@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones <davej@...hat.com> wrote:
>
> So with that patch, those two boxes have now been fuzzing away for
> over 24hrs without seeing that specific sync related bug.

Ok, so at least that confirms that yes, the problem is the excessive
contention on inode_sb_list_lock.

Ugh. There's no way we can do that patch by DaveC for 3.10. Not only
is it scary, Andi pointed out that it's actively buggy and will miss
inodes that need writeback due to moving things to private lists.

So I suspect we'll have to do 3.10 with this starvation issue in
place, and mark for stable backporting whatever eventual fix we find.

> I did see the trace below, but I think that's a different problem..
> Not sure who to point at for that one though. Linus?

Hmm.

> [ 1583.293952] RIP: 0010:[<ffffffff810dd856>]  [<ffffffff810dd856>] stop_machine_cpu_stop+0x86/0x110

I'm not sure how sane the watchdog is over stop_machine situations. I
think we disable the watchdog for suspend/resume exactly because
stop-machine can take almost arbitrarily long. I'm assuming you're
stress-testing (perhaps unintentionally) the cpu offlining/onlining
and/or memory migration, which are just fundamentally big expensive
things.

Does the machine recover? Because if it does, I'd be inclined to just
ignore it. Although it would be interesting to hear what triggers
this: normal users (and I'm assuming you're still running trinity as
non-root) generally should not be able to trigger stop-machine
events.

                  Linus