linux-kernel - Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need

Message-ID: <1391138977.6284.82.camel@j-VirtualBox>
Date:	Thu, 30 Jan 2014 19:29:37 -0800
From:	Jason Low <jason.low2@...com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	mingo@...hat.com, paulmck@...ux.vnet.ibm.com, Waiman.Long@...com,
	torvalds@...ux-foundation.org, tglx@...utronix.de,
	linux-kernel@...r.kernel.org, riel@...hat.com,
	akpm@...ux-foundation.org, davidlohr@...com, hpa@...or.com,
	andi@...stfloor.org, aswin@...com, scott.norton@...com,
	chegu_vinod@...com
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to
 spin_on_owner if need_resched() triggered while queued

On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > But urgh, nasty problem. Lemme ponder this a bit.
> 
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> (in fact it did a few time when I got the contention/cancel path wrong),
> however I have not ran anything else on it.

I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
workload. After 100 users, the system gets soft lockups.

Some condition may be causing threads to never leave the "goto unqueue"
loop. I added a debug counter, and threads reached more than
1,000,000,000 "goto unqueue" iterations.

I was also initially wondering whether there can be problems when
multiple threads need_resched() and unqueue at the same time. As an
example, 2 nodes that need to reschedule are next to each other in the
middle of the MCS queue. The 1st node executes "while (!(next =
ACCESS_ONCE(node->next)))" and exits the while loop because next is not
NULL. Then, the 2nd node executes its "if (cmpxchg(&prev->next, node,
NULL) != node)". We may then end up in a situation where the node before
the 1st node gets linked with the outdated 2nd node.
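
To make the interleaving above concrete, here is a minimal single-threaded
replay of it. This is only a sketch and not the code from the patch: the
struct qnode type, the cmpxchg_ptr() helper and the node names P, A and B
are hypothetical, and step 3 (the 1st node splicing its predecessor to the
next pointer it read earlier) is my assumption about what happens after
the while loop exits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node; field names are illustrative only. */
struct qnode {
	struct qnode *prev;
	struct qnode *next;
};

/*
 * Single-threaded stand-in for cmpxchg(): store new_val only if *ptr
 * still equals old, and return the value that was seen.
 */
static struct qnode *cmpxchg_ptr(struct qnode **ptr, struct qnode *old,
				 struct qnode *new_val)
{
	struct qnode *seen = *ptr;

	if (seen == old)
		*ptr = new_val;
	return seen;
}

/*
 * Replay the interleaving by hand. Returns 1 if P ends up linked to B
 * even though B has already detached itself from the queue.
 */
static int replay_stale_link(void)
{
	struct qnode p = { NULL, NULL };
	struct qnode a = { NULL, NULL };	/* "1st node" */
	struct qnode b = { NULL, NULL };	/* "2nd node" */
	struct qnode *stale_next, *seen;
	int b_detached;

	/* Build the queue P -> A -> B. */
	p.next = &a;
	a.prev = &p;
	a.next = &b;
	b.prev = &a;

	/* Step 1: A reads its next pointer, sees non-NULL, leaves the loop. */
	stale_next = a.next;
	assert(stale_next == &b);

	/* Step 2: B runs cmpxchg(&prev->next, node, NULL) and succeeds. */
	seen = cmpxchg_ptr(&b.prev->next, &b, NULL);
	b_detached = (seen == &b) && (a.next == NULL);

	/* Step 3 (assumed): A links its predecessor to the stale snapshot. */
	a.prev->next = stale_next;
	stale_next->prev = a.prev;

	return b_detached && p.next == &b;
}
```

Calling replay_stale_link() returns 1 for this ordering: B's cmpxchg
succeeded, yet P->next points at B. A real reproduction would need
concurrent threads and the actual barriers; the replay only fixes one
possible ordering of the steps.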
