Date:	Fri, 12 Sep 2008 15:01:35 -0700
From:	john stultz <johnstul@...ibm.com>
To:	Ulrich Windl <ulrich.windl@...uni-regensburg.de>,
	Thomas Gleixner <tglx@...utronix.de>, mingo <mingo@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>
Cc:	Dinakar Guniguntala <dino@...ibm.com>,
	Ankita Garg <ankita@...ibm.com>,
	Darren Hart <dvhltc@...ibm.com>,
	Sripathi Kodi <sripathi@...ibm.com>,
	lkml <linux-kernel@...r.kernel.org>
Subject: [BUG -rt] Priority inversion deadlock caused by condvars

	So we've been seeing application hangs with a heavily threaded (~8k
threads) realtime Java test. After a fair amount of debugging we found
that most of the SCHED_FIFO threads are blocked in futex_wait(). This
raised some alarm, since futex_wait() isn't priority-inheritance aware.

After seeing what was going on, Dino came up with a possible deadlock
case in the pthread_cond_wait() code.

The problem, as I understand it, assuming there is only one cpu, is
this: when a low priority thread is going to call pthread_cond_wait(),
it takes the associated PI mutex and calls the function. The glibc
implementation acquires the condvar's internal non-PI lock, releases the
PI mutex and tries to block on futex_wait().

However, if a medium priority cpu hog and a high priority thread start
up while the low priority thread holds the mutex, the low priority
thread will be boosted until it releases that mutex, but not long enough
for it to release the condvar's internal lock (since the internal lock
is not priority inherited).

Then the high priority thread will acquire the mutex and try to acquire
the condvar's internal lock (which is still held). However, since we
also have a medium prio cpu hog, it will keep the low priority thread
from running, and thus from releasing the internal lock.

And then we're deadlocked.

Thomas mentioned this is a known problem, but I wanted to send this
example out so maybe others might become aware.

The attached test illustrates this hang as described above when bound to
a single cpu. I believe it's correct, but these sorts of tests often have
their own bugs that create false positives, so please forgive me and let
me know if you see any problems. :)

Many thanks to Dino, Ankita and Sripathi for helping to sort out this
issue.

To run:
	./pthread_cond_hang               => will PASS (on SMP)
	taskset -c 0 ./pthread_cond_hang  => will HANG


thanks
-john

View attachment "pthread_cond_hang.c" of type "text/x-csrc" (3876 bytes)
