Message-ID: <alpine.DEB.2.02.1401261500250.5335@chino.kir.corp.google.com>
Date:	Sun, 26 Jan 2014 15:04:15 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Oleg Nesterov <oleg@...hat.com>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org
Subject: [RFC] wait*() induced tasklist_lock starvation

Hi Oleg,

We've found that it's pretty easy to cause NMI watchdog timeouts due to
tasklist_lock starvation by issuing repeated wait4(), waitid(), or waitpid()
calls: each call takes the read side of the lock, and cascading calls from
multiple processes will starve anything in the fork() or exit() path that is
waiting on the write side with irqs disabled.
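
One plausible shape for such a testcase -- a guess for illustration, not the
reproducer used here -- is a set of sibling processes that do nothing but
issue wait*() calls (each call takes the read side of tasklist_lock) while
another process continuously forks and reaps short-lived children, which
needs the write side with irqs disabled:

/*
 * Hypothetical reproducer sketch; not the actual testcase from this report,
 * and NR_WAITERS is made up.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_WAITERS	64

int main(void)
{
	int i;

	/* Readers: every wait*() call takes read_lock(&tasklist_lock). */
	for (i = 0; i < NR_WAITERS; i++) {
		if (fork() == 0)
			for (;;)
				waitpid(-1, NULL, WNOHANG);
	}

	/* Writer load: fork() and exit() take write_lock_irq(&tasklist_lock). */
	for (;;) {
		pid_t child = fork();

		if (child == 0)
			_exit(0);
		if (child > 0)
			waitpid(child, NULL, 0);
	}
}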

The only way I've been able to remedy this problem is by serializing the
read-side acquisition of this lock with a spinlock used only by these
syscalls; otherwise my testcase will panic any machine that is configured,
as ours are, to panic on these NMI watchdog timeouts.
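
For illustration, here is a userspace analogue of the locking order the
patch below introduces (the names are invented, and pthread waiters sleep
rather than spin, so this only shows the serialization pattern, not the
starvation fix itself): every wait*()-style reader first takes an ordinary
lock, so readers never overlap and a writer only ever competes with one
reader at a time.

/*
 * Illustrative sketch only: "list_lock" stands in for tasklist_lock and
 * "wait_gate" for the wait_lock added by the patch below.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t wait_gate = PTHREAD_MUTEX_INITIALIZER;

/* wait*() path: serialize with other waiters before taking the read side. */
static void *waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&wait_gate);
	pthread_rwlock_rdlock(&list_lock);
	puts("waiter: scanning under the read side");
	pthread_rwlock_unlock(&list_lock);
	pthread_mutex_unlock(&wait_gate);
	return NULL;
}

/* fork()/exit() path: takes the write side. */
static void *forker(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&list_lock);
	puts("forker: updating under the write side");
	pthread_rwlock_unlock(&list_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, waiter, NULL);
	pthread_create(&b, NULL, forker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}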

Is there a less expensive way to do this?  Or is it just another case of
the tasklist_lock problems that need a major overhaul?
---
 kernel/exit.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -59,6 +59,14 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
 
+/*
+ * Ensure the wait family of syscalls -- wait4(), waitid(), and waitpid() --
+ * doesn't cascade in taking the read side of tasklist_lock, which would
+ * starve processes doing fork() or exit() and cause NMI watchdog timeouts
+ * with interrupts disabled.
+ */
+static DEFINE_SPINLOCK(wait_lock);
+
 static void exit_mm(struct task_struct * tsk);
 
 static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -1028,6 +1036,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 
 		get_task_struct(p);
 		read_unlock(&tasklist_lock);
+		spin_unlock(&wait_lock);
 		if ((exit_code & 0x7f) == 0) {
 			why = CLD_EXITED;
 			status = exit_code >> 8;
@@ -1112,6 +1121,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 	 * thread can reap it because we set its state to EXIT_DEAD.
 	 */
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);
 
 	retval = wo->wo_rusage
 		? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
@@ -1246,6 +1256,7 @@ unlock_sig:
 	pid = task_pid_vnr(p);
 	why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);
 
 	if (unlikely(wo->wo_flags & WNOWAIT))
 		return wait_noreap_copyout(wo, p, pid, uid, why, exit_code);
@@ -1308,6 +1319,7 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
 	pid = task_pid_vnr(p);
 	get_task_struct(p);
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);
 
 	if (!wo->wo_info) {
 		retval = wo->wo_rusage
@@ -1523,6 +1535,7 @@ repeat:
 		goto notask;
 
 	set_current_state(TASK_INTERRUPTIBLE);
+	spin_lock(&wait_lock);
 	read_lock(&tasklist_lock);
 	tsk = current;
 	do {
@@ -1538,6 +1551,7 @@ repeat:
 			break;
 	} while_each_thread(current, tsk);
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);
 
 notask:
 	retval = wo->notask_error;