Message-ID: <20150206171156.GA8942@axis.com>
Date: Fri, 6 Feb 2015 18:11:56 +0100
From: Rabin Vincent <rabin.vincent@...s.com>
To: Tejun Heo <tj@...nel.org>
Cc: Jesper Nilsson <jespern@...s.com>, linux-kernel@...r.kernel.org
Subject: Soft lockups in __cancel_work_timer()
Hi Tejun,
The comment at the top of try_to_grab_pending() says that the ENOENT
state "may persist for arbitrarily long".
If two threads call cancel_delayed_work_sync() on the same work item at
the same time, one of them can end up spinning in the retry loop in
__cancel_work_timer() without ever sleeping in the flush_work() inside it.
The looping thread will see -ENOENT from try_to_grab_pending() because
the work is being cancelled, and it will see a false return from
flush_work()/start_flush_work() because there is no worker executing the
work.
task1                   task2                   worker

                                                add to busy_hash
                                                clear pending
                                                exec work

__cancel_work_timer()
 try_to_grab_pending()
  get pending, return 0
 set work cancelling
 flush_work()
  wait_for_completion()

                                                remove from busy_hash

                        __cancel_work_timer()
                         while (forever) {
                          try_to_grab_pending()
                           return -ENOENT due to cancelling
                          flush_work()
                           return false due to already gone
                         }
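For reference, the loop task2 spins in looks like this in current
mainline (paraphrased from __cancel_work_timer() in kernel/workqueue.c;
the explanatory comment about the -ENOENT case is mine):

static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
{
	unsigned long flags;
	int ret;

	do {
		ret = try_to_grab_pending(work, is_dwork, &flags);
		/*
		 * -ENOENT means someone else is cancelling.  flush_work()
		 * is supposed to wait for the same event the canceller is
		 * waiting for, but once the worker has removed the work
		 * from busy_hash there is nothing to wait on, so it
		 * returns false immediately and we go around again
		 * without ever sleeping.
		 */
		if (unlikely(ret == -ENOENT))
			flush_work(work);
	} while (unlikely(ret < 0));
	...
}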
On kernels with CONFIG_PREEMPT disabled, this causes a permanent soft
lockup of the system.
Adding a cond_resched() to the loop in __cancel_work_timer(), as in the
patch below, helps the simple !PREEMPT case, but does not completely fix
the problem: the lockup can still be seen if the thread which sees the
ENOENT runs with, for example, the SCHED_FIFO scheduling class, since
cond_resched() will then never switch away to the normal-priority thread
which must run to clear the cancelling state.  That variant hits both on
systems with CONFIG_PREEMPT enabled and on !PREEMPT systems with the
cond_resched() applied.
We've seen this problem with the cancel_delayed_work_sync() call in
jffs2_sync_fs(), but I've attached a testing patch which forces the
problem on current mainline without needing jffs2.
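For discussion, here is a rough standalone sketch of one way to provoke
the race.  This is not the attached patch, and the module and symbol
names in it are made up; it just has two threads repeatedly cancel the
same self-rearming delayed work so that one of them keeps hitting the
cancelling window:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/kthread.h>

static struct delayed_work race_dwork;

/* Re-arm on every execution so there is always something to cancel. */
static void race_work_fn(struct work_struct *work)
{
	schedule_delayed_work(&race_dwork, 1);
}

static int race_cancel_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/*
		 * Two of these threads race here: one marks the work
		 * cancelling, and the other can then spin on -ENOENT
		 * in __cancel_work_timer().
		 */
		cancel_delayed_work_sync(&race_dwork);
		schedule_delayed_work(&race_dwork, 1);
	}
	return 0;
}

/* No cleanup path; this module is only meant to wedge the machine. */
static int __init race_init(void)
{
	INIT_DELAYED_WORK(&race_dwork, race_work_fn);
	schedule_delayed_work(&race_dwork, 1);
	kthread_run(race_cancel_fn, NULL, "race-cancel-0");
	kthread_run(race_cancel_fn, NULL, "race-cancel-1");
	return 0;
}
module_init(race_init);

MODULE_LICENSE("GPL");

Giving one of the cancelling threads SCHED_FIFO priority (e.g. via
sched_setscheduler()) should make the hang reproducible even with the
cond_resched() from the patch below applied.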
Suggestions?
Thanks.
/Rabin
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index beeeac9..904289f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2741,6 +2741,9 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
 		 */
 		if (unlikely(ret == -ENOENT))
 			flush_work(work);
+
+		if (ret < 0)
+			cond_resched();
 	} while (unlikely(ret < 0));
 
 	/* tell other tasks trying to grab @work to back off */
[Attachment: force-wq-lockup.patch (text/x-diff, 3668 bytes)]