[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070107125603.GA74@tv-sign.ru>
Date: Sun, 7 Jan 2007 15:56:03 +0300
From: Oleg Nesterov <oleg@...sign.ru>
To: Srivatsa Vaddagiri <vatsa@...ibm.com>
Cc: Andrew Morton <akpm@...l.org>, David Howells <dhowells@...hat.com>,
Christoph Hellwig <hch@...radead.org>,
Ingo Molnar <mingo@...e.hu>,
Linus Torvalds <torvalds@...l.org>,
linux-kernel@...r.kernel.org, Gautham shenoy <ego@...ibm.com>
Subject: Re: [PATCH] fix-flush_workqueue-vs-cpu_dead-race-update
On 01/07, Srivatsa Vaddagiri wrote:
>
> On Sat, Jan 06, 2007 at 08:34:16PM +0300, Oleg Nesterov wrote:
> > I suspect this can't help either.
> >
> > The problem is that flush_workqueue() may be called while cpu hotplug event
> > in progress and CPU_DEAD waits for kthread_stop(), so we have the same dead
> > lock if work->func() does flush_workqueue(). This means that Andrew's change
> > to use preempt_disable() is good and anyway needed.
>
> Well ..a lock_cpu_hotplug() in run_workqueue() and support for recursive
> calls to lock_cpu_hotplug() by the same thread will avoid the problem
> you mention.
Srivatsa, I'm completely new to cpu-hotplug, so please correct me if I'm
wrong (in fact I _hope_ I am wrong) but as I see it, the hotplug/workqueue
interaction is broken by design, it can't be fixed by changing just locking.
Once again. CPU dies, CPU_DEAD calls kthread_stop() and sleeps until
cwq->thread exits. To do so, this thread must at least complete the
currently running work->func().
work->func() calls flush_workque(WQ), it does lock_cpu_hotplug() or
_whatever_. Now the question, does it block?
if YES:
This is what the stable tree does - deadlock.
if NOT:
This is what we have with Andrew's s/mutex_lock/preempt_disable/
patch - race or deadlock, we have a choice.
Suppose that WQ has pending works on that dead CPU. Note that
at this point this CPU does not present on cpu_online_map.
This means that (without other changes) we have lost.
- flush_workque(WQ) can't return until CPU_DEAD transfers
these works to some another CPU on the cpu_online_map.
- CPU_DEAD can't do take_over_work() untill flush_workque()
returns.
Andrew, Ingo, this also means that freezer can't solve this particular
problem either (if i am right).
Thoughts?
Oleg.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists