Message-ID: <20130531050316.GC7720@mtj.dyndns.org>
Date: Fri, 31 May 2013 14:03:16 +0900
From: Tejun Heo <tj@...nel.org>
To: "weiqi@...inos.com.cn" <weiqi@...inos.com.cn>
Cc: torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: race condition in schedule_on_each_cpu()
On Fri, May 31, 2013 at 12:07:15PM +0800, weiqi@...inos.com.cn wrote:
>
> >the only way for them to get stuck is if there aren't enough execution
> >resources (ie. if a new thread can't be created) but OOM killers would
> >have been activated if that were the case.
>
> The following is a detailed description of our scenario:
>
> 1. after turning off the disk array, the ps output is shown
> in *ps*, which indicates that kworker/1:0 and kworker/1:2 are stuck
>
> 2. the call stack for the kworkers are shown in *stack_xxx.txt*
>
> 3. the workqueue operations during that period are shown in
> *out.txt*, captured with ftrace
> (we added a new trace point /workqueue_queue_work_insert/,
> immediately before insert_wq_barrier in the function
> start_flush_work; its implementation is shown in
> *trace_insert_wq_barrier.txt*)
> from the results in *grep_kwork1:0_from_out.txt*, we can see:
> kworker/1:0 is stuck after starting the work
> /fc_starget_delete/ at time 360.801271, and the
> insert_wq_barrier trace_info appears right after this
>
>
> 4. from out.txt, we can see that altogether three
> /fc_starget_delete/ works were enqueued.
> after the point of deadlock, kworker/1:1 and kworker/1:3 are
> executing ...
>
>
> 5. if we let scsi_transport_fc use only one worker thread,
> i.e., change scsi_transport_fc.c : fc_host_setup()
> alloc_workqueue(fc_host->work_q_name, 0, 0) to
> alloc_workqueue(fc_host->work_q_name, WQ_UNBOUND, 1)
>
> alloc_workqueue(fc_host->devloss_work_q_name, 0, 0) to
> alloc_workqueue(fc_host->devloss_work_q_name, WQ_UNBOUND, 1)
>
> the deadlock won't occur.
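[Editor's sketch of the change in item 5 above. The `fc_host->work_q` assignments and their placement inside fc_host_setup() are assumptions based on the snippet quoted in the mail; the exact surrounding code varies by kernel version.]

```c
/* drivers/scsi/scsi_transport_fc.c : fc_host_setup() -- illustrative.
 * Forcing a single, CPU-unbound concurrent work item serializes the
 * fc_host works, so several fc_starget_delete works can no longer
 * pile up on (and deplete) one CPU's worker pool. */

/* before: bound (per-CPU) workqueue, default max_active */
fc_host->work_q = alloc_workqueue(fc_host->work_q_name, 0, 0);

/* after: unbound workqueue, at most one work executing at a time */
fc_host->work_q = alloc_workqueue(fc_host->work_q_name,
                                  WQ_UNBOUND, 1);
```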
> >Can you please test a recent kernel? How easily can you reproduce the
> >issue?
> >
> it occurs every time we hot-remove the disk array.
>
> I'll test a recent kernel after a while, but this problem in 3.0.30
> really confuses me
Yeah, it definitely sounds like concurrency depletion. There have
been some fixes and substantial changes in the area, so I really wanna
find out whether the problem is reproducible in a recent vanilla
kernel - say, v3.9 or, even better, v3.10-rc2. Can you please try to
reproduce the problem with a newer kernel?
> by the way, I'm wondering what the race condition was before,
> which doesn't exist now
Before the commit you originally quoted, the calling thread could be
preempted and migrated to another CPU before get_online_cpus() thus
ending up executing the function twice on the new cpu but skipping the
old one.
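[Editor's note: for reference, the post-fix flow looks roughly like the
sketch below. It is simplified from kernel/workqueue.c of that era; exact
helper names and error handling are hedged, not authoritative.]

```c
/* Sketch of schedule_on_each_cpu() after the fix Tejun describes.
 * get_online_cpus() is taken up front, so the calling thread cannot
 * observe a CPU set that changes under it.  Before the fix, the caller
 * could be preempted and migrated before this point, ending up running
 * func twice on its new CPU while skipping its old one. */
int schedule_on_each_cpu(work_func_t func)
{
	int cpu;
	struct work_struct __percpu *works;

	works = alloc_percpu(struct work_struct);
	if (!works)
		return -ENOMEM;

	get_online_cpus();		/* pin the online-CPU set */

	for_each_online_cpu(cpu) {
		struct work_struct *work = per_cpu_ptr(works, cpu);

		INIT_WORK(work, func);
		schedule_work_on(cpu, work);	/* queue on that CPU */
	}

	for_each_online_cpu(cpu)
		flush_work(per_cpu_ptr(works, cpu));	/* wait for each */

	put_online_cpus();
	free_percpu(works);
	return 0;
}
```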
Thanks.
--
tejun