Message-ID: <20140625140005.GB26883@htj.dyndns.org>
Date: Wed, 25 Jun 2014 10:00:05 -0400
From: Tejun Heo <tj@...nel.org>
To: Austin Schuh <austin@...oton-tech.com>
Cc: Dave Chinner <david@...morbit.com>, xfs <xfs@....sgi.com>,
Thomas Gleixner <tglx@...utronix.de>,
linux-kernel@...r.kernel.org
Subject: Re: On-stack work item completion race? (was Re: XFS crash?)
Hello,
On Tue, Jun 24, 2014 at 08:05:07PM -0700, Austin Schuh wrote:
> > I can see no reason why manual completion would behave differently
> > from flush_work() in this case.
>
> I went looking for a short trace in my original log to show the problem,
> and instead found evidence of the second problem. I still like the shorter
> flush_work call, but that's not my call.
So, are you saying that the original issue you reported isn't actually
a problem? But didn't you imply that changing the waiting mechanism
fixed a deadlock, or was that a false positive?
> I did find this comment in the process_one_work function. Sounds like this
> could be better documented.
Yeah, we probably should beef up Documentation/workqueue.txt with
information on general usage.
> I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is
> returning 1 in sched_submit_work (kernel/sched/core.c). It looks
> like sched_submit_work is not detecting that the worker task is blocked on
> a mutex.
The function unplugs the block layer and doesn't have much to do with
workqueue although it has "_work" in its name.
> This looks very RT related right now. I see 2 problems from my reading
> (and experimentation). The first is that the second worker isn't getting
> started because tsk_is_pi_blocked is reporting that the task isn't blocked
> on a mutex. The second is that even if another worker needs to be
> scheduled because the original worker is blocked on a mutex, we need the
> pool lock to schedule another worker. The pool lock can be acquired by any
> CPU, and is a spin_lock. If we end up on the slow path for the pool lock,
> we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on))
> in task_blocks_on_rt_mutex in rtmutex.c. I'm not sure how to deal with
> either problem.
>
> Hopefully I've got all my facts right... Debugging kernel code is a whole
> new world from userspace code.
I don't have much idea how the RT kernel works either. Can you reproduce
the issues that you see on mainline?
Thanks.
--
tejun