linux-kernel - Re: nfsd oops on Linus' current tree.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0EC8763B847DB24D9ADF5EBD9CD7B4191259F54D@SACEXCMBX02-PRD.hq.netapp.com>
Date:	Thu, 3 Jan 2013 20:27:20 +0000
From:	"Adamson, Dros" <Weston.Adamson@...app.com>
To:	"J. Bruce Fields" <bfields@...ldses.org>
CC:	"Adamson, Dros" <Weston.Adamson@...app.com>,
	"Myklebust, Trond" <Trond.Myklebust@...app.com>,
	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
	Tejun Heo <tj@...nel.org>
Subject: Re: nfsd oops on Linus' current tree.


On Jan 3, 2013, at 3:11 PM, J. Bruce Fields <bfields@...ldses.org> wrote:

> On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
>> Hey, sorry for the late response, I've been on vacation.
>> 
>> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@...ldses.org>
>> wrote:
>> 
>>> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
>>>> Please reread what I said. There was no obvious circular
>>>> dependency, because nfsiod and rpciod are separate workqueues, both
>>>> created with WQ_MEM_RECLAIM.
>>> 
>>> Oh, sorry, I read "rpciod" as "nfsiod"!
>>> 
>>>> Dros' experience shows, however that a call to rpc_shutdown_client
>>>> in an nfsiod work item will deadlock with rpciod if the RPC task's
>>>> work item has been assigned to the same CPU as the one running the
>>>> rpc_shutdown_client work item.
>>>> 
>>>> I can't tell right now if that is intentional (in which case the
>>>> WARN_ON in the rpc code is correct), or if it is a bug in the
>>>> workqueue code. For now, we're assuming the former.
>>> 
>>> Well, Documentation/workqueue.txt says:
>>> 
>>> 	"Each wq with WQ_MEM_RECLAIM set has an execution context
>>> 	reserved for it.  If there is dependency among multiple work
>>> 	items used during memory reclaim, they should be queued to
>>> 	separate wq each with WQ_MEM_RECLAIM."
>> 
>> The deadlock we were seeing was:
>> 
>> - task A gets queued on rpciod workqueue and assigned kworker-0:0 -
>> task B gets queued on rpciod workqueue and assigned the same kworker
>> (kworker-0:0) - task A gets run, calls rpc_shutdown_client(), which
>> will loop forever waiting for task B to run rpc_async_release() - task
>> B will never run rpc_async_release() - it can't run until kworker-0:0
>> is free, which won't happen until task A (rpc_shutdown_client) is done
>> 
>> The same deadlock happened when we tried queuing the tasks on a
>> different workqueues -- queue_work() assigns the task to a kworker
>> thread and it's luck of the draw if it's the same kworker as task A.
>> We tried the different workqueue options, but nothing changed this
>> behavior.
>> 
>> Once a work struct is queued, there is no way to back out of the
>> deadlock.  From kernel/workqueue.c:insert_wq_barrier comment:
>> 
>> * Currently, a queued barrier can't be canceled.  This is because *
>> try_to_grab_pending() can't determine whether the work to be *
>> grabbed is at the head of the queue and thus can't clear LINKED *
>> flag of the previous work while there must be a valid next work *
>> after a work with LINKED flag set.
>> 
>> So once a work struct is queued and there is an ordering dependency
>> (i.e. task A is before task B), there is no way to back task B out -
>> so we can't just call cancel_work() or something on task B in
>> rpc_shutdown_client.
>> 
>> The root of our issue is that rpc_shutdown_client is never safe to
>> call from a workqueue context - it loops until there are no more
>> tasks, marking tasks as killed and waiting for them to be cleaned up
>> in each task's own workqueue context. Any tasks that have already been
>> assigned to the same kworker thread will never have a chance to run
>> this cleanup stage.
>> 
>> When fixing this deadlock, Trond and I discussed changing how
>> rpc_shutdown_client works (making it workqueue safe), but Trond felt
>> that it'd be better to just not call it from a workqueue context and
>> print a warning if it is.
>> 
>> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no
>> success), but I'd argue that even if that did work it would still be
>> very easy to call rpc_shutdown_client from the wrong context and MUCH
>> harder to detect it.  It's also unclear to me if setting rpciod
>> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...
> 
> Both rpciod and nfsiod already set WQ_MEM_RECLAIM.

Heh, oh yeah :)

> 
> But, right, looking at kernel/workqueue.c, it seems that the dedicated
> "rescuer" threads are invoked only in the case when work is stalled
> because a new worker thread isn't allocated quickly enough.
> 
> So, what to do that's simplest enough that it would work for
> post-rc2/stable?  I was happy having just a simple dedicated
> thread--these are only started when nfsd is, so there's no real thread
> proliferation problem.

That should work fine. The client went this way and spawns a new kthread before calling rpc_shutdown_client() from a workqueue context.  This should happen very infrequently.

-dros

> 
> I'll go quietly weep for a little while and then think about it some
> more....



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/