linux-kernel - Re: nfsd oops on Linus' current tree.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <0EC8763B847DB24D9ADF5EBD9CD7B4191259E4A2@SACEXCMBX02-PRD.hq.netapp.com>
Date:	Thu, 3 Jan 2013 16:28:37 +0000
From:	"Adamson, Dros" <Weston.Adamson@...app.com>
To:	"J. Bruce Fields" <bfields@...ldses.org>
CC:	"Myklebust, Trond" <Trond.Myklebust@...app.com>,
	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
	"Adamson, Dros" <Weston.Adamson@...app.com>,
	Tejun Heo <tj@...nel.org>
Subject: Re: nfsd oops on Linus' current tree.

Hey, sorry for the late response, I've been on vacation.

On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <bfields@...ldses.org> wrote:

> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
>> Please reread what I said. There was no obvious circular dependency,
>> because nfsiod and rpciod are separate workqueues, both created with
>> WQ_MEM_RECLAIM.
> 
> Oh, sorry, I read "rpciod" as "nfsiod"!
> 
>> Dros' experience shows, however that a call to
>> rpc_shutdown_client in an nfsiod work item will deadlock with rpciod
>> if the RPC task's work item has been assigned to the same CPU as the
>> one running the rpc_shutdown_client work item.
>> 
>> I can't tell right now if that is intentional (in which case the
>> WARN_ON in the rpc code is correct), or if it is a bug in the
>> workqueue code. For now, we're assuming the former.
> 
> Well, Documentation/workqueue.txt says:
> 
> 	"Each wq with WQ_MEM_RECLAIM set has an execution context
> 	reserved for it.  If there is dependency among multiple work
> 	items used during memory reclaim, they should be queued to
> 	separate wq each with WQ_MEM_RECLAIM."

The deadlock we were seeing was:

- task A gets queued on rpciod workqueue and assigned kworker-0:0
- task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
- task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
- task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done

The same deadlock happened when we tried queuing the tasks on a different workqueues -- queue_work() assigns the task to a kworker thread and it's luck of the draw if it's the same kworker as task A.  We tried the different workqueue options, but nothing changed this behavior.

Once a work struct is queued, there is no way to back out of the deadlock.  From kernel/workqueue.c:insert_wq_barrier comment:

 * Currently, a queued barrier can't be canceled.  This is because
 * try_to_grab_pending() can't determine whether the work to be
 * grabbed is at the head of the queue and thus can't clear LINKED
 * flag of the previous work while there must be a valid next work
 * after a work with LINKED flag set.

So once a work struct is queued and there is an ordering dependency (i.e. task A is before task B), there is no way to back task B out - so we can't just call cancel_work() or something on task B in rpc_shutdown_client.

The root of our issue is that rpc_shutdown_client is never safe to call from a workqueue context - it loops until there are no more tasks, marking tasks as killed and waiting for them to be cleaned up in each task's own workqueue context. Any tasks that have already been assigned to the same kworker thread will never have a chance to run this cleanup stage.

When fixing this deadlock, Trond and I discussed changing how rpc_shutdown_client works (making it workqueue safe), but Trond felt that it'd be better to just not call it from a workqueue context and print a warning if it is.

IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no success), but I'd argue that even if that did work it would still be very easy to call rpc_shutdown_client from the wrong context and MUCH harder to detect it.  It's also unclear to me if setting rpciod workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...

-dros

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/