linux-kernel - RE: [RFC] sunrpc: Fix race between work-queue and rpc_killall

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Fri, 8 Jul 2011 11:11:37 -0700
From:	"Myklebust, Trond" <Trond.Myklebust@...app.com>
To:	"Ben Greear" <greearb@...delatech.com>
Cc:	<linux-nfs@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: RE: [RFC] sunrpc:  Fix race between work-queue and rpc_killall_tasks.

> -----Original Message-----
> From: Ben Greear [mailto:greearb@...delatech.com]
> Sent: Friday, July 08, 2011 1:19 PM
> To: Myklebust, Trond
> Cc: linux-nfs@...r.kernel.org; linux-kernel@...r.kernel.org
> Subject: Re: [RFC] sunrpc: Fix race between work-queue and
> rpc_killall_tasks.
> 
> On 07/06/2011 04:45 PM, Trond Myklebust wrote:
> > On Wed, 2011-07-06 at 15:49 -0700, greearb@...delatech.com wrote:
> >> From: Ben Greear<greearb@...delatech.com>
> >>
> >> The rpc_killall_tasks logic is not locked against
> >> the work-queue thread, but it still directly modifies
> >> function pointers and data in the task objects.
> >>
> >> This patch changes the killall-tasks logic to set a flag
> >> that tells the work-queue thread to terminate the task
> >> instead of directly calling the terminate logic.
> >>
> >> Signed-off-by: Ben Greear<greearb@...delatech.com>
> >> ---
> >>
> >> NOTE:  This needs review, as I am still struggling to understand
> >> the rpc code, and it's quite possible this patch either doesn't
> >> fully fix the problem or actually causes other issues.  That said,
> >> my nfs stress test seems to run a bit more stable with this patch
> applied.
> >
> > Yes, but I don't see why you are adding a new flag, nor do I see why
> we
> > want to keep checking for that flag in the rpc_execute() loop.
> > rpc_killall_tasks() is not a frequent operation that we want to
> optimise
> > for.
> >
> > How about the following instead?
> 
> Ok, I looked at your patch closer.  I think it can still cause
> bad race conditions.
> 
> For instance:
> 
> Assume that tk_callback is NULL at beginning of while loop in
> __rpc_execute,
> and tk_action is rpc_exit_task.
> 
> While do_action(task) is being called, tk_action is set to NULL in
> rpc_exit_task.
> 
> But, right after tk_action is set to NULL in rpc_exit_task, the
> rpc_killall_tasks
> method calls rpc_exit, which sets tk_action back to rpc_exit_task.
> 
> I believe this could cause the xprt_release(task) logic to be called in
> the
> work-queue's execution of rpc_exit_task due to tk_action != NULL when
> it should not be.

Why would this be a problem? xprt_release() can certainly be called multiple times on an rpc_task. Ditto rpbc_getport_done.

The only thing that is not re-entrant there is rpcb_map_release, which should only ever be called once whether or not something calls rpc_killall_tasks.

 
> I have no hard evidence this exact scenario is happening in my case,
> but I
> believe the code is still racy with your patch.
> 
> For that matter, is it safe to modify the flags in rpc_killall_tasks:
> 
> rovr->tk_flags |= RPC_TASK_KILLED;
> 
> Is that guaranteed to be atomic with any other modification of flags?

Task->tk_flags should never change after the rpc_task is set up. The only allowed change is the RPC_TASK_KILLED. We could convert that into an atomic bit in task->tk_runstate, but again, this isn't something that is likely to be responsible for the problem you are seeing.