linux-kernel - Re: [PATCH 1/8] nfsd: don't restart v4.1+ callback when RPC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac5834d4-1465-4dde-a451-b0804c537f04@oracle.com>
Date: Sun, 26 Jan 2025 11:41:44 -0500
From: Chuck Lever <chuck.lever@...cle.com>
To: Jeff Layton <jlayton@...nel.org>, NeilBrown <neilb@...e.de>
Cc: Olga Kornievskaia <okorniev@...hat.com>, Dai Ngo <Dai.Ngo@...cle.com>,
        Tom Talpey <tom@...pey.com>, "J. Bruce Fields" <bfields@...ldses.org>,
        Kinglong Mee <kinglongmee@...il.com>,
        Trond Myklebust <trondmy@...nel.org>, Anna Schumaker <anna@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
        linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org,
        netdev@...r.kernel.org
Subject: Re: [PATCH 1/8] nfsd: don't restart v4.1+ callback when RPC_SIGNALLED
 is set

On 1/26/25 6:18 AM, Jeff Layton wrote:
> On Sun, 2025-01-26 at 10:01 +1100, NeilBrown wrote:
>> On Fri, 24 Jan 2025, Jeff Layton wrote:
>>> This is problematic, since the RPC might have been entirely successful.
>>> There is no point in restarting a v4.1+ callback just because
>>> RPC_SIGNALLED is true. The v4.1+ error handling has other mechanisms for
>>> detecting when it should retransmit the RPC.
>>
>> But why might RPC_SIGNALLED() ever be true?
>> The flag RPC_TASK_SIGNALLED is only ever set by rpc_signal_task() which
>> is only called from rpc_killall_tasks() and __rpc_execute() for
>> non-async tasks which doesn't apply to nfsd callbacks as they are
>> started with rpc_call_async().
>>
>> rpc_killall_tasks() is called by fs/nfs/ which isn't relevant for us,
>> and from rpc_shutdown_client().  In those cases we certainly don't want
>> the request to be retried, though the nfsd4_process_cb_update() case is
>> a little interesting as we do want it to be retried, but in a different
>> client.
>>
>> So the code you are removing is either dead code because something else
>> will prevent the restart when a client is being shut down, or it is bad
>> code because it would delay rpc_shutdown_client() while the request is
>> retried.
>>
>> I haven't dug the extra step to figure out which, but either way I think
>> the code should go.
> 
> Thanks. That was my analysis too.

Agreed, this code is problematic, but it appears to me that some of
these problems are not resolved by simply removing the signal check.


> rpc_shutdown_client() is called when we tear down and rebuild the
> rpc_client. nfsd does this in setup_callback_client(), which gets
> called from nfsd4_process_cb_update() (basically when we detect that
> the backchannel is having problems).
> 
> There are really only two states: We either got a reply from the server
> before the client went down, or we didn't. In the case where we got a
> reply, there is no need to retry anything. In the case where we didn't,
> the tk_status will be '1', so there is no need to check RPC_SIGNALLED()
> at all here.

Fwiw, the "cb_seq_status == 1" arm skips the signal check in the current
code.


> The existing code could lead to the call being retried when we had
> already gotten a perfectly valid reply.

Here's a case-by-case audit:

  - NFS_OK: SEQUENCE was decoded and passed sanity checks. So this should
    not ever requeue in here. It might be requeued during subsequent
    processing.

  - ESERVERFAULT: SEQUENCE was decoded but failed sanity checking. The
    reply should be dropped now, and the session marked FAULT. No requeue
    is ever needed here.

    [ I question whether the sequence number should be bumped in this
      case -- the client's callback server replied with the identity of
      some other slot. And anyway, this session is about to become
      toast. ]

  - 1: The timeout case. We want a fresh session here, so it falls
    through to BADSESSION.

  - NFS4ERR_BADSESSION: This needs a requeue so that the operation can
    be retried with a fresh session. But it should always check if the
    rpc_clnt is shutting down before doing so. This is a problem in the
    current code.

  - NFS4ERR_DELAY: Skips the signal check, but shouldn't. If the rpc_clnt
    is shutting down, this RPC should not be requeued.

  - NFS4ERR_BAD_SLOT: Skips the signal check, but shouldn't. I need to
    think more about BAD_SLOT recovery best practice.

  - NFS4ERR_SEQ_MISORDERED: Does one retry with a seq_nr of 1. It
    probably should terminate if that fails. IMO this should check for
    rpc_clnt shutdown before requeuing the retry.

  - default: I don't think this case should ever be requeued, but it
    appears that it could be if the rpc_clnt is shutting down.


>>> Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
>>> Signed-off-by: Jeff Layton <jlayton@...nel.org>
>>> ---
>>>   fs/nfsd/nfs4callback.c | 3 ---
>>>   1 file changed, 3 deletions(-)
>>>
>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
>>> index 50e468bdb8d4838b5217346dcc2bd0fec1765c1a..e12205ef16ca932ffbcc86d67b0817aec2436c89 100644
>>> --- a/fs/nfsd/nfs4callback.c
>>> +++ b/fs/nfsd/nfs4callback.c
>>> @@ -1403,9 +1403,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
>>>   	}
>>>   	trace_nfsd_cb_free_slot(task, cb);
>>>   	nfsd41_cb_release_slot(cb);
>>> -
>>> -	if (RPC_SIGNALLED(task))
>>> -		goto need_restart;
>>>   out:
>>>   	return ret;
>>>   retry_nowait:
>>>
>>> -- 
>>> 2.48.1
>>>
>>>
>>
> 


-- 
Chuck Lever