Message-ID: <e8b4f46a-2c4b-43b3-bf82-dc5d8f6af171@oracle.com>
Date: Fri, 24 Jan 2025 10:31:55 -0500
From: Chuck Lever <chuck.lever@...cle.com>
To: Jeff Layton <jlayton@...nel.org>, Neil Brown <neilb@...e.de>,
        Olga Kornievskaia <okorniev@...hat.com>, Dai Ngo <Dai.Ngo@...cle.com>,
        Tom Talpey <tom@...pey.com>, "J. Bruce Fields" <bfields@...ldses.org>,
        Kinglong Mee <kinglongmee@...il.com>,
        Trond Myklebust <trondmy@...nel.org>, Anna Schumaker <anna@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>
Cc: linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org,
        netdev@...r.kernel.org
Subject: Re: [PATCH 2/8] nfsd: fix CB_SEQUENCE error handling of
 NFS4ERR_{BADSLOT,BADSESSION,SEQ_MISORDERED}

On 1/24/25 9:46 AM, Jeff Layton wrote:
> On Fri, 2025-01-24 at 09:32 -0500, Chuck Lever wrote:
>> On 1/23/25 3:25 PM, Jeff Layton wrote:
>>> The current error handling has some problems:
>>>
>>> BADSLOT and BADSESSION: don't release the slot before retrying the call
>>>
>>> SEQ_MISORDERED: does some sketchy resetting of the seqid? I can't find any
>>> recommendation about doing that in the spec, and it seems wrong.
>>
>> Random thought: You might use the Linux NFS client's forechannel session
>> implementation as a code reference.
>>
>>
>>> Handle all three errors the same way: release the slot, but then handle
>>> it just like we would as if we hadn't gotten a reply; mark the session
>>> as faulty, and retry the call.
>>
>> Some questions:
>>
>> Why does it matter whether NFSD keeps the slot if both sides plan to
>> destroy the session?
>>
> 
> It may not be required, but there is no reason to hold onto the slot in
> these cases.

In the BADSLOT case, if the slot is released, then another session
consumer on the NFS server can use it and will encounter the same error.
Best to keep it in the penalty box, IMO.

If there are other slots, they are likely still usable. An
implementation can choose to continue using the session rather than
scuttling it immediately. In the past, with a single backchannel slot,
NFSD had no choice but to replace the session. But now it can be more
conservative.
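
To make the "penalty box" idea concrete, here is a rough sketch of a
slot table that sets a slot aside after BADSLOT while the remaining
slots stay in service. Everything in it (struct cb_slot_table, the
cb_slot_* helpers, CB_MAX_SLOTS) is made up for illustration and is not
NFSD's actual data structure or API; it only assumes a session with at
most 32 backchannel slots tracked in bitmaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CB_MAX_SLOTS 32	/* hypothetical upper bound, one bit per slot */

/* Hypothetical per-session backchannel slot table (not NFSD's structures). */
struct cb_slot_table {
	uint32_t nr_slots;	/* slots negotiated for the session, <= CB_MAX_SLOTS */
	uint32_t busy;		/* bit n set: slot n is held by an in-flight callback */
	uint32_t quarantined;	/* bit n set: slot n drew BADSLOT, keep it out of use */
};

/* Grab the lowest slot that is neither busy nor quarantined; -1 if none. */
static int cb_slot_grab(struct cb_slot_table *t)
{
	for (uint32_t i = 0; i < t->nr_slots; i++) {
		uint32_t bit = UINT32_C(1) << i;

		if (!((t->busy | t->quarantined) & bit)) {
			t->busy |= bit;
			return (int)i;
		}
	}
	return -1;
}

/* Normal completion: the slot goes back into circulation. */
static void cb_slot_release(struct cb_slot_table *t, int slot)
{
	assert(slot >= 0 && (uint32_t)slot < t->nr_slots);
	t->busy &= ~(UINT32_C(1) << slot);
}

/* BADSLOT completion: drop the hold, but never hand the slot out again. */
static void cb_slot_quarantine(struct cb_slot_table *t, int slot)
{
	assert(slot >= 0 && (uint32_t)slot < t->nr_slots);
	uint32_t bit = UINT32_C(1) << slot;

	t->busy &= ~bit;
	t->quarantined |= bit;
}

/* The session is worth keeping while at least one slot is not quarantined. */
static bool cb_session_usable(const struct cb_slot_table *t)
{
	uint32_t all = (t->nr_slots >= CB_MAX_SLOTS) ?
		UINT32_MAX : ((UINT32_C(1) << t->nr_slots) - 1);

	return (t->quarantined & all) != all;
}
```

With a table like this, a BADSLOT reply would call cb_slot_quarantine()
rather than cb_slot_release(), and the session would only need
replacing once cb_session_usable() turns false.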


> Also, at this point, only nfsd has declared that it needs
> a new session (see below).

If the client's backchannel service has returned BADSESSION, then the
client already knows this session is unusable.


>> Also, AFAICT marking CB_FAULT does not destroy the session, it simply
>> tries to recreate backchannel's rpc_clnt. Perhaps NFSD's callback code
>> should actively destroy the session and let the client drive a fresh
>> CREATE_SESSION to recover?
>>
> 
> Marking it with a fault just sets the cl_cb_state to NFSD4_CB_FAULT.
> Then, on the next SEQUENCE call, that makes nfsd set
> SEQ4_STATUS_BACKCHANNEL_FAULT, which should make the client recreate
> the session. Obviously, there is some delay involved there since we
> might have to wait for the client to do a lease renewal before this
> happens.
> 
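
For readers following along, the signalling path described above can be
modelled in a few lines. This is only a toy model, not the NFSD code:
struct client_model, enum cb_state and the model_* helpers are invented
names, and the flag value is the one I read in RFC 8881 for the SEQUENCE
status flags, so check it against the spec before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* SEQUENCE reply status flag; value as listed in RFC 8881 (verify before use). */
#define SEQ4_STATUS_BACKCHANNEL_FAULT 0x00000400u

/* Hypothetical stand-ins for the callback state; only the idea echoes NFSD. */
enum cb_state { CB_UP, CB_FAULT };

struct client_model {
	enum cb_state cb_state;
};

/* Callback completion path: record that the backchannel is suspect. */
static void model_mark_cb_fault(struct client_model *clp)
{
	assert(clp);
	clp->cb_state = CB_FAULT;
}

/* Forechannel SEQUENCE processing: compute status flags for the reply. */
static uint32_t model_sequence_status_flags(const struct client_model *clp)
{
	assert(clp);
	uint32_t flags = 0;

	/* The fault only reaches the client when it next sends SEQUENCE. */
	if (clp->cb_state == CB_FAULT)
		flags |= SEQ4_STATUS_BACKCHANNEL_FAULT;
	return flags;
}
```

The model shows where the delay comes from: nothing is sent to the
client when the fault is marked, so recovery starts only after the
client's next forechannel SEQUENCE, which may be as late as its next
lease renewal.
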
>>
>>> Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
>>> Signed-off-by: Jeff Layton <jlayton@...nel.org>
>>> ---
>>>    fs/nfsd/nfs4callback.c | 27 +++++++++++----------------
>>>    1 file changed, 11 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
>>> index e12205ef16ca932ffbcc86d67b0817aec2436c89..bfc9de1fcb67b4f05ed2f7a28038cd8290809c17 100644
>>> --- a/fs/nfsd/nfs4callback.c
>>> +++ b/fs/nfsd/nfs4callback.c
>>> @@ -1371,17 +1371,24 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
>>>    		nfsd4_mark_cb_fault(cb->cb_clp);
>>>    		ret = false;
>>>    		break;
>>> +	case -NFS4ERR_BADSESSION:
>>> +	case -NFS4ERR_BADSLOT:
>>> +	case -NFS4ERR_SEQ_MISORDERED:
>>> +		/*
>>> +		 * These errors indicate that something has gone wrong
>>> +		 * with the server and client's synchronization. Release
>>> +		 * the slot, but handle it as if we hadn't gotten a reply.
>>> +		 */
>>> +		nfsd41_cb_release_slot(cb);
>>> +		fallthrough;
>>>    	case 1:
>>>    		/*
>>>    		 * cb_seq_status remains 1 if an RPC Reply was never
>>>    		 * received. NFSD can't know if the client processed
>>>    		 * the CB_SEQUENCE operation. Ask the client to send a
>>> -		 * DESTROY_SESSION to recover.
>>> +		 * DESTROY_SESSION to recover, but keep the slot.
>>>    		 */
>>> -		fallthrough;
>>> -	case -NFS4ERR_BADSESSION:
>>>    		nfsd4_mark_cb_fault(cb->cb_clp);
>>> -		ret = false;
>>>    		goto need_restart;
>>>    	case -NFS4ERR_DELAY:
>>>    		cb->cb_seq_status = 1;
>>> @@ -1390,14 +1397,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
>>>    
>>>    		rpc_delay(task, 2 * HZ);
>>>    		return false;
>>> -	case -NFS4ERR_BADSLOT:
>>> -		goto retry_nowait;
>>> -	case -NFS4ERR_SEQ_MISORDERED:
>>> -		if (session->se_cb_seq_nr[cb->cb_held_slot] != 1) {
>>> -			session->se_cb_seq_nr[cb->cb_held_slot] = 1;
>>> -			goto retry_nowait;
>>> -		}
>>> -		break;
>>>    	default:
>>>    		nfsd4_mark_cb_fault(cb->cb_clp);
>>>    	}
>>> @@ -1405,10 +1404,6 @@ static bool nfsd4_cb_sequence_done(struct rpc_task *task, struct nfsd4_callback
>>>    	nfsd41_cb_release_slot(cb);
>>>    out:
>>>    	return ret;
>>> -retry_nowait:
>>> -	if (rpc_restart_call_prepare(task))
>>> -		ret = false;
>>> -	goto out;
>>>    need_restart:
>>>    	if (!test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) {
>>>    		trace_nfsd_cb_restart(clp, cb);
>>>
>>
>>
> 


-- 
Chuck Lever