Message-ID: <57620b85-544c-433e-87c9-f1bbd4a5f9b2@huawei.com>
Date: Wed, 3 Sep 2025 19:40:20 +0800
From: Li Lingfeng <lilingfeng3@...wei.com>
To: "zhangjian (CG)" <zhangjian496@...wei.com>, Li Lingfeng
<lilingfeng@...weicloud.com>, Benjamin Coddington <bcodding@...hat.com>
CC: Jeff Layton <jlayton@...nel.org>, <chuck.lever@...cle.com>,
<neil@...wn.name>, <okorniev@...hat.com>, <Dai.Ngo@...cle.com>,
<tom@...pey.com>, <linux-nfs@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <yukuai1@...weicloud.com>,
<houtao1@...wei.com>, <yi.zhang@...wei.com>, <yangerkun@...wei.com>
Subject: Re: [PATCH] nfsd: remove long-standing revoked delegations by force
Hi,
On 2025/9/3 18:06, zhangjian (CG) wrote:
>
> On 2025/9/3 14:45, Li Lingfeng wrote:
>> Hi,
>>
>> On 2025/9/3 11:46, zhangjian (CG) wrote:
>>> Hello experts,
>>>
>>> If we observe that all delegations on a hard-mounted NFS client that are
>>> also on the server's cl_revoked list changed from
>>> NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED|
>>> NFS_DELEGATION_TEST_EXPIRED
>>> to NFS_DELEGATION_RETURN_IF_CLOSED|NFS_DELEGATION_REVOKED, can we form a
>>> hypothesis about this problem?
>>>
>>> By the way, this problem can be masked by decreasing the file count on
>>> the server.
>>>
>>> Thanks,
>>> zhangjian
>> I think NFS_DELEGATION_TEST_EXPIRED is cleared as follows:
>> nfs4_state_manager
>>  nfs4_do_reclaim
>>   nfs4_reclaim_open_state
>>    __nfs4_reclaim_open_state // get nfs4_state from sp->so_states
>>     nfs41_open_expired // status = ops->recover_open
>>      nfs41_check_delegation_stateid
>>       test_and_clear_bit // NFS_DELEGATION_TEST_EXPIRED
>> After the bug in [1] is triggered, although the delegation is no longer on
>> server->delegations, it can still be obtained by traversing sp->so_states.
>> However, I cannot find the connection between the number of files on the
>> server and this issue.
>>
>> Thanks,
>> Lingfeng
>>
> Thanks a lot.
>
> NFS_DELEGATION_TEST_EXPIRED can only be set when
> delegation->stateid.type != NFS4_INVALID_STATEID_TYPE. But when
> NFS_DELEGATION_REVOKED is set, delegation->stateid.type will be
> NFS4_INVALID_STATEID_TYPE in nfs_mark_delegation_revoked.
> This implies the order could be like:
> 1. Deleg A is on the server's cl_revoked list
> 2. Deleg B is marked NFS_DELEGATION_TEST_EXPIRED on the client
> 3. Deleg B is revoked by the server callback procedure and the server
>    hits [1]; Deleg B is added to the cl_revoked list
> 4. Deleg B is marked NFS_DELEGATION_REVOKED on the client
I think Deleg A was added to the server's cl_revoked list because of [1]. For
the file corresponding to Deleg B, no access conflict occurred, so no
delegation return was triggered. Unlike Deleg A, Deleg B therefore never
went through nfs4_delegreturn_done --> nfs_delegation_mark_returned -->
nfs_mark_delegation_revoked, so its stateid type was never set to
NFS4_INVALID_STATEID_TYPE and it could still be flagged with
NFS_DELEGATION_TEST_EXPIRED.
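
To make the ordering concrete, here is a minimal userspace sketch of the
flag transitions discussed above. It is not kernel code: the model_* names,
the MODEL_* bit positions and the struct layout are made up for
illustration, and it only models the two rules stated in this thread
(TEST_EXPIRED is set only while the stateid type is valid; revocation
invalidates the stateid type) plus the test-and-clear step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only. Bit positions and names are illustrative and are
 * not the kernel's definitions. */
enum {
    MODEL_RETURN_IF_CLOSED = 1u << 0,
    MODEL_REVOKED          = 1u << 1,
    MODEL_TEST_EXPIRED     = 1u << 2,
};

enum model_stateid_type { MODEL_STATEID_INVALID, MODEL_STATEID_DELEG };

struct model_deleg {
    unsigned int flags;
    enum model_stateid_type type;
};

/* TEST_EXPIRED is only set while the stateid type is still valid.
 * Returns true if the flag was set. */
static bool model_mark_test_expired(struct model_deleg *d)
{
    assert(d != NULL);
    if (d->type == MODEL_STATEID_INVALID)
        return false;
    d->flags |= MODEL_TEST_EXPIRED;
    return true;
}

/* Revocation sets REVOKED and invalidates the stateid type. */
static void model_mark_revoked(struct model_deleg *d)
{
    assert(d != NULL);
    d->flags |= MODEL_REVOKED;
    d->type = MODEL_STATEID_INVALID;
}

/* Models the test-and-clear of TEST_EXPIRED reached from open recovery.
 * Returns whether the flag was set before clearing. */
static bool model_test_and_clear_expired(struct model_deleg *d)
{
    assert(d != NULL);
    bool was_set = (d->flags & MODEL_TEST_EXPIRED) != 0;
    d->flags &= ~(unsigned int)MODEL_TEST_EXPIRED;
    return was_set;
}
```

In this model, a delegation that is marked TEST_EXPIRED first and revoked
afterwards carries all three flags, and drops to
RETURN_IF_CLOSED|REVOKED once the test-and-clear step runs. A delegation
that is revoked first can no longer pick up TEST_EXPIRED at all.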
> Why is the first Deleg A on the server's cl_revoked list? Is [1] the only
> condition? Why can this only happen when the file count is large?
> I have seen 700 delegations on the server but 400,000+ delegations on the
> client. Might this give some clue about the problem?
I'm afraid I cannot explain why there is such a significant discrepancy in
the number of delegations between the client and the server. I truly don't
know what is happening.
Thanks,
Lingfeng
>>> On 2025/9/2 20:43, Benjamin Coddington wrote:
>>>> On 2 Sep 2025, at 8:10, Li Lingfeng wrote:
>>>>
>>>>> Our expected outcome was that the client would release the abnormal
>>>>> delegation via TEST_STATEID/FREE_STATEID upon detecting its invalidity.
>>>>> However, this problematic delegation is no longer present in the
>>>>> client's server->delegations list, whether due to client-side timeouts
>>>>> or the server-side bug [1].
>>>> How does the client time out TEST_STATEID - are you mounting with 'soft'?
>>>>
>>>> We should find the server-side bug and fix it rather than write code to
>>>> paper over it. I do think the synchronization of state here is a bit
>>>> fragile and wish the protocol had a generation, sequence, or marker for
>>>> setting SEQ4_STATUS_ bits.
>>>>
>>>>>> Should we instead just administratively evict the client since it's
>>>>>> clearly not behaving right in this case?
>>>>> Thanks for the suggestion. While administratively evicting the client
>>>>> would certainly resolve the immediate delegation issue, I'm concerned
>>>>> that approach might be a bit heavy-handed.
>>>>> The problematic behavior seems isolated to a single delegation.
>>>>> Meanwhile, the client itself likely has numerous other open files and
>>>>> active state on the server. Forcing a complete client reconnect would
>>>>> tear down all that state, which could cause significant application
>>>>> disruption and be perceived as a service outage from the client's
>>>>> perspective.
>>>>>
>>>>> [1] https://lore.kernel.org/all/de669327-c93a-49e5-a53b-bda9e67d34a2@...wei.com/
>>>> ^^ in this thread you reference v5.10 - there was a knfsd fix for a
>>>> cl_revoked leak "3b816601e279", and there have been 3 or 4 fixes to fix
>>>> problems and optimize the client walk of delegations since then. Jeff
>>>> pointed out that there have been fixes in these areas. Are you finding
>>>> this problem still with all those fixes included?
>>>>
>>>> Ben
>>>>
>>>>
>