Message-ID: <ea472081-a522-4c5a-8658-909793f460c8@talpey.com>
Date: Fri, 11 Jul 2025 13:20:56 -0400
From: Tom Talpey <tom@...pey.com>
To: Chuck Lever <chuck.lever@...cle.com>, NeilBrown <neil@...wn.name>
Cc: Su Hui <suhui@...china.com>, jlayton@...nel.org, okorniev@...hat.com,
 Dai.Ngo@...cle.com, linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org,
 kernel-janitors@...r.kernel.org
Subject: Re: [PATCH] nfsd: Using guard() to simplify nfsd_cache_lookup()

On 7/11/2025 10:22 AM, Chuck Lever wrote:
> On 6/24/25 6:15 PM, NeilBrown wrote:
>> On Wed, 25 Jun 2025, Chuck Lever wrote:
> 
>>> What is more interesting to me is trying out more sophisticated abstract
>>> data types for the DRC hashtable. rhashtable is one alternative; so is
>>> Maple tree, which is supposed to handle lookups with more memory
>>> bandwidth efficiency than walking a linked list.
>>>
>>
>> While I generally like rhashtable there is an awkwardness.  It doesn't
>> guarantee that an insert will always succeed.  If you get lots of new
>> records that hash to the same value, it will start failing insert
>> requests until it has re-hashed the table with a new seed.
> 
> Hm. I hadn't thought of that.
> 
> 
>> This is
>> intended to defeat collision attacks.  That means we would need to drop
>> requests sometimes.  Maybe that is OK.  The DRC could be the target of
>> collision attacks so maybe we really do want to drop requests if
>> rhashtable refuses to store them.
> 
> Well, I can imagine that, in a large cohort of clients, there is a pretty good
> probability of non-malicious XID collisions due to the birthday paradox.
> 
> 
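A side note on the scale involved: with 32-bit XIDs chosen uniformly at
random, the birthday bound puts a non-malicious collision within easy reach
of a busy DRC. A quick back-of-the-envelope sketch (illustrative numbers,
not measurements):

```python
import math

def xid_collision_probability(n_entries, space=2**32):
    """Approximate birthday-bound probability that at least two of
    n_entries uniformly random 32-bit XIDs collide."""
    return 1.0 - math.exp(-n_entries * (n_entries - 1) / (2.0 * space))

# With ~100k cached entries, at least one XID collision is more likely
# than not; by ~1M entries it is essentially certain.
print(f"{xid_collision_probability(100_000):.3f}")   # -> 0.688
print(f"{xid_collision_probability(1_000_000):.3f}") # -> 1.000
```

(The DRC also matches on more than the XID - client address, procedure,
and a payload checksum - so an XID collision alone does not produce a
false retransmit hit, but the lookup still has to disambiguate it.)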
>> I think the other area that could use improvement is pruning old entries.
>> I would not include RC_INPROG entries in the lru at all - they are
>> always ignored, and will be added when they are switched to RC_DONE.
> 
> That sounds intriguing.
> 
> 
>> I'd generally like to prune less often in larger batches, but removing
>> each of the batch from the rbtree could hold the lock for longer than we
>> would like.
> 
> Have a look at 8847ecc9274a ("NFSD: Optimize DRC bucket pruning").
> Pruning frequently by small amounts seems to have the greatest benefit.
> 
> It certainly does keep request latency jitter down, since NFSD prunes
> while the client is waiting. If we can move some management of the cache
> until after the reply is sent, that might offer opportunities to prune
> more aggressively without impacting server responsiveness.
> 
> 
>> I wonder if we could have an 'old' and a 'new' rbtree and
>> when the 'old' gets too old or the 'new' get too full, we extract 'old',
>> move 'new' to 'old', and outside the spinlock we free all of the moved
>> 'old'.
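Neil's two-tree scheme can be sketched in miniature. This is an
illustrative model only (Python dicts standing in for rbtrees, a mutex
standing in for the spinlock; all names are made up):

```python
import threading

class TwoGenCache:
    """Sketch of the 'old'/'new' double-buffer pruning idea: inserts go
    to 'new', lookups consult both generations, and when 'new' fills up
    the whole 'old' generation is detached under the lock and freed
    outside it -- only pointer swaps happen while the lock is held."""

    def __init__(self, max_new=1024):
        self.lock = threading.Lock()
        self.old = {}            # stand-in for the 'old' rbtree
        self.new = {}            # stand-in for the 'new' rbtree
        self.max_new = max_new

    def lookup(self, key):
        with self.lock:
            hit = self.new.get(key)
            return hit if hit is not None else self.old.get(key)

    def insert(self, key, value):
        detached = None
        with self.lock:
            self.new[key] = value
            if len(self.new) >= self.max_new:
                # Extract 'old', promote 'new' -- O(1) under the lock.
                detached, self.old, self.new = self.old, self.new, {}
        if detached is not None:
            detached.clear()     # bulk free happens outside the lock
```

The cost is that an entry's lifetime becomes coarse (anywhere between one
and two generations), which matches the "too old or too full" trigger
described above.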
> 
> One observation I've had is that nearly every DRC lookup will fail to
> find an entry that matches the XID, because when things are operating
> smoothly, every incoming RPC contains an XID that hasn't been seen
> before.
> 
> That means DRC lookups are walking the entire bucket in the common
> case. Pointer chasing of any kind is a well-known ADT performance
> killer. My experience with the kernel's r-b tree is that it does not
> perform well due to the number of memory accesses needed for lookups.
> 
> This is why I suggested using rhashtable -- it makes an effort to keep
> bucket sizes small by widening the table frequently. The downside is
> that this will definitely introduce some latency when an insertion
> triggers a table-size change.
> 
> What might be helpful is a per-bucket Bloom filter that would make
> checking if an XID is in the hashed bucket an O(1) operation -- and
> in particular, would require few, if any, pointer dereferences.
> 
> 
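A per-bucket filter along those lines might look like the following toy
sketch (the bit width and hash mixing are arbitrary choices for
illustration, not a proposal for the real data structure):

```python
class BucketBloom:
    """Toy per-bucket Bloom filter: one small bitmask per hash bucket,
    so the common miss case ("this XID has never been seen") is answered
    from a single word without walking the bucket's chain.  A false
    positive merely falls back to the ordinary walk."""

    def __init__(self, bits=64, k=2):
        self.mask = 0
        self.bits = bits
        self.k = k

    def _positions(self, xid):
        # Derive k bit positions from the XID; any cheap mixing works.
        for i in range(self.k):
            yield (xid * 0x9E3779B1 + i * 0x85EBCA6B) % self.bits

    def add(self, xid):
        for pos in self._positions(xid):
            self.mask |= 1 << pos

    def might_contain(self, xid):
        # False means definitely absent: skip the bucket walk entirely.
        return all(self.mask & (1 << pos) for pos in self._positions(xid))
```

One wrinkle: a plain Bloom filter cannot delete, and the DRC evicts
entries constantly, so a per-bucket counting filter or a mask rebuild
during pruning would be needed in practice.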
>> But if we switched to rhashtable, we probably wouldn't need an lru -
>> just walk the entire table occasionally - there would be little conflict
>> with concurrent lookups.
> 
> When the DRC is at capacity, pruning needs to find something to evict
> on every insertion. My thought is that a pruning walk would need to be
> done quite frequently to ensure clients don't overrun the cache. Thus
> attention needs to be paid to keep pruning efficient (although perhaps
> an LRU isn't the only choice here).

As a matter of fact, LRU is a *bad* choice for DRC eviction. It will
evict the very entries that are most important! The newest entries come
from clients that are working properly; the oldest ones come from
clients that are still attempting to retry.

This is pretty subtle, and it may not be a simple thing to implement.
But at a high level, evicting entries from a client that is regularly
issuing new requests is a much safer strategy. Looking at the age of
the requests themselves, without considering their source, is much more
risky.
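One way to express that preference in code (purely illustrative; the
entry layout and names are invented for the example): rank eviction
candidates by how recently their *client* has been heard from, evicting
from the most recently active client first, and only then by entry age.

```python
def pick_victim(entries, client_last_seen):
    """Illustrative victim selection: an entry whose client has just
    issued new requests is safe to evict (that client has moved on),
    while an entry whose client has gone quiet may still be retried.

    entries:          list of (client_id, entry_timestamp) tuples
    client_last_seen: dict of client_id -> time of last request
    """
    # Most recently active client wins; among that client's entries,
    # prefer the oldest one.
    return max(entries,
               key=lambda e: (client_last_seen.get(e[0], 0.0), -e[1]))
```

So with one client still retrying (last heard from long ago) and one
actively issuing new requests, this picks the active client's oldest
entry rather than the globally oldest entry, which is exactly the
inversion of LRU argued for above.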

Tom.
