[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <39ef1522-9fad-45f8-9c73-ffba7b1f04d0@oracle.com>
Date: Fri, 11 Jul 2025 10:22:56 -0400
From: Chuck Lever <chuck.lever@...cle.com>
To: NeilBrown <neil@...wn.name>
Cc: Su Hui <suhui@...china.com>, jlayton@...nel.org, okorniev@...hat.com,
Dai.Ngo@...cle.com, tom@...pey.com, linux-nfs@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-janitors@...r.kernel.org
Subject: Re: [PATCH] nfsd: Using guard() to simplify nfsd_cache_lookup()
On 6/24/25 6:15 PM, NeilBrown wrote:
> On Wed, 25 Jun 2025, Chuck Lever wrote:
>> What is more interesting to me is trying out more sophisticated abstract
>> data types for the DRC hashtable. rhashtable is one alternative; so is
>> Maple tree, which is supposed to handle lookups with more memory
>> bandwidth efficiency than walking a linked list.
>>
>
> While I generally like rhashtable there is an awkwardness. It doesn't
> guarantee that an insert will always succeed. If you get lots of new
> records that hash to the same value, it will start failing insert
> requests until is hash re-hashed the table with a new seed.
Hm. I hadn't thought of that.
> This is
> intended to defeat collision attacks. That means we would need to drop
> requests sometimes. Maybe that is OK. The DRC could be the target of
> collision attacks so maybe we really do want to drop requests if
> rhashtable refuses to store them.
Well I can imagine, in a large cohort of clients, there is a pretty good
probability of non-malicious XID collisions due to the birthday paradox.
> I think the other area that could use improvement is pruning old entries.
> I would not include RC_INPROG entries in the lru at all - they are
> always ignored, and will be added when they are switched to RCU_DONE.
That sounds intriguing.
> I'd generally like to prune less often in larger batches, but removing
> each of the batch from the rbtree could hold the lock for longer than we
> would like.
Have a look at 8847ecc9274a ("NFSD: Optimize DRC bucket pruning").
Pruning frequently by small amounts seems to have the greatest benefit.
It certainly does keep request latency jitter down, since NFSD prunes
while the client is waiting. If we can move some management of the cache
until after the reply is sent, that might offer opportunities to prune
more aggressively without impacting server responsiveness.
> I wonder if we could have an 'old' and a 'new' rbtree and
> when the 'old' gets too old or the 'new' get too full, we extract 'old',
> move 'new' to 'old', and outside the spinlock we free all of the moved
> 'old'.
One observation I've had is that nearly every DRC lookup will fail to
find an entry that matches the XID, because when things are operating
smoothly, every incoming RPC contains an XID that hasn't been seen
before.
That means DRC lookups are walking the entire bucket in the common
case. Pointer chasing of any kind is a well-known ADT performance
killer. My experience with the kernel's r-b tree is that is does not
perform well due to the number of memory accesses needed for lookups.
This is why I suggested using rhashtable -- it makes an effort to keep
bucket sizes small by widening the table frequently. The downside is
that this will definitely introduce some latency when an insertion
triggers a table-size change.
What might be helpful is a per-bucket Bloom filter that would make
checking if an XID is in the hashed bucket an O(1) operation -- and
in particular, would require few, if any, pointer dereferences.
> But if we switched to rhashtable, we probably wouldn't need an lru -
> just walk the entire table occasionally - there would be little conflict
> with concurrent lookups.
When the DRC is at capacity, pruning needs to find something to evict
on every insertion. My thought is that a pruning walk would need to be
done quite frequently to ensure clients don't overrun the cache. Thus
attention needs to be paid to keep pruning efficient (although perhaps
an LRU isn't the only choice here).
--
Chuck Lever
Powered by blists - more mailing lists