[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <175080335129.2280845.12285110458405652015@noble.neil.brown.name>
Date: Wed, 25 Jun 2025 08:15:51 +1000
From: "NeilBrown" <neil@...wn.name>
To: "Chuck Lever" <chuck.lever@...cle.com>
Cc: "Su Hui" <suhui@...china.com>, jlayton@...nel.org, okorniev@...hat.com,
Dai.Ngo@...cle.com, tom@...pey.com, linux-nfs@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-janitors@...r.kernel.org
Subject: Re: [PATCH] nfsd: Using guard() to simplify nfsd_cache_lookup()
On Wed, 25 Jun 2025, Chuck Lever wrote:
> On 6/23/25 8:19 PM, NeilBrown wrote:
> > On Mon, 23 Jun 2025, Su Hui wrote:
> >> Using guard() to replace *unlock* label. guard() makes lock/unlock code
> >> more clear. Change the order of the code to let all lock code in the
> >> same scope. No functional changes.
> >
> > While I agree that this code could usefully be cleaned up and that you
> > have made some improvements, I think the use of guard() is a nearly
> > insignificant part of the change. You could easily do exactly the same
> > patch without using guard() but having and explicit spin_unlock() before
> > the new return. That doesn't mean you shouldn't use guard(), but it
> > does mean that the comment explaining the change could be more usefully
> > focused on the "Change the order ..." part, and maybe explain what that
> > is important.
> >
> > I actually think there is room for other changes which would make the
> > code even better:
> > - Change nfsd_prune_bucket_locked() to nfsd_prune_bucket(). Have it
> > take the lock when needed, then drop it, then call
> > nfsd_cacherep_dispose() - and return the count.
> > - change nfsd_cache_insert to also skip updating the chain length stats
> > when it finds a match - in that case the "entries" isn't a chain
> > length. So just lru_put_end(), return. Have it return NULL if
> > no match was found
> > - after the found_entry label don't use nfsd_reply_cache_free_locked(),
> > just free rp. It has never been included in any rbtree or list, so it
> > doesn't need to be removed.
> > - I'd be tempted to have nfsd_cache_insert() take the spinlock itself
> > and call it under rcu_read_lock() - and use RCU to free the cached
> > items.
> > - put the chunk of code after the found_entry label into a separate
> > function and instead just return RC_REPLY (and maybe rename that
> > RC_CACHED). Then in nfsd_dispatch(), if RC_CACHED was returned, call
> > that function that has the found_entry code.
> >
> > I think that would make the code a lot easier to follow. Would you like
> > to have a go at that - I suspect it would be several patches - or shall
> > I do it?
>
> I'm going to counsel some caution.
>
> nfsd_cache_lookup() is a hot path. Source code readability, though
> important, is not the priority in this area.
>
> I'm happy to consider changes to this function, but the bottom line is
> patches need to be accompanied by data that show that proposed code
> modifications do not negatively impact performance. (Plus the usual
> test results that show no impact to correctness).
>
> That data might include:
> - flame graphs that show a decrease in CPU utilization
> - objdump output showing a smaller instruction cache footprint
> and/or short instruction path lengths
> - perf results showing better memory bandwidth
> - perf results showing better branch prediction
> - lockstat results showing less contention and/or shorter hold
> time on locks held in this path
>
> Macro benchmark results are also welcome: equal or lower latency for
> NFSv3 operations, and equal or higher I/O throughput.
>
> The benefit for the scoped_guard construct is that it might make it more
> difficult to add code that returns from this function with a lock held.
> However, so far that hasn't been an issue.
>
> Thus I'm not sure there's a lot of strong technical justification for
> modification of this code path. But, you might know of one -- if so,
> please make sure that appears in the patch descriptions.
>
> What is more interesting to me is trying out more sophisticated abstract
> data types for the DRC hashtable. rhashtable is one alternative; so is
> Maple tree, which is supposed to handle lookups with more memory
> bandwidth efficiency than walking a linked list.
>
While I generally like rhashtable there is an awkwardness. It doesn't
guarantee that an insert will always succeed. If you get lots of new
records that hash to the same value, it will start failing insert
requests until is hash re-hashed the table with a new seed. This is
intended to defeat collision attacks. That means we would need to drop
requests sometimes. Maybe that is OK. The DRC could be the target of
collision attacks so maybe we really do want to drop requests if
rhashtable refuses to store them.
I think the other area that could use improvement is pruning old entries.
I would not include RC_INPROG entries in the lru at all - they are
always ignored, and will be added when they are switched to RCU_DONE.
I'd generally like to prune less often in larger batches, but removing
each of the batch from the rbtree could hold the lock for longer than we
would like. I wonder if we could have an 'old' and a 'new' rbtree and
when the 'old' gets too old or the 'new' get too full, we extract 'old',
move 'new' to 'old', and outside the spinlock we free all of the moved
'old'.
But if we switched to rhashtable, we probably wouldn't need an lru -
just walk the entire table occasionally - there would be little conflict
with concurrent lookups.
But as you say, measuring would be useful. Hopefully the DRC lookup
would be small contribution to the total request time, so we would need
to measure just want happens in the code to compare different versions.
NeilBrown
> Anyway, have fun, get creative, and let's see what comes up.
>
>
> >> Signed-off-by: Su Hui <suhui@...china.com>
> >> ---
> >> fs/nfsd/nfscache.c | 99 ++++++++++++++++++++++------------------------
> >> 1 file changed, 48 insertions(+), 51 deletions(-)
> >>
> >> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> >> index ba9d326b3de6..2d92adf3e6b0 100644
> >> --- a/fs/nfsd/nfscache.c
> >> +++ b/fs/nfsd/nfscache.c
> >> @@ -489,7 +489,7 @@ int nfsd_cache_lookup(struct svc_rqst *rqstp, unsigned int start,
> >>
> >> if (type == RC_NOCACHE) {
> >> nfsd_stats_rc_nocache_inc(nn);
> >> - goto out;
> >> + return rtn;
> >> }
> >>
> >> csum = nfsd_cache_csum(&rqstp->rq_arg, start, len);
> >> @@ -500,64 +500,61 @@ int nfsd_cache_lookup(struct svc_rqst *rqstp, unsigned int start,
> >> */
> >> rp = nfsd_cacherep_alloc(rqstp, csum, nn);
> >> if (!rp)
> >> - goto out;
> >> + return rtn;
> >>
> >> b = nfsd_cache_bucket_find(rqstp->rq_xid, nn);
> >> - spin_lock(&b->cache_lock);
> >> - found = nfsd_cache_insert(b, rp, nn);
> >> - if (found != rp)
> >> - goto found_entry;
> >> - *cacherep = rp;
> >> - rp->c_state = RC_INPROG;
> >> - nfsd_prune_bucket_locked(nn, b, 3, &dispose);
> >> - spin_unlock(&b->cache_lock);
> >> + scoped_guard(spinlock, &b->cache_lock) {
> >> + found = nfsd_cache_insert(b, rp, nn);
> >> + if (found == rp) {
> >> + *cacherep = rp;
> >> + rp->c_state = RC_INPROG;
> >> + nfsd_prune_bucket_locked(nn, b, 3, &dispose);
> >> + goto out;
> >> + }
> >> + /* We found a matching entry which is either in progress or done. */
> >> + nfsd_reply_cache_free_locked(NULL, rp, nn);
> >> + nfsd_stats_rc_hits_inc(nn);
> >> + rtn = RC_DROPIT;
> >> + rp = found;
> >> +
> >> + /* Request being processed */
> >> + if (rp->c_state == RC_INPROG)
> >> + goto out_trace;
> >> +
> >> + /* From the hall of fame of impractical attacks:
> >> + * Is this a user who tries to snoop on the cache?
> >> + */
> >> + rtn = RC_DOIT;
> >> + if (!test_bit(RQ_SECURE, &rqstp->rq_flags) && rp->c_secure)
> >> + goto out_trace;
> >>
> >> + /* Compose RPC reply header */
> >> + switch (rp->c_type) {
> >> + case RC_NOCACHE:
> >> + break;
> >> + case RC_REPLSTAT:
> >> + xdr_stream_encode_be32(&rqstp->rq_res_stream, rp->c_replstat);
> >> + rtn = RC_REPLY;
> >> + break;
> >> + case RC_REPLBUFF:
> >> + if (!nfsd_cache_append(rqstp, &rp->c_replvec))
> >> + return rtn; /* should not happen */
> >> + rtn = RC_REPLY;
> >> + break;
> >> + default:
> >> + WARN_ONCE(1, "nfsd: bad repcache type %d\n", rp->c_type);
> >> + }
> >> +
> >> +out_trace:
> >> + trace_nfsd_drc_found(nn, rqstp, rtn);
> >> + return rtn;
> >> + }
> >> +out:
> >> nfsd_cacherep_dispose(&dispose);
> >>
> >> nfsd_stats_rc_misses_inc(nn);
> >> atomic_inc(&nn->num_drc_entries);
> >> nfsd_stats_drc_mem_usage_add(nn, sizeof(*rp));
> >> - goto out;
> >> -
> >> -found_entry:
> >> - /* We found a matching entry which is either in progress or done. */
> >> - nfsd_reply_cache_free_locked(NULL, rp, nn);
> >> - nfsd_stats_rc_hits_inc(nn);
> >> - rtn = RC_DROPIT;
> >> - rp = found;
> >> -
> >> - /* Request being processed */
> >> - if (rp->c_state == RC_INPROG)
> >> - goto out_trace;
> >> -
> >> - /* From the hall of fame of impractical attacks:
> >> - * Is this a user who tries to snoop on the cache? */
> >> - rtn = RC_DOIT;
> >> - if (!test_bit(RQ_SECURE, &rqstp->rq_flags) && rp->c_secure)
> >> - goto out_trace;
> >> -
> >> - /* Compose RPC reply header */
> >> - switch (rp->c_type) {
> >> - case RC_NOCACHE:
> >> - break;
> >> - case RC_REPLSTAT:
> >> - xdr_stream_encode_be32(&rqstp->rq_res_stream, rp->c_replstat);
> >> - rtn = RC_REPLY;
> >> - break;
> >> - case RC_REPLBUFF:
> >> - if (!nfsd_cache_append(rqstp, &rp->c_replvec))
> >> - goto out_unlock; /* should not happen */
> >> - rtn = RC_REPLY;
> >> - break;
> >> - default:
> >> - WARN_ONCE(1, "nfsd: bad repcache type %d\n", rp->c_type);
> >> - }
> >> -
> >> -out_trace:
> >> - trace_nfsd_drc_found(nn, rqstp, rtn);
> >> -out_unlock:
> >> - spin_unlock(&b->cache_lock);
> >> -out:
> >> return rtn;
> >> }
> >>
> >> --
> >> 2.30.2
> >>
> >>
> >
>
>
> --
> Chuck Lever
>
Powered by blists - more mailing lists