Message-ID: <CAGXJAmydvaiY+0RNXLU-hdh1tYcTvUrvcuxWZTxsHbmWeTRSxw@mail.gmail.com>
Date: Mon, 1 Sep 2025 13:10:55 -0700
From: John Ousterhout <ouster@...stanford.edu>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org
Subject: Re: [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c
On Tue, Aug 26, 2025 at 4:31 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * homa_rpc_reap() - Invoked to release resources associated with dead
> > + * RPCs for a given socket.
> > + * @hsk: Homa socket that may contain dead RPCs. Must not be locked by the
> > + * caller; this function will lock and release.
> > + * @reap_all: False means do a small chunk of work; there may still be
> > + * unreaped RPCs on return. True means reap all dead RPCs for
> > + * hsk. Will busy-wait if reaping has been disabled for some RPCs.
> > + *
> > + * Return: A return value of 0 means that we ran out of work to do; calling
> > + * again will do no work (there could be unreaped RPCs, but if so,
> > + * they cannot currently be reaped). A value greater than zero means
> > + * there is still more reaping work to be done.
> > + */
> > +int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
> > +{
> > + /* RPC Reaping Strategy:
> > + *
> > + * (Note: there are references to this comment elsewhere in the
> > + * Homa code)
> > + *
> > + * Most of the cost of reaping comes from freeing sk_buffs; this can be
> > + * quite expensive for RPCs with long messages.
> > + *
> > + * The natural time to reap is when homa_rpc_end is invoked to
> > + * terminate an RPC, but this doesn't work for two reasons. First,
> > + * there may be outstanding references to the RPC; it cannot be reaped
> > + * until all of those references have been released. Second, reaping
> > + * is potentially expensive and RPC termination could occur in
> > + * homa_softirq when there are short messages waiting to be processed.
> > + * Taking time to reap a long RPC could result in significant delays
> > + * for subsequent short RPCs.
> > + *
> > + * Thus Homa doesn't reap immediately in homa_rpc_end. Instead, dead
> > + * RPCs are queued up and reaping occurs in this function, which is
> > + * invoked later when it is less likely to impact latency. The
> > + * challenge is to do this so that (a) we don't allow large numbers of
> > + * dead RPCs to accumulate and (b) we minimize the impact of reaping
> > + * on latency.
> > + *
> > + * The primary place where homa_rpc_reap is invoked is when threads
> > + * are waiting for incoming messages. The thread has nothing else to
> > + * do (it may even be polling for input), so reaping can be performed
> > + * with no latency impact on the application. However, if a machine
> > + * is overloaded then it may never wait, so this mechanism isn't always
> > + * sufficient.
> > + *
> > + * Homa now reaps in two other places, if reaping while waiting for
> > + * messages isn't adequate:
> > + * 1. If too many dead skbs accumulate, then homa_timer will call
> > + * homa_rpc_reap.
> > + * 2. If this timer thread cannot keep up with all the reaping to be
> > + * done then as a last resort homa_dispatch_pkts will reap in small
> > + * increments (a few sk_buffs or RPCs) for every incoming batch
> > + * of packets. This is undesirable because it will impact Homa's
> > + * performance.
> > + *
> > + * During the introduction of homa_pools for managing input
> > + * buffers, freeing of packets for incoming messages was moved to
> > + * homa_copy_to_user under the assumption that this code wouldn't be
> > + * on the critical path. However, there is evidence that with
> > + * fast networks (e.g. 100 Gbps) copying to user space is the
> > + * bottleneck for incoming messages, and packet freeing takes about
> > + * 20-25% of the total time in homa_copy_to_user. So, it may eventually
> > + * be desirable to move packet freeing out of homa_copy_to_user.
>
> See skb_attempt_defer_free()
I wasn't previously aware of this. It looks useful, but unfortunately
its symbol isn't currently EXPORTed so Homa can't use it. I submitted
a patch to export that symbol, but that patch was rejected because the
patch didn't also include a use of the symbol.
I'm going to wait until this series is accepted, then submit a smaller
patch that adds the EXPORT and uses it in Homa (or maybe I'll wait
until I upstream Homa's GRO support, as Eric suggested).
> > + */
> > +#define BATCH_MAX 20
> > + struct homa_rpc *rpcs[BATCH_MAX];
> > + struct sk_buff *skbs[BATCH_MAX];
>
> A lot of bytes on the stack, and quite a large batch. You should probably
> decrease it.
I have reduced the batch size to 10. Note also that this is a
"near-leaf" function, so it should be safe for it to have a larger
footprint than Homa functions that invoke the IP/driver stack, which
presumably takes a lot of stack space.
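
For readers following the thread, here is a minimal userspace sketch of the bounded-batch idea discussed above. It is not the Homa code: struct dead_rpc, struct dead_list, dead_list_push() and reap() are hypothetical names, and the split into a "detach" phase and a "free" phase is an assumption made for illustration. Only BATCH_MAX (with the reduced value of 10) and the meaning of the return value come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH_MAX 10	/* reduced batch size mentioned in the reply */

/* Hypothetical stand-in for a dead RPC; not the real struct homa_rpc. */
struct dead_rpc {
	struct dead_rpc *next;
};

/* Hypothetical stand-in for a socket's list of dead RPCs. */
struct dead_list {
	struct dead_rpc *head;
	int count;
};

/* Queue an RPC for later reaping instead of freeing it right away. */
static void dead_list_push(struct dead_list *dl, struct dead_rpc *rpc)
{
	rpc->next = dl->head;
	dl->head = rpc;
	dl->count++;
}

/*
 * Reap at most BATCH_MAX entries per pass. With reap_all false, one
 * pass is made; with reap_all true, passes repeat until the list is
 * empty. Returns 0 when no work remains, > 0 otherwise.
 */
static int reap(struct dead_list *dl, int reap_all)
{
	struct dead_rpc *batch[BATCH_MAX];
	int n, i;

	do {
		/* Phase 1 (assumed): detach a bounded batch from the list. */
		n = 0;
		while (n < BATCH_MAX && dl->head) {
			batch[n++] = dl->head;
			dl->head = dl->head->next;
			dl->count--;
		}

		/* Phase 2 (assumed): do the expensive freeing afterwards. */
		for (i = 0; i < n; i++)
			free(batch[i]);
	} while (reap_all && dl->head);

	assert(dl->count >= 0);
	return dl->count;
}
```

On a 64-bit machine, two on-stack arrays of 20 pointers occupy 320 bytes; halving the batch to 10 brings that down to 160 bytes, at the cost of more passes when many RPCs are dead.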
> Also, the need for yet another tx free strategy on top of the several
> existing caches still feels suspect.
I wasn't able to identify an existing cache mechanism that could meet
Homa's needs (and given the association Homa introduces between skb's
and RPCs, which are Homa-specific, it seems unlikely that any existing
mechanism would work for Homa). But, if you have something in mind
that you think might work for Homa, let me know and I'll take a look.
> > + homa_sock_wakeup_wmem(hsk);
>
> Here num_rpcs can be zero, and you can have spurious wake-ups.
I agree that num_rpcs can be zero, but homa_sock_wakeup_wmem won't
actually perform a wakeup unless (a) there are tasks waiting and (b)
there is available memory. So I don't see how there can be a spurious
wakeup. Is there something I'm missing?
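
To make the two conditions concrete, here is a small userspace sketch of a guarded wakeup. It only illustrates the logic described above and is not the body of homa_sock_wakeup_wmem: struct sock_sketch, its fields, and wakeup_wmem_sketch() are all hypothetical, and the "memory available" test is assumed to be a simple comparison against a limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket state; none of these fields are Homa's. */
struct sock_sketch {
	int waiters;		/* tasks blocked waiting for tx memory */
	long wmem_used;		/* tx memory currently in use */
	long wmem_limit;	/* assumed cap on tx memory */
	int wakeups;		/* wakeups issued, counted for illustration */
};

/*
 * Issue a wakeup only if (a) some task is waiting and (b) memory is
 * available. Returns true if a wakeup was issued.
 */
static bool wakeup_wmem_sketch(struct sock_sketch *sk)
{
	assert(sk != NULL);

	if (sk->waiters == 0)
		return false;	/* (a) fails: nobody to wake */
	if (sk->wmem_used >= sk->wmem_limit)
		return false;	/* (b) fails: still no memory */

	sk->wakeups++;
	return true;
}
```

Under these assumptions, a call made when nothing was reaped is harmless: either no task is waiting, or memory is still exhausted and the waiter stays asleep.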
> > +static inline void homa_rpc_hold(struct homa_rpc *rpc)
> > +{
> > + atomic_inc(&rpc->refs);
>
> `refs` should be a refcount_t, since it is used as such.
Done.
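
As background on why a dedicated reference-count type helps, here is a userspace analogy written with C11 atomics. It is not the kernel's refcount_t and not the revised Homa code: struct refcount_sketch and the ref_*() helpers are hypothetical. The checked increment is written in the spirit of the kernel's refcount_inc_not_zero(), which refuses to take a reference on an object whose count has already reached zero; a plain atomic increment has no such guard.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace counter; not the kernel's refcount_t. */
struct refcount_sketch {
	atomic_int refs;
};

static void ref_set(struct refcount_sketch *r, int n)
{
	atomic_store(&r->refs, n);
}

/*
 * Take a reference unless the count is already zero (object is dying)
 * or would overflow. Returns true if the reference was taken.
 */
static bool ref_inc_checked(struct refcount_sketch *r)
{
	int old = atomic_load(&r->refs);

	do {
		if (old <= 0 || old == INT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
	return true;
}

/* Drop a reference. Returns true if this was the last one. */
static bool ref_dec_and_test(struct refcount_sketch *r)
{
	int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);	/* dropping below zero is a caller bug */
	return old == 1;
}
```

The point of the sketch is the two checks: a bare atomic counter silently allows an increment from zero (a use-after-free in waiting) and a decrement below zero, while a reference-count type can detect both.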
-John-