Message-ID: <CAGXJAmw95dDUxUFNa7UjV3XRd66vQRByAP5T_zra6KWdavr2Pg@mail.gmail.com>
Date: Fri, 24 Jan 2025 15:53:55 -0800
From: John Ousterhout <ouster@...stanford.edu>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org
Subject: Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
On Thu, Jan 23, 2025 at 4:06 AM Paolo Abeni <pabeni@...hat.com> wrote:
...
> > +        pool->descriptors = kmalloc_array(pool->num_bpages,
> > +                                          sizeof(struct homa_bpage),
> > +                                          GFP_ATOMIC);
>
> Possibly worth adding '| __GFP_ZERO' and avoiding zeroing some fields later.
I prefer to do all the initialization explicitly (this makes it
totally clear that a zero value is intended, as opposed to accidental
omission of an initializer). If you still think I should use
__GFP_ZERO, let me know and I'll add it.
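For what it's worth, the tradeoff looks roughly like this (just a sketch;
the field names follow the homa_bpage definition quoted further down):

        /* With __GFP_ZERO everything starts out zeroed: */
        pool->descriptors = kmalloc_array(pool->num_bpages,
                                          sizeof(struct homa_bpage),
                                          GFP_ATOMIC | __GFP_ZERO);

        /* With explicit initialization every zero (or non-zero) value
         * is visibly intentional:
         */
        pool->descriptors = kmalloc_array(pool->num_bpages,
                                          sizeof(struct homa_bpage),
                                          GFP_ATOMIC);
        if (!pool->descriptors)
                return -ENOMEM;
        for (i = 0; i < pool->num_bpages; i++) {
                struct homa_bpage *bp = &pool->descriptors[i];

                spin_lock_init(&bp->lock);
                atomic_set(&bp->refs, 0);
                bp->owner = -1;
                bp->expiration = 0;
        }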
> > +
> > +        /* Allocate and initialize core-specific data. */
> > +        pool->cores = kmalloc_array(nr_cpu_ids, sizeof(struct homa_pool_core),
> > +                                    GFP_ATOMIC);
>
> Uhm... on large systems this could be an order-3 allocation, which in
> turn could fail quite easily under memory pressure, and it looks
> contradictory WRT the cover letter statement about reducing the
> amount of per-socket state.
>
> Why don't you use alloc_percpu_gfp() here?
I have now switched to alloc_percpu_gfp. On the issue of per-socket
memory requirements, Homa doesn't significantly reduce the amount of
memory allocated for any given socket. Its memory savings come about
because a single Homa socket can be used to communicate with any
number of peers simultaneously, whereas TCP requires a separate socket
for each peer-to-peer connection. I have added a bit more to the cover
letter to clarify this.
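For reference, the converted allocation is roughly this (a sketch, not
the exact diff):

        pool->cores = alloc_percpu_gfp(struct homa_pool_core, GFP_ATOMIC);
        if (!pool->cores)
                return -ENOMEM;

        /* Per-core access (a spinlock is held, so the core can't change): */
        struct homa_pool_core *core = this_cpu_ptr(pool->cores);

        /* ... and on teardown: */
        free_percpu(pool->cores);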
> > +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
> > +                        int set_owner)
> > +{
> > +        int core_num = raw_smp_processor_id();
>
> Why the 'raw' variant? If this code is pre-emptible it means another
> process could be scheduled on the same core...
My understanding is that raw_smp_processor_id is faster.
homa_pool_get_pages is invoked with a spinlock held, so there is no
risk of a core switch while it is executing. Is there some other
problem I have missed?
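To spell out the assumption (the lock name here is illustrative, not the
actual field):

        spin_lock_bh(&pool->lock);
        /* While the lock is held this thread cannot migrate to another
         * core, so the result below stays valid; the raw variant simply
         * skips the preemption sanity check that smp_processor_id()
         * performs in debug builds.
         */
        core_num = raw_smp_processor_id();
        ...
        spin_unlock_bh(&pool->lock);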
> > +
> > +        cur = core->next_candidate;
> > +        core->next_candidate++;
>
> ... here, making this increment racy.
Because this code always runs in atomic mode, I don't believe there is
any danger of racing: no other thread can run on the same core
concurrently.
> > +        if (cur >= limit) {
> > +                core->next_candidate = 0;
> > +
> > +                /* Must recompute the limit for each new loop through
> > +                 * the bpage array: we may need to consider a larger
> > +                 * range of pages because of concurrent allocations.
> > +                 */
> > +                limit = 0;
> > +                continue;
> > +        }
> > +        bpage = &pool->descriptors[cur];
> > +
> > +        /* Figure out whether this candidate is free (or can be
> > +         * stolen). Do a quick check without locking the page, and
> > +         * if the page looks promising, then lock it and check again
> > +         * (must check again in case someone else snuck in and
> > +         * grabbed the page).
> > +         */
> > +        ref_count = atomic_read(&bpage->refs);
> > +        if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
> > +                                                  bpage->expiration > now)))
>
> The above conditions could be place in separate helper, making the code
> more easy to follow and avoiding some duplications.
Done; I've created a new function homa_bpage_available.
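It ends up roughly this shape, derived from the condition quoted above
(sketch):

static bool homa_bpage_available(struct homa_bpage *bpage, __u64 now)
{
        int ref_count = atomic_read(&bpage->refs);

        /* Available if completely free, or if the only reference is an
         * owner whose grace period has expired.
         */
        return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
                                  bpage->expiration <= now);
}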
> > +        /* First allocate any full bpages that are needed. */
> > +        full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
> > +        if (unlikely(full_pages)) {
> > +                if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
>
> full_pages must be less than HOMA_MAX_BPAGES, but I don't see any check
> limiting the incoming message length?!?
Oops, good catch. There was a check in the outbound path, but not in
the inbound path. I have added one now (in homa_message_in_init in
homa_incoming.c).
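The new check is along these lines (a sketch; the exact limit expression
may differ):

        /* In homa_message_in_init: reject messages too large to fit in
         * HOMA_MAX_BPAGES bpages.
         */
        if (length > (HOMA_MAX_BPAGES << HOMA_BPAGE_SHIFT))
                return -EINVAL;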
> > +
> > +        /* We get here if there wasn't enough buffer space for this
> > +         * message; add the RPC to hsk->waiting_for_bufs.
> > +         */
> > +out_of_space:
> > +        homa_sock_lock(pool->hsk, "homa_pool_allocate");
>
> There is some chicken-egg issue, with homa_sock_lock() being defined
> only later in the series, but it looks like the string argument is never
> used.
Right: in normal usage this argument is ignored. It exists because
there are occasionally deadlocks involving socket locks; when that
happens I temporarily add code to homa_sock_lock that uses this
argument to help track them down. I'd prefer to keep it, even though
it isn't normally used, because otherwise when a new deadlock arises
I'd have to modify every call to homa_sock_lock in order to add the
information back in again. I added a few more words to the comment for
homa_sock_lock to make this more clear.
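For context, the shape of homa_sock_lock is roughly this (sketch only);
the @locker string is consulted only by debugging code that gets added
temporarily:

static inline void homa_sock_lock(struct homa_sock *hsk, const char *locker)
{
        if (!spin_trylock_bh(&hsk->lock)) {
                /* When tracking down a deadlock, code that records
                 * @locker can be inserted here temporarily.
                 */
                spin_lock_bh(&hsk->lock);
        }
}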
> > +        if (!homa_rpc_try_lock(rpc, "homa_pool_check_waiting")) {
> > +                /* Can't just spin on the RPC lock because we're
> > +                 * holding the socket lock (see sync.txt). Instead,
>
> Stray reference to sync.txt. It would be nice to have the locking scheme
> described start to finish somewhere in this series.
sync.txt will be part of the next revision of this series.
> > +struct homa_bpage {
> > +        union {
> > +                /**
> > +                 * @cache_line: Ensures that each homa_bpage object
> > +                 * is exactly one cache line long.
> > +                 */
> > +                char cache_line[L1_CACHE_BYTES];
> > +                struct {
> > +                        /** @lock: to synchronize shared access. */
> > +                        spinlock_t lock;
> > +
> > +                        /**
> > +                         * @refs: Counts number of distinct uses of this
> > +                         * bpage (1 tick for each message that is using
> > +                         * this page, plus an additional tick if the @owner
> > +                         * field is set).
> > +                         */
> > +                        atomic_t refs;
> > +
> > +                        /**
> > +                         * @owner: kernel core that currently owns this page
> > +                         * (< 0 if none).
> > +                         */
> > +                        int owner;
> > +
> > +                        /**
> > +                         * @expiration: time (in sched_clock() units) after
> > +                         * which it's OK to steal this page from its current
> > +                         * owner (if @refs is 1).
> > +                         */
> > +                        __u64 expiration;
> > +                };
>
> ____cacheline_aligned instead of inserting the struct into a union
> should suffice.
Done (but now that alloc_percpu_gfp is being used I'm not sure this is
needed to ensure alignment?).
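With that change the definition becomes roughly (sketch):

struct homa_bpage {
        /** @lock: to synchronize shared access. */
        spinlock_t lock;

        /**
         * @refs: distinct uses of this bpage (1 tick per message using
         * the page, plus an additional tick if @owner is set).
         */
        atomic_t refs;

        /** @owner: kernel core that currently owns this page (< 0 if none). */
        int owner;

        /**
         * @expiration: time (sched_clock() units) after which it's OK to
         * steal this page from its current owner (if @refs is 1).
         */
        __u64 expiration;
} ____cacheline_aligned;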
-John-