Message-ID: <CAGXJAmyizsc1Jk9VytJJ8OcCOTHoqUrgzGZUc1WDeanGyEV6pg@mail.gmail.com>
Date: Tue, 6 May 2025 16:28:38 -0700
From: John Ousterhout <ouster@...stanford.edu>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, edumazet@...gle.com, horms@...nel.org, 
	kuba@...nel.org
Subject: Re: [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c

On Mon, May 5, 2025 at 2:51 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> [...]
> > +/**
> > + * set_bpages_needed() - Set the bpages_needed field of @pool based
> > + * on the length of the first RPC that's waiting for buffer space.
> > + * The caller must own the lock for @pool->hsk.
> > + * @pool: Pool to update.
> > + */
> > +static void set_bpages_needed(struct homa_pool *pool)
> > +{
> > +     struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> > +                     struct homa_rpc, buf_links);
>
> Minor nit: please insert an empty line between variable declaration and
> code.

Done. For some reason checkpatch.pl doesn't complain about this (or
the next comment below). Until yesterday I wasn't aware of the
--strict argument to checkpatch.pl, which may explain why patchwork
was finding checkpatch errors even though I was running checkpatch. I
have now made a pass over all the Homa code to clean up --strict
issues. But even with --strict, checkpatch.pl doesn't complain about
the indentation problem above. Are there additional switches I should
be giving to checkpatch.pl in addition to --strict?

> > +     pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
> > +                     >> HOMA_BPAGE_SHIFT;
>
> Minor nit: please fix the indentation above

Fixed.

> > +/**
> > + * homa_pool_new() - Allocate and initialize a new homa_pool (it will have
> > + * no region associated with it until homa_pool_set_region is invoked).
> > + * @hsk:          Socket the pool will be associated with.
> > + * Return: A pointer to the new pool or a negative errno.
> > + */
> > +struct homa_pool *homa_pool_new(struct homa_sock *hsk)
>
> The preferred name for an allocator includes 'alloc', not 'new'.

Got it. I have scanned the code base and replaced 'new' everywhere
with 'alloc'. I also replaced 'destroy' with 'free'.

> > +{
> > +     struct homa_pool *pool;
> > +
> > +     pool = kzalloc(sizeof(*pool), GFP_ATOMIC);
>
> You should try to use GFP_KERNEL allocation as much as you can, and use
> GFP_ATOMIC only in atomic context. If needed, try to move the function
> outside the atomic scope doing the allocation before acquiring the
> lock/rcu.

Will do. I was able to refactor the homa_pool code so that it doesn't
need GFP_ATOMIC.
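
For concreteness, the refactored pattern looks roughly like this (a
sketch only; the lock names and surrounding details differ in the real
code):

    struct homa_pool *pool;

    /* Allocate with GFP_KERNEL before entering atomic context,
     * instead of allocating with GFP_ATOMIC under the lock.
     */
    pool = kzalloc(sizeof(*pool), GFP_KERNEL);  /* may sleep */
    if (!pool)
            return ERR_PTR(-ENOMEM);

    homa_sock_lock(hsk);    /* atomic context begins here */
    /* ... publish the pool in the socket ... */
    homa_sock_unlock(hsk);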

> > +     pool->num_cores = nr_cpu_ids;
>
> The 'num_cores' field is likely not needed, and it's never used in this
> series.

Yep, that field is no longer used. I have deleted it.

> > +     pool->check_waiting_invoked = 0;
> > +
> > +     return 0;
> > +
> > +error:
> > +     kfree(pool->descriptors);
> > +     free_percpu(pool->cores);
>
> The above assumes that 'pool' will be zeroed at allocation time, but the
> allocator does not do that. You should probably add the __GFP_ZERO flag
> to the pool allocator.

The pool is allocated with kzalloc; that zeroes it, no?

> > +bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
> > +{
> > +     int ref_count = atomic_read(&bpage->refs);
> > +
> > +     return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
> > +                     bpage->expiration <= now);
>
> Minor nit: please fix the indentation above. Other cases below. Please
> validate your patch with the checkpatch.pl script.

I have been running checkpatch.pl, but as I mentioned above it doesn't
seem to be reporting everything.

> > +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
> > +                     int set_owner)
> > +{
> > +     int core_num = smp_processor_id();
> > +     struct homa_pool_core *core;
> > +     u64 now = sched_clock();
>
> From sched_clock() documentation:
>
> sched_clock() has no promise of monotonicity or bounded drift between
> CPUs, use (which you should not) requires disabling IRQs.
>
> Can't be used for an expiration time. You could use 'jiffies' instead,

Jiffies are *really* coarse-grained (4 ms on my servers). It's possible
that I could make them work in this situation, but in general jiffies
won't work for Homa. Homa needs to make decisions at microsecond
scale, and an RPC that takes a full jiffy to complete is completely
unacceptable. Homa needs a fine-grained (e.g. cycle-level) clock that
is monotonic and synchronized across cores, and as far as I know, such
a clock is available on every server where Homa is likely
to run. For example, I believe that the TSC counters on both Intel and
AMD chips have had the right properties for at least 10-15 years. And
isn't sched_clock based on TSC where it's available? So even though
sched_clock makes no official promises, isn't the reality actually
fine? Can I simply stipulate that Homa is not appropriate for any
machine where sched_clock doesn't have the properties Homa needs (this
won't be a significant limitation in practice)?

Ultimately I think Linux needs to bite the bullet and provide an
official fine-grained clock with ns precision.
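
To make the requirement concrete, here is roughly how the clock gets
used (a sketch; lease_ns stands in for the real lease-length
parameter, and steal_page is a hypothetical helper):

    /* Core A grants temporary ownership of a bpage: */
    bpage->owner = smp_processor_id();
    bpage->expiration = sched_clock() + lease_ns;

    /* Core B later decides whether the lease has expired and the
     * page can be stolen. This comparison is only meaningful if the
     * two cores' clocks don't drift relative to each other.
     */
    if (atomic_read(&bpage->refs) == 1 && bpage->owner >= 0 &&
        bpage->expiration <= sched_clock())
            steal_page(bpage);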

> > +
> > +                     limit = pool->num_bpages
> > +                                     - atomic_read(&pool->free_bpages);
>
> Nit: indentation above, the operator should stay on the first line.

Fixed but, again, checkpatch.pl didn't report it.

> > +
> > +             /* Figure out whether this candidate is free (or can be
> > +              * stolen). Do a quick check without locking the page, and
> > +              * if the page looks promising, then lock it and check again
> > +              * (must check again in case someone else snuck in and
> > +              * grabbed the page).
> > +              */
> > +             if (!homa_bpage_available(bpage, now))
> > +                     continue;
>
> homa_bpage_available() accesses bpage without lock, so needs READ_ONCE()
> annotations on the relevant fields, and you need to add paired
> WRITE_ONCE() when updating them.

I think the code is safe as is. Even though some of the fields
accessed by homa_bpage_available are not atomic, it's not a disaster
if they return stale values, since homa_bpage_available is invoked
again after acquiring the lock before making any final decisions. The
worst that can happen is (a) skipping over a bpage that's actually
available or (b) acquiring the lock only to discover the bpage wasn't
actually available (and then skipping it). Neither of these is
problematic.

Also, I'm not sure what you mean by "READ_ONCE() annotations on the
relevant fields". Do I need something additional in the field
declaration, in addition to using READ_ONCE() and WRITE_ONCE() to
access the field?
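
Concretely, is this the pattern you have in mind (a sketch; the field
declarations stay as they are, only the access sites change)?

    bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
    {
            int ref_count = atomic_read(&bpage->refs);

            return ref_count == 0 || (ref_count == 1 &&
                            READ_ONCE(bpage->owner) >= 0 &&
                            READ_ONCE(bpage->expiration) <= now);
    }

    /* ...paired with WRITE_ONCE() at every update site (under the
     * bpage lock); lease_ns is a placeholder:
     */
    WRITE_ONCE(bpage->owner, core_id);
    WRITE_ONCE(bpage->expiration, now + lease_ns);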

>
> > +             if (!spin_trylock_bh(&bpage->lock))
>
> Why only trylock? I have a vague memory of some discussion of this
> point in a previous revision. You should at least add a comment here
> or in the commit message explaining why a plain spin_lock does not fit.

I think the reasoning is different here than in other situations we
may have discussed. I have added the following comment:
"Rather than wait for a locked page to become free, just go on to the
next page. If the page is locked, it probably won't turn out to be
available anyway."

> > +     /* The last chunk may be less than a full bpage; for this we use
> > +      * the bpage that we own (and reuse it for multiple messages).
> > +      */
> > +     partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
> > +     if (unlikely(partial == 0))
> > +             goto success;
> > +     core_id = smp_processor_id();
>
> Is this code running in non-preemptible scope? otherwise you need to use
> get_cpu() here and put_cpu() when you are done with 'core_id'.

Yes, it's non-preemptible since a spinlock is being held on the RPC.

>
> > +     core = per_cpu_ptr(pool->cores, core_id);
> > +     bpage = &pool->descriptors[core->page_hint];
> > +     if (!spin_trylock_bh(&bpage->lock))
> > +             spin_lock_bh(&bpage->lock);
>
> I think I already commented on this pattern. Please don't use it.

Sorry, this is not intentional. It came about because the patches for
upstreaming are generated by extracting code from the "full" version
of Homa, removing things such as instrumentation code and
functionality that is not part of this patch series. The stripper is
not smart enough to recognize situations like this where the stripped
code, though technically correct, is nonsensical. I have to go in by
hand and add extra annotations to the source code so that the output
looks reasonable. I have now done that for this situation.

> > +
> > +     /* We get here if there wasn't enough buffer space for this
> > +      * message; add the RPC to hsk->waiting_for_bufs.
>
> Please also add a comment describing why waiting RPCs are sorted by
> message size.

Done. The list is sorted in order to implement the SRPT (shortest
remaining processing time) policy: give priority to the shortest
messages.
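
For reference, insertion into waiting_for_bufs keeps the list ordered
by message length, along these lines (a sketch, not the exact code):

    /* Insert rpc in increasing order of message length, so that the
     * RPC at the head of the list is always the shortest waiting
     * message (SRPT). Caller must hold the socket lock.
     */
    struct homa_rpc *other;

    list_for_each_entry(other, &hsk->waiting_for_bufs, buf_links) {
            if (other->msgin.length > rpc->msgin.length) {
                    list_add_tail(&rpc->buf_links, &other->buf_links);
                    return;
            }
    }
    list_add_tail(&rpc->buf_links, &hsk->waiting_for_bufs);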

> > +             rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> > +                                    struct homa_rpc, buf_links);
> > +             if (!homa_rpc_try_lock(rpc)) {
> > +                     /* Can't just spin on the RPC lock because we're
> > +                      * holding the socket lock (see sync.txt). Instead,
>
> The documentation should live under:
>
> Documentation/networking/
>
> likely in its own subdir, and must be in reStructuredText format.
>
> Here you should just mention that the lock acquiring order is rpc ->
> home sock lock.

I have updated the comment as you requested, and I'll reformat the
.txt files and move them to Documentation/networking/homa.

>
> > +                      * release the socket lock and try the entire
> > +                      * operation again.
> > +                      */
> > +                     homa_sock_unlock(pool->hsk);
> > +                     continue;
> > +             }
> > +             list_del_init(&rpc->buf_links);
> > +             if (list_empty(&pool->hsk->waiting_for_bufs))
> > +                     pool->bpages_needed = INT_MAX;
> > +             else
> > +                     set_bpages_needed(pool);
> > +             homa_sock_unlock(pool->hsk);
> > +             homa_pool_allocate(rpc);
>
> Why you don't need to check the allocation return value here?

There's no need to check the return value because if the allocation
couldn't be made, homa_pool_allocate automatically requeues the RPC.
The only time it returns an "error" is if there is no allocation
region. This should never happen in the first place, and if it does
the right response is simply to ignore the error and continue.

> > + * struct homa_bpage - Contains information about a single page in
> > + * a buffer pool.
> > + */
> > +struct homa_bpage {
> > +     union {
> > +             /**
> > +              * @cache_line: Ensures that each homa_bpage object
> > +              * is exactly one cache line long.
> > +              */
> > +             char cache_line[L1_CACHE_BYTES];
>
> Instead of the struct/union nesting just use ____cacheline_aligned

Done.
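
The struct now looks roughly like this (field list abbreviated):

    struct homa_bpage {
            spinlock_t lock;     /* protects fields below */
            atomic_t refs;       /* references to this bpage */
            int owner;           /* core owning the page, or -1 */
            u64 expiration;      /* time when ownership ends */
    } ____cacheline_aligned;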

> [...]
> > +* Homa's approach means that socket shutdown and deletion can potentially
> > +  occur while operations are underway that hold RPC locks but not the socket
> > +  lock. This creates several potential problems:
> > +  * A socket might be deleted and its memory reclaimed while an RPC still
> > +    has access to it. Homa assumes that Linux will prevent socket deletion
> > +    while the kernel call is executing.
>
> This last sentence is not clear to me. Do you mean that the kernel
> ensures that the socket is freed after the close() syscall?

Apologies... this text is no longer accurate. A socket cannot have its
memory reclaimed until all RPCs associated with the socket have been
ended and reaped. I've revised that documentation so it now looks like
this:

* Homa's approach means that socket shutdown and deletion can potentially
  begin while operations are underway that hold RPC locks but not the socket
  lock. For example, a new RPC creation might be underway when a socket
  is shut down, which could attempt to add the new RPC after homa_sock_shutdown
  thinks it has deleted all RPCs. Handling this requires careful checking
  of hsk->shutdown. For example, during new RPC creation the socket lock
  must be acquired to add the new RPC to those for the socket; after acquiring
  the lock, the creation code must check hsk->shutdown and abort the RPC
  creation if the socket has been shut down.
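
In code form the check is roughly this (a sketch; the lock and list
names are approximate):

    /* During RPC creation the socket lock serializes against
     * homa_sock_shutdown, so checking hsk->shutdown after acquiring
     * the lock is sufficient.
     */
    homa_sock_lock(hsk);
    if (hsk->shutdown) {
            homa_sock_unlock(hsk);
            return ERR_PTR(-ESHUTDOWN);
    }
    list_add_tail(&rpc->socket_links, &hsk->active_rpcs);
    homa_sock_unlock(hsk);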

A question for you: do socket-related kernel calls such as recvmsg
automatically take a reference on the socket or do something else to
protect it? I've been assuming that sockets can't go away during
callbacks such as those for recvmsg and sendmsg.

-John-
