Message-ID: <CAGXJAmyN2XUjk7hp-7o0Em9b_6Y5S3iiS14KXQWSKUWJXnnOvA@mail.gmail.com>
Date: Wed, 7 May 2025 09:11:01 -0700
From: John Ousterhout <ouster@...stanford.edu>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org
Subject: Re: [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
On Mon, May 5, 2025 at 4:06 AM Paolo Abeni <pabeni@...hat.com> wrote:
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> [...]
> > +{
> > + /* Note: when we return, the object must be initialized so it's
> > + * safe to call homa_peertab_destroy, even if this function returns
> > + * an error.
> > + */
> > + int i;
> > +
> > + spin_lock_init(&peertab->write_lock);
> > + INIT_LIST_HEAD(&peertab->dead_dsts);
> > + peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
> > + sizeof(*peertab->buckets));
>
> This struct looks way too big to be allocated on per netns basis. You
> should use a global table and include the netns in the lookup key.
Are there likely to be lots of netns's in a system? I thought I read
someplace that a hardware NIC must belong exclusively to a single
netns, so from that I assumed there couldn't be more than a few
netns's. Can there be virtual NICs, leading to lots of netns's? Can
you give me a ballpark number for how many netns's there might be in a
system with "lots" of them? This will be useful in making design
tradeoffs.
> > + /* No existing entry; create a new one.
> > + *
> > + * Note: after we acquire the lock, we have to check again to
> > + * make sure the entry still doesn't exist (it might have been
> > + * created by a concurrent invocation of this function).
> > + */
> > + spin_lock_bh(&peertab->write_lock);
> > + hlist_for_each_entry(peer, &peertab->buckets[bucket],
> > + peertab_links) {
> > + if (ipv6_addr_equal(&peer->addr, addr))
> > + goto done;
> > + }
> > + peer = kmalloc(sizeof(*peer), GFP_ATOMIC | __GFP_ZERO);
>
> Please, move the allocation outside the atomic scope and use GFP_KERNEL.
I don't think I can do that because homa_peer_find is invoked in
softirq code, which is atomic, no? It's not disastrous if the
allocation fails; the worst that happens is that an incoming packet
must be discarded (it will be retried later).
> > + if (!peer) {
> > + peer = (struct homa_peer *)ERR_PTR(-ENOMEM);
> > + goto done;
> > + }
> > + peer->addr = *addr;
> > + dst = homa_peer_get_dst(peer, inet);
> > + if (IS_ERR(dst)) {
> > + kfree(peer);
> > + peer = (struct homa_peer *)PTR_ERR(dst);
> > + goto done;
> > + }
> > + peer->dst = dst;
> > + hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
>
> At this point another CPU can lookup 'peer'. Since there are no memory
> barriers it could observe a NULL peer->dst.
Oops, good catch. Should I add 'smp_wmb()' just before the
hlist_add_head_rcu line?
> Also AFAICS new peers are always added when lookup for a different
> address fail and deleted only at netns shutdown time (never for the initns).
Correct.
> You need to:
> - account the memory used for peer
> - enforce an upper bound for the total number of peers (per netns),
> eventually freeing existing old ones.
OK, will do.
> Note that freeing the peer at 'runtime' will require additional changes:
> i.e. likely refcounting will be needed, or at lookup time, after the
> rcu unlock, the code could hit HaF while accessing the looked-up peer.
I understand about reference counting, but I couldn't parse the last
1.5 lines above. What is HaF?
> > + dst = homa_peer_get_dst(peer, &hsk->inet);
> > + if (IS_ERR(dst)) {
> > + kfree(save_dead);
> > + return;
> > + }
> > +
> > + spin_lock_bh(&peertab->write_lock);
> > + now = sched_clock();
>
> Use jiffies instead.
Will do, but this code will probably go away with the refactor to
manage homa_peer memory usage.
> > + save_dead->dst = peer->dst;
> > + save_dead->gc_time = now + 100000000; /* 100 ms */
> > + list_add_tail(&save_dead->dst_links, &peertab->dead_dsts);
> > + homa_peertab_gc_dsts(peertab, now);
> > + peer->dst = dst;
> > + spin_unlock_bh(&peertab->write_lock);
>
> It's unclear to me why you need this additional GC layer on top's of the
> core one.
Now that you mention it, it's unclear to me as well. I think this will
go away in the refactor.
> [...]
> > +static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
> > + struct homa_sock *hsk)
> > +{
> > + if (unlikely(peer->dst->obsolete > 0))
>
> you need to additionally call dst->ops->check
I wasn't aware of dst->ops->check, and I'm a little confused by it
(usage in the kernel doesn't seem totally consistent):
* If I call dst->ops->check(), do I also need to check obsolete
(or should I only call check when obsolete is set)?
* What is the 'cookie' argument to dst->ops->check? Can I just use 0 safely?
* It looks like dst->ops->check now returns a struct dst_entry
pointer. What is the meaning of this? ChatGPT suggests that it is a
replacement dst_entry, if the original is no longer valid. If so, did
the check function release a reference on the original dst_entry
and/or take a reference on the new one? It looks like the return value
is just ignored in many cases, which would suggest that no references
have been taken or released.
-John-