Message-ID: <CAGXJAmyN2XUjk7hp-7o0Em9b_6Y5S3iiS14KXQWSKUWJXnnOvA@mail.gmail.com>
Date: Wed, 7 May 2025 09:11:01 -0700
From: John Ousterhout <ouster@...stanford.edu>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org
Subject: Re: [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
On Mon, May 5, 2025 at 4:06 AM Paolo Abeni <pabeni@...hat.com> wrote:
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> [...]
> > +{
> > + /* Note: when we return, the object must be initialized so it's
> > + * safe to call homa_peertab_destroy, even if this function returns
> > + * an error.
> > + */
> > + int i;
> > +
> > + spin_lock_init(&peertab->write_lock);
> > + INIT_LIST_HEAD(&peertab->dead_dsts);
> > + peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
> > + sizeof(*peertab->buckets));
>
> This struct looks way too big to be allocated on per netns basis. You
> should use a global table and include the netns in the lookup key.
Are there likely to be lots of netns's in a system? I thought I read
someplace that a hardware NIC must belong exclusively to a single
netns, so from that I assumed there couldn't be more than a few
netns's. Can there be virtual NICs, leading to lots of netns's? Can
you give me a ballpark number for how many netns's there might be in a
system with "lots" of them? This will be useful in making design
tradeoffs.
> > + /* No existing entry; create a new one.
> > + *
> > + * Note: after we acquire the lock, we have to check again to
> > + * make sure the entry still doesn't exist (it might have been
> > + * created by a concurrent invocation of this function).
> > + */
> > + spin_lock_bh(&peertab->write_lock);
> > + hlist_for_each_entry(peer, &peertab->buckets[bucket],
> > + peertab_links) {
> > + if (ipv6_addr_equal(&peer->addr, addr))
> > + goto done;
> > + }
> > + peer = kmalloc(sizeof(*peer), GFP_ATOMIC | __GFP_ZERO);
>
> Please, move the allocation outside the atomic scope and use GFP_KERNEL.
I don't think I can do that because homa_peer_find is invoked in
softirq code, which is atomic, no? It's not disastrous if the
allocation fails; the worst that happens is that an incoming packet
must be discarded (it will be retried later).
> > + if (!peer) {
> > + peer = (struct homa_peer *)ERR_PTR(-ENOMEM);
> > + goto done;
> > + }
> > + peer->addr = *addr;
> > + dst = homa_peer_get_dst(peer, inet);
> > + if (IS_ERR(dst)) {
> > + kfree(peer);
> > + peer = (struct homa_peer *)PTR_ERR(dst);
> > + goto done;
> > + }
> > + peer->dst = dst;
> > + hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
>
> At this point another CPU can lookup 'peer'. Since there are no memory
> barriers it could observe a NULL peer->dst.
Oops, good catch. Should I add 'smp_wmb()' just before the
hlist_add_head_rcu line?
> Also AFAICS new peers are always added when lookup for a different
> address fail and deleted only at netns shutdown time (never for the initns).
Correct.
> You need to:
> - account the memory used for peer
> - enforce an upper bound for the total number of peers (per netns),
> eventually freeing existing old ones.
OK, will do.
> Note that freeing the peer at 'runtime' will require additional changes:
> i.e. likely refcounting will be needed, or at lookup time, after the
> rcu unlock, the code could hit HaF while accessing the looked-up peer.
I understand about reference counting, but I couldn't parse the last
1.5 lines above. What is HaF?
> > + dst = homa_peer_get_dst(peer, &hsk->inet);
> > + if (IS_ERR(dst)) {
> > + kfree(save_dead);
> > + return;
> > + }
> > +
> > + spin_lock_bh(&peertab->write_lock);
> > + now = sched_clock();
>
> Use jiffies instead.
Will do, but this code will probably go away with the refactor to
manage homa_peer memory usage.
> > + save_dead->dst = peer->dst;
> > + save_dead->gc_time = now + 100000000; /* 100 ms */
> > + list_add_tail(&save_dead->dst_links, &peertab->dead_dsts);
> > + homa_peertab_gc_dsts(peertab, now);
> > + peer->dst = dst;
> > + spin_unlock_bh(&peertab->write_lock);
>
> It's unclear to me why you need this additional GC layer on top's of the
> core one.
Now that you mention it, it's unclear to me as well. I think this will
go away in the refactor.
> [...]
> > +static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
> > + struct homa_sock *hsk)
> > +{
> > + if (unlikely(peer->dst->obsolete > 0))
>
> you need to additionally call dst->ops->check
I wasn't aware of dst->ops->check, and I'm a little confused by it
(usage in the kernel doesn't seem totally consistent):
* If I call dst->ops->check(), do I also need to check obsolete
(or should I only call check when obsolete is set)?
* What is the 'cookie' argument to dst->ops->check? Can I just use 0 safely?
* It looks like dst->ops->check now returns a struct dst_entry
pointer. What is the meaning of this? ChatGPT suggests that it is a
replacement dst_entry, if the original is no longer valid. If so, did
the check function release a reference on the original dst_entry
and/or take a reference on the new one? It looks like the return value
is just ignored in many cases, which would suggest that no references
have been taken or released.
-John-