Message-ID: <CAAVpQUC7HkbpVjYdG0q17uwXw0mH8z1nutyKyGcH9YD-CTwH6A@mail.gmail.com>
Date: Fri, 1 Aug 2025 09:46:37 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Menglong Dong <menglong8.dong@...il.com>
Cc: Eric Dumazet <edumazet@...gle.com>, kraig@...gle.com, ncardwell@...gle.com,
davem@...emloft.net, dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
horms@...nel.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net v2] net: ip: order the reuseport socket in __inet_hash
On Fri, Aug 1, 2025 at 3:42 AM Menglong Dong <menglong8.dong@...il.com> wrote:
>
> On Fri, Aug 1, 2025 at 5:46 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Fri, Aug 1, 2025 at 2:09 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> > >
> > > Currently, the socket lookup in inet_lhash2_lookup() terminates as soon
> > > as it finds a reuseport socket, so the selected socket may not be the
> > > best match.
> > >
> > > For example, socket1 and socket2 both listen on "0.0.0.0:1234", but
> > > socket1 is bound to "eth0". We create socket1 first, and then socket2.
> > > Then, all connections go to socket2, which is not expected, as socket1
> > > has higher priority.
> > >
> > > This can cause unexpected behavior if TCP MD5 keys are used, as described
> > > in Documentation/networking/vrf.rst -> Applications.
> > >
> > > Therefore, compute a score for each reuseport socket and insert it into
> > > the list in order in __inet_hash(). Sockets with a higher score are
> > > placed closer to the head.
> > >
> > > Link: https://lore.kernel.org/netdev/20250731123309.184496-1-dongml2@chinatelecom.cn/
> > > Signed-off-by: Menglong Dong <dongml2@...natelecom.cn>
> >
> > You forgot a Fixes: tag, and a selftest.
>
> I was not sure if it should have a Fixes tag; I'll add one in the next
> version. Kuniyuki's test case is nice. Should I put the selftest in
> the commit log?
The python example is handy and makes the issue easy to
understand, so feel free to add it to the commit log if needed.
But please add a separate patch with a test under
tools/testing/selftests/net/.
It will help us avoid introducing regressions in the future, as
it's run for each patch by NIPA CI.
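For reference, the scenario boils down to something like this
in C (a rough sketch only, not the actual selftest; it assumes
an "eth0" device exists and enough privileges for
SO_BINDTODEVICE):

	#include <netinet/in.h>
	#include <sys/socket.h>

	static int listener(int bind_dev)
	{
		struct sockaddr_in a = { .sin_family = AF_INET,
					 .sin_port = htons(1234) };
		int one = 1;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
		if (bind_dev)	/* scores higher in compute_score() */
			setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "eth0", 5);
		bind(fd, (struct sockaddr *)&a, sizeof(a));
		listen(fd, 16);
		return fd;
	}

	int main(void)
	{
		int sk1 = listener(1);	/* created first, bound to eth0 */
		int sk2 = listener(0);	/* created second, wildcard */

		/* Connections arriving on eth0 should be accepted on sk1,
		 * but they are all delivered to sk2 because the lookup
		 * stops at the first reuseport socket in the bucket
		 * (accept loop omitted).
		 */
		return 0;
	}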
>
> >
> > > ---
> > > v2:
> > > - As Kuniyuki advised, sort the reuseport sockets in __inet_hash() to keep
> > > the reuseport lookup O(1)
> >
> > Keeping the list sorted is difficult; we would have to intercept
> > SO_BINDTODEVICE, SO_BINDTOIFINDEX, and SO_INCOMING_CPU.
> >
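> > For example (a userspace sketch, assuming enough privileges for
> > SO_BINDTODEVICE): all of these calls are still legal after the
> > socket has been hashed, so any precomputed order can go stale:
> >
> > 	#include <netinet/in.h>
> > 	#include <sys/socket.h>
> >
> > 	int main(void)
> > 	{
> > 		struct sockaddr_in a = { .sin_family = AF_INET,
> > 					 .sin_port = htons(1234) };
> > 		int one = 1, cpu = 2;
> > 		int fd = socket(AF_INET, SOCK_STREAM, 0);
> >
> > 		setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
> > 		bind(fd, (struct sockaddr *)&a, sizeof(a));
> > 		listen(fd, 16);	/* sk is hashed (and would be sorted) here */
> >
> > 		/* both change the would-be score after the sort */
> > 		setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, "eth0", 5);
> > 		setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
> > 		return 0;
> > 	}
> >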
> > This also makes the patch risky to backport to stable versions,
> > because it is complex and possibly buggy.
> >
> > Therefore I prefer your first approach.
>
> Kuniyuki also has a similar patch:
> https://lore.kernel.org/netdev/CADxym3ZY7Lm9mgv83e2db7o3ZZMcLDa=vDf6nJSs1m0_tUk5Bg@mail.gmail.com/T/#m56ee67b2fdf85ce568fd1339def92c53232d5b49
>
> Would his be better and more stable? Kuniyuki says the first approach
> kills the O(1) lookup for reuseport sockets :/
At least, your compute_reuseport_score() is wrong;
so_incoming_cpu is not considered when grouping reuseport
sockets, it does not take the wildcard address and ipv6_only
into account, etc.
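Just to illustrate the kind of conditions I mean, a rough,
untested sketch (the weights are only illustrative, and the
SO_INCOMING_CPU grouping problem is left unsolved):

	static int compute_reuseport_score(const struct sock *sk)
	{
		int score = 0;

		/* a non-wildcard local address must outrank any
		 * wildcard socket
		 */
		if (!inet_rcv_saddr_any(sk))
			score += 8;

		if (sk->sk_bound_dev_if)
			score += 4;

		/* plain AF_INET ranks above a dual-stack AF_INET6
		 * socket, which also matches v4-mapped traffic
		 */
		if (sk->sk_family == PF_INET)
			score += 2;
		else if (!ipv6_only_sock(sk))
			score += 1;

		return score;
	}

But a single precomputed score still cannot express per-lookup
state such as the incoming CPU, which is part of why restoring
the O(1) lookup needs more patches.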
And I agree this is net-next material, a bit risky to backport.
Once net-next is open, I'll follow up to restore the O(1)
lookup with a few more patches to handle corner cases
that I mentioned in the v1 thread.
>
> Anyway, I'll send a v3 with the first approach, with the
> Fixes tag and selftests.
nit: The subject prefix should start with "tcp:" as UDP
and SCTP do not seem to have this issue.
>
> Thanks!
> Menglong Dong
>
> >
> > > ---
> > > include/linux/rculist_nulls.h | 34 ++++++++++++++++++++++++
> > > include/net/sock.h | 5 ++++
> > > net/ipv4/inet_hashtables.c | 49 ++++++++++++++++++++++++++++++++---
> > > 3 files changed, 84 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> > > index 89186c499dd4..da500f4ae142 100644
> > > --- a/include/linux/rculist_nulls.h
> > > +++ b/include/linux/rculist_nulls.h
> > > @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
> > > #define hlist_nulls_next_rcu(node) \
> > > (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
> > >
> > > +/**
> > > + * hlist_nulls_pprev_rcu - returns the pprev pointer of @node.
> > > + * @node: element of the list.
> > > + */
> > > +#define hlist_nulls_pprev_rcu(node) \
> > > + (*((struct hlist_nulls_node __rcu __force **)&(node)->pprev))
> > > +
> > > /**
> > > * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
> > > * @n: the element to delete from the hash list.
> > > @@ -145,6 +152,33 @@ static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
> > > }
> > > }
> > >
> > > +/**
> > > + * hlist_nulls_add_before_rcu - insert a new element before an existing one
> > > + * @n: the new element to add to the hash list.
> > > + * @next: the existing element to add the new element before.
> > > + *
> > > + * Description:
> > > + * Adds the specified element to the specified hlist
> > > + * before the specified node while permitting racing traversals.
> > > + *
> > > + * The caller must take whatever precautions are necessary
> > > + * (such as holding appropriate locks) to avoid racing
> > > + * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
> > > + * or hlist_nulls_del_rcu(), running on this same list.
> > > + * However, it is perfectly legal to run concurrently with
> > > + * the _rcu list-traversal primitives, such as
> > > + * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
> > > + * problems on Alpha CPUs.
> > > + */
> > > +static inline void hlist_nulls_add_before_rcu(struct hlist_nulls_node *n,
> > > + struct hlist_nulls_node *next)
> > > +{
> > > + WRITE_ONCE(n->pprev, next->pprev);
> > I do not think WRITE_ONCE() is necessary here; @n is private to this CPU,
> > and the following rcu_assign_pointer() has the needed barrier.
> >
> > > + n->next = next;
> > > + rcu_assign_pointer(hlist_nulls_pprev_rcu(n), n);
> > > + WRITE_ONCE(next->pprev, &n->next);
> > > +}
> > > +
> > > /* after that hlist_nulls_del will work */
> > > static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
> > > {
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index c8a4b283df6f..42aa1919eeee 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -885,6 +885,11 @@ static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nu
> > > hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
> > > }
> > >
> > > +static inline void __sk_nulls_add_node_before_rcu(struct sock *sk, struct sock *next)
> > > +{
> > > + hlist_nulls_add_before_rcu(&sk->sk_nulls_node, &next->sk_nulls_node);
> > > +}
> > > +
> > > static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
> > > {
> > > sock_hold(sk);
> > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > index ceeeec9b7290..80d8bec41a58 100644
> > > --- a/net/ipv4/inet_hashtables.c
> > > +++ b/net/ipv4/inet_hashtables.c
> > > @@ -334,6 +334,26 @@ static inline int compute_score(struct sock *sk, const struct net *net,
> > > return score;
> > > }
> > >
> > > +static inline int compute_reuseport_score(struct sock *sk)
> > > +{
> > > + int score = 0;
> > > +
> > > + if (sk->sk_bound_dev_if)
> > > + score += 2;
> > > +
> > > + if (sk->sk_family == PF_INET)
> > > + score += 10;
> > > +
> > > + /* the priority of sk_incoming_cpu should be lower than sk_bound_dev_if,
> > > + * as it's optional in compute_score(). Thank God, this is the only
> >
> > Please do not bring God here.
> >
> > > + * variable condition, which we can't judge now.
> > > + */
> > > + if (READ_ONCE(sk->sk_incoming_cpu))
> > > + score++;
> > > +
> > > + return score;
> > > +}
> > > +
> > > /**
> > > * inet_lookup_reuseport() - execute reuseport logic on AF_INET socket if necessary.
> > > * @net: network namespace.
> > > @@ -739,6 +759,27 @@ static int inet_reuseport_add_sock(struct sock *sk,
> > > return reuseport_alloc(sk, inet_rcv_saddr_any(sk));
> > > }
> > >
> > > +static void inet_hash_reuseport(struct sock *sk, struct hlist_nulls_head *head)
> > > +{
> > > + const struct hlist_nulls_node *node;
> > > + int score, curscore;
> > > + struct sock *sk2;
> > > +
> > > + curscore = compute_reuseport_score(sk);
> > > + /* lookup the socket to insert before */
> > > + sk_nulls_for_each_rcu(sk2, node, head) {
> > > + if (!sk2->sk_reuseport)
> > > + continue;
> > > + score = compute_reuseport_score(sk2);
> > > + if (score <= curscore) {
> > > + __sk_nulls_add_node_before_rcu(sk, sk2);
> > > + return;
> > > + }
> > > + }
> > > +
> > > + __sk_nulls_add_node_tail_rcu(sk, head);
> > > +}
> > > +
> > > int __inet_hash(struct sock *sk, struct sock *osk)
> > > {
> > > struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
> > > @@ -761,11 +802,11 @@ int __inet_hash(struct sock *sk, struct sock *osk)
> > > goto unlock;
> > > }
> > > sock_set_flag(sk, SOCK_RCU_FREE);
> > > - if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> > > - sk->sk_family == AF_INET6)
> > > - __sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
> > > - else
> > > + if (!sk->sk_reuseport)
> > > __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> > > + else
> > > + inet_hash_reuseport(sk, &ilb2->nulls_head);
> > > +
> > > sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> > > unlock:
> > > spin_unlock(&ilb2->lock);
> > > --
> > > 2.50.1
> > >