Message-ID: <CADxym3ZV75C42oy54fUBMBtx8Yw6=XaBhvEedjeFgDWd+aZSnw@mail.gmail.com>
Date: Sat, 2 Aug 2025 08:59:31 +0800
From: Menglong Dong <menglong8.dong@...il.com>
To: Kuniyuki Iwashima <kuniyu@...gle.com>
Cc: Eric Dumazet <edumazet@...gle.com>, kraig@...gle.com, ncardwell@...gle.com, 
	davem@...emloft.net, dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com, 
	horms@...nel.org, netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net v2] net: ip: order the reuseport socket in __inet_hash

On Sat, Aug 2, 2025 at 12:46 AM Kuniyuki Iwashima <kuniyu@...gle.com> wrote:
>
> On Fri, Aug 1, 2025 at 3:42 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> >
> > On Fri, Aug 1, 2025 at 5:46 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Fri, Aug 1, 2025 at 2:09 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> > > >
> > > > For now, the socket lookup in inet_lhash2_lookup() terminates as soon
> > > > as a reuseport socket is found, which means the returned socket may
> > > > not be the best match.
> > > >
> > > > For example, we have socket1 and socket2 both listening on
> > > > "0.0.0.0:1234", but socket1 is bound to "eth0". We create socket1
> > > > first, and then socket2. Then, all connections go to socket2, which is
> > > > not expected, as socket1 has the higher priority.
> > > >
> > > > This can cause unexpected behavior if TCP MD5 keys are used, as described
> > > > in Documentation/networking/vrf.rst -> Applications.
> > > >
> > > > Therefore, we compute a score for each reuseport socket and insert it
> > > > into the list in order in __inet_hash(). Sockets with a higher score
> > > > are added closer to the head.
> > > >
> > > > Link: https://lore.kernel.org/netdev/20250731123309.184496-1-dongml2@chinatelecom.cn/
> > > > Signed-off-by: Menglong Dong <dongml2@...natelecom.cn>
> > >
> > > You forgot a Fixes: tag, and a selftest.
> >
> > I wasn't sure whether it should carry a Fixes tag; I'll add one in the
> > next version. Kuniyuki's test case is nice. Should I put the selftest
> > in the commit log?
>
> The python example is handy and makes the issue easy to understand,
> so feel free to add it to the commit log if needed.
>
> But please add a separate patch with a test under
> tools/testing/selftests/net/.
>
> It will help us avoid introducing regressions in the future, as it's
> run for each patch by NIPA CI.

Ok! I'll add a testcase to tools/testing/selftests/net/.
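For reference, here is a rough sketch of the scenario from the commit
message (my own illustration, not Kuniyuki's actual example): two TCP
listeners share a port via SO_REUSEPORT; in the real case socket1 would
additionally be bound to a device with SO_BINDTODEVICE, which needs
CAP_NET_RAW and is therefore left as an optional argument here.

```python
import socket

def make_listener(port, dev=None):
    # Both sockets must set SO_REUSEPORT before bind() to share the port.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    if dev:
        # Requires CAP_NET_RAW. With the bug, this device-bound listener
        # loses all connections to a plain listener created after it.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, dev)
    s.bind(("0.0.0.0", port))
    s.listen()
    return s

# socket1 is created first; with the bug, socket2 (created later) sits at
# the head of the listening hash chain and receives all the connections.
socket1 = make_listener(0)          # would pass dev=b"eth0" on a real setup
port = socket1.getsockname()[1]
socket2 = make_listener(port)
```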

>
>
> >
> > >
> > > > ---
> > > > v2:
> > > > - As Kuniyuki advised, sort the reuseport sockets in __inet_hash() to keep
> > > >   the reuseport lookup O(1)
> > >
> > > Keeping the list sorted is difficult; we would have to intercept
> > > SO_BINDTODEVICE, SO_BINDTOIFINDEX, SO_INCOMING_CPU.
> > >
> > > This also makes the patch risky to backport to stable versions,
> > > because it is complex and possibly buggy.
> > >
> > > Therefore I prefer your first approach.
> >
> > Kuniyuki also has a similar patch:
> > https://lore.kernel.org/netdev/CADxym3ZY7Lm9mgv83e2db7o3ZZMcLDa=vDf6nJSs1m0_tUk5Bg@mail.gmail.com/T/#m56ee67b2fdf85ce568fd1339def92c53232d5b49
> >
> > Would his be better and more stable? Kuniyuki says the first approach
> > kills the O(1) lookup for reuseport sockets :/
>
> At least your compute_reuseport_score() is wrong;
> so_incoming_cpu is not considered when grouping reuseport
> sockets, it does not take wildcard and ipv6_only into
> account, etc.

Should ipv6_only be considered? If socketA is ipv6_only
and socketB is not, should an IPv6 connection select
socketA? I don't see such logic in compute_score() :/

>
> And I agree this is net-next material, a bit risky to backport.
>
> Once net-next is open, I'll follow up to restore the O(1)
> lookup with a few more patches to handle corner cases
> that I mentioned in v1 thread.
>
> >
> > Anyway, I'll send a v3 with the first approach, along with
> > the Fixes tag + selftests.
>
> nit: The subject prefix should start with "tcp:" as UDP
> and SCTP do not seem to have this issue.

Ok!

>
>
> >
> > Thanks!
> > Menglong Dong
> >
> > >
> > > > ---
> > > >  include/linux/rculist_nulls.h | 34 ++++++++++++++++++++++++
> > > >  include/net/sock.h            |  5 ++++
> > > >  net/ipv4/inet_hashtables.c    | 49 ++++++++++++++++++++++++++++++++---
> > > >  3 files changed, 84 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> > > > index 89186c499dd4..da500f4ae142 100644
> > > > --- a/include/linux/rculist_nulls.h
> > > > +++ b/include/linux/rculist_nulls.h
> > > > @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
> > > >  #define hlist_nulls_next_rcu(node) \
> > > >         (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
> > > >
> > > > +/**
> > > > + * hlist_nulls_pprev_rcu - returns the pprev pointer of @node.
> > > > + * @node: element of the list.
> > > > + */
> > > > +#define hlist_nulls_pprev_rcu(node) \
> > > > +       (*((struct hlist_nulls_node __rcu __force **)&(node)->pprev))
> > > > +
> > > >  /**
> > > >   * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
> > > >   * @n: the element to delete from the hash list.
> > > > @@ -145,6 +152,33 @@ static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
> > > >         }
> > > >  }
> > > >
> > > > +/**
> > > > + * hlist_nulls_add_before_rcu - add the new element before the specified node
> > > > + * @n: the new element to add to the hash list.
> > > > + * @next: the existing element to add the new element before.
> > > > + *
> > > > + * Description:
> > > > + * Adds the specified element to the specified hlist
> > > > + * before the specified node while permitting racing traversals.
> > > > + *
> > > > + * The caller must take whatever precautions are necessary
> > > > + * (such as holding appropriate locks) to avoid racing
> > > > + * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
> > > > + * or hlist_nulls_del_rcu(), running on this same list.
> > > > + * However, it is perfectly legal to run concurrently with
> > > > + * the _rcu list-traversal primitives, such as
> > > > + * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
> > > > + * problems on Alpha CPUs.
> > > > + */
> > > > +static inline void hlist_nulls_add_before_rcu(struct hlist_nulls_node *n,
> > > > +                                             struct hlist_nulls_node *next)
> > > > +{
> > > > +       WRITE_ONCE(n->pprev, next->pprev);
> > > I do not think WRITE_ONCE() is necessary here, @n is private to this cpu,
> > > and following rcu_assign_pointer() has the needed barrier.
> > >
> > > > +       n->next = next;
> > > > +       rcu_assign_pointer(hlist_nulls_pprev_rcu(n), n);
> > > > +       WRITE_ONCE(next->pprev, &n->next);
> > > > +}
> > > > +
> > > >  /* after that hlist_nulls_del will work */
> > > >  static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
> > > >  {
> > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > index c8a4b283df6f..42aa1919eeee 100644
> > > > --- a/include/net/sock.h
> > > > +++ b/include/net/sock.h
> > > > @@ -885,6 +885,11 @@ static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nu
> > > >         hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
> > > >  }
> > > >
> > > > +static inline void __sk_nulls_add_node_before_rcu(struct sock *sk, struct sock *next)
> > > > +{
> > > > +       hlist_nulls_add_before_rcu(&sk->sk_nulls_node, &next->sk_nulls_node);
> > > > +}
> > > > +
> > > >  static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
> > > >  {
> > > >         sock_hold(sk);
> > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > index ceeeec9b7290..80d8bec41a58 100644
> > > > --- a/net/ipv4/inet_hashtables.c
> > > > +++ b/net/ipv4/inet_hashtables.c
> > > > @@ -334,6 +334,26 @@ static inline int compute_score(struct sock *sk, const struct net *net,
> > > >         return score;
> > > >  }
> > > >
> > > > +static inline int compute_reuseport_score(struct sock *sk)
> > > > +{
> > > > +       int score = 0;
> > > > +
> > > > +       if (sk->sk_bound_dev_if)
> > > > +               score += 2;
> > > > +
> > > > +       if (sk->sk_family == PF_INET)
> > > > +               score += 10;
> > > > +
> > > > +       /* the priority of sk_incoming_cpu should be lower than sk_bound_dev_if,
> > > > +        * as it's optional in compute_score(). Thank God, this is the only
> > >
> > > Please do not bring God here.
> > >
> > > > +        * variable condition, which we can't judge now.
> > > > +        */
> > > > +       if (READ_ONCE(sk->sk_incoming_cpu))
> > > > +               score++;
> > > > +
> > > > +       return score;
> > > > +}
> > > > +
> > > >  /**
> > > >   * inet_lookup_reuseport() - execute reuseport logic on AF_INET socket if necessary.
> > > >   * @net: network namespace.
> > > > @@ -739,6 +759,27 @@ static int inet_reuseport_add_sock(struct sock *sk,
> > > >         return reuseport_alloc(sk, inet_rcv_saddr_any(sk));
> > > >  }
> > > >
> > > > +static void inet_hash_reuseport(struct sock *sk, struct hlist_nulls_head *head)
> > > > +{
> > > > +       const struct hlist_nulls_node *node;
> > > > +       int score, curscore;
> > > > +       struct sock *sk2;
> > > > +
> > > > +       curscore = compute_reuseport_score(sk);
> > > > +       /* lookup the socket to insert before */
> > > > +       sk_nulls_for_each_rcu(sk2, node, head) {
> > > > +               if (!sk2->sk_reuseport)
> > > > +                       continue;
> > > > +               score = compute_reuseport_score(sk2);
> > > > +               if (score <= curscore) {
> > > > +                       __sk_nulls_add_node_before_rcu(sk, sk2);
> > > > +                       return;
> > > > +               }
> > > > +       }
> > > > +
> > > > +       __sk_nulls_add_node_tail_rcu(sk, head);
> > > > +}
> > > > +
> > > >  int __inet_hash(struct sock *sk, struct sock *osk)
> > > >  {
> > > >         struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
> > > > @@ -761,11 +802,11 @@ int __inet_hash(struct sock *sk, struct sock *osk)
> > > >                         goto unlock;
> > > >         }
> > > >         sock_set_flag(sk, SOCK_RCU_FREE);
> > > > -       if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
> > > > -               sk->sk_family == AF_INET6)
> > > > -               __sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
> > > > -       else
> > > > +       if (!sk->sk_reuseport)
> > > >                 __sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> > > > +       else
> > > > +               inet_hash_reuseport(sk, &ilb2->nulls_head);
> > > > +
> > > >         sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> > > >  unlock:
> > > >         spin_unlock(&ilb2->lock);
> > > > --
> > > > 2.50.1
> > > >
