[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20230619175824.50385-1-kuniyu@amazon.com>
Date: Mon, 19 Jun 2023 10:58:24 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <edumazet@...gle.com>
CC: <davem@...emloft.net>, <dsahern@...nel.org>, <duanmuquan@...du.com>,
<kuba@...nel.org>, <kuniyu@...zon.com>, <linux-kernel@...r.kernel.org>,
<netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.
From: Eric Dumazet <edumazet@...gle.com>
Date: Mon, 19 Jun 2023 19:39:20 +0200
> On Mon, Jun 19, 2023 at 7:03 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@...gle.com>
> > Date: Thu, 8 Jun 2023 08:35:20 +0200
> > > On Thu, Jun 8, 2023 at 7:48 AM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > > >
> > > > From: Eric Dumazet <edumazet@...gle.com>
> > > > Date: Wed, 7 Jun 2023 15:32:57 +0200
> > > > > On Wed, Jun 7, 2023 at 1:59 PM Duan,Muquan <duanmuquan@...du.com> wrote:
> > > > > >
> > > > > > Hi, Eric,
> > > > > >
> > > > > > Thanks for your comments!
> > > > > >
> > > > > > About the second lookup, I am sorry that I did not give enough explanations about it. Here are some details:
> > > > > >
> > > > > > 1. The second lookup can find the tw sock and avoid the connection refuse error on userland applications:
> > > > > >
> > > > > > If the original sock is found, but when validating its refcnt, it has been destroyed and sk_refcnt has become 0 after decreased by tcp_time_wait()->tcp_done()->inet_csk_destory_sock()->sock_put().The validation for refcnt fails and the lookup process gets a listener sock.
> > > > > >
> > > > > > When this case occurs, the hashdance has definitely finished,because tcp_done() is executed after inet_twsk_hashdance(). Then if look up the ehash table again, hashdance has already finished, tw sock will be found.
> > > > > >
> > > > > > With this fix, logically we can solve the connection reset issue completely when no established sock is found due to hashdance race.In my reproducing environment, the connection refuse error will occur about every 6 hours with only the fix of bad case (2). But with both of the 2 fixes, I tested it many times, the longest test continues for 10 days, it does not occur again,
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2. About the performance impact:
> > > > > >
> > > > > > A similar scenario is that __inet_lookup_established() will do inet_match() check for the second time, if fails it will look up the list again. It is the extra effort to reduce the race impact without using reader lock. inet_match() failure occurs with about the same probability with refcnt validation failure in my test environment.
> > > > > >
> > > > > > The second lookup will only be done in the condition that FIN segment gets a listener sock.
> > > > > >
> > > > > > About the performance impact:
> > > > > >
> > > > > > 1) Most of the time, this condition will not met, the added codes introduces at most 3 comparisons for each segment.
> > > > > >
> > > > > > The second inet_match() in __inet_lookup_established() does least 3 comparisons for each segmet.
> > > > > >
> > > > > >
> > > > > > 2) When this condition is met, the probability is very small. The impact is similar to the second try due to inet_match() failure. Since tw sock can definitely be found in the second try, I think this cost is worthy to avoid connection reused error on userland applications.
> > > > > >
> > > > > >
> > > > > >
> > > > > > My understanding is, current philosophy is avoiding the reader lock by tolerating the minor defect which occurs in a small probability.For example, if the FIN from passive closer is dropped due to the found sock is destroyed, a retransmission can be tolerated, it only makes the connection termination slower. But I think the bottom line is that it does not affect the userland applications’ functionality. If application fails to connect due to the hashdance race, it can’t be tolerated. In fact, guys from product department push hard on the connection refuse error.
> > > > > >
> > > > > >
> > > > > > About bad case (2):
> > > > > >
> > > > > > tw sock is found, but its tw_refcnt has not been set to 3, it is still 0, validating for sk_refcnt will fail.
> > > > > >
> > > > > > I do not know the reason why setting tw_refcnt after adding it into list, could anyone help point out the reason? It adds extra race because the new added tw sock may be found and checked in other CPU concurrently before ƒsetting tw_refcnt to 3.
> > > > > >
> > > > > > By setting tw_refcnt to 3 before adding it into list, this case will be solved, and almost no cost. In my reproducing environment, it occurs more frequently than bad case (1), it appears about every 20 minutes, bad case (1) appears about every 6 hours.
> > > > > >
> > > > > >
> > > > > >
> > > > > > About the bucket spinlock, the original established sock and tw sock are stored in the ehash table, I concern about the performance when there are lots of short TCP connections, the reader lock may affect the performance of connection creation and termination. Could you share some details of your idea? Thanks in advance.
> > > > > >
> > > > > >
> > > > >
> > > > > Again, you can write a lot of stuff, the fact is that your patch does
> > > > > not solve the issue.
> > > > >
> > > > > You could add 10 lookups, and still miss some cases, because they are
> > > > > all RCU lookups with no barriers.
> > > > >
> > > > > In order to solve the issue of packets for the same 4-tuple being
> > > > > processed by many cpus, the only way to solve races is to add mutual
> > > > > exclusion.
> > > > >
> > > > > Note that we already have to lock the bucket spinlock every time we
> > > > > transition a request socket to socket, a socket to timewait, or any
> > > > > insert/delete.
> > > > >
> > > > > We need to expand the scope of this lock, and cleanup things that we
> > > > > added in the past, because we tried too hard to 'detect races'
> > > >
> > > > How about this ? This is still a workaround though, retry sounds
> > > > better than expanding the scope of the lock given the race is rare.
> > >
> > > The chance of two cpus having to hold the same spinlock is rather small.
> > >
> > > Algo is the following:
> > >
> > > Attempt a lockless/RCU lookup.
> > >
> > > 1) Socket is found, we are good to go. Fast path is still fast.
> > >
> > > 2) Socket is not found in ehash
> > > - We lock the bucket spinlock.
> > > - We retry the lookup
> > > - If socket found, continue with it (release the spinlock when
> > > appropriate, after all write manipulations in the bucket are done)
> > > - If socket still not found, we lookup a listener.
> > > We insert a TCP_NEW_SYN_RECV ....
> > > Again, we release the spinlock when appropriate, after all
> > > write manipulations in the bucket are done)
> > >
> > > No more races, and the fast path is the same.
> >
> > I was looking around the issue this weekend. Is this what you were
> > thinking ? I'm wondering if you were also thinking another races like
> > found_dup_sk/own_req things. e.g.) acquire ehash lock when we start to
> > process reqsk ?
>
> No.
>
> Idea is to hold the bucket lock and keep it until all write operations
> in the bucket have been done.
Thanks for clarification! I understood why it takes time.
Actually I started on the way like
1. convert bool *refcounted to enum (noref, refcnt, locked)
2. add 2nd ehash lookup hoding lock and mark locked
3. release lock when necessary
but switched to this version because there were many callers of
__inet_lookup().
>
>
>
> >
> > Duan, could you test the diff below ?
> >
> > If this resolves the FIN issue, we can also revert 3f4ca5fafc08 ("tcp:
> > avoid the lookup process failing to get sk in ehash table").
> >
> > ---8<---
> > diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> > index 56f1286583d3..bb8e49a6e80f 100644
> > --- a/include/net/inet6_hashtables.h
> > +++ b/include/net/inet6_hashtables.h
> > @@ -48,6 +48,11 @@ struct sock *__inet6_lookup_established(struct net *net,
> > const u16 hnum, const int dif,
> > const int sdif);
> >
> > +struct sock *__inet6_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
> > + const struct in6_addr *saddr, const __be16 sport,
> > + const struct in6_addr *daddr, const u16 hnum,
> > + const int dif, const int sdif);
> > +
> > struct sock *inet6_lookup_listener(struct net *net,
> > struct inet_hashinfo *hashinfo,
> > struct sk_buff *skb, int doff,
> > @@ -70,9 +75,15 @@ static inline struct sock *__inet6_lookup(struct net *net,
> > struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
> > sport, daddr, hnum,
> > dif, sdif);
> > - *refcounted = true;
> > - if (sk)
> > +
> > + if (!sk)
> > + sk = __inet6_lookup_established_lock(net, hashinfo, saddr, sport,
> > + daddr, hnum, dif, sdif);
> > + if (sk) {
> > + *refcounted = true;
> > return sk;
> > + }
> > +
> > *refcounted = false;
> > return inet6_lookup_listener(net, hashinfo, skb, doff, saddr, sport,
> > daddr, hnum, dif, sdif);
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index 99bd823e97f6..ad97fec63d7a 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -379,6 +379,12 @@ struct sock *__inet_lookup_established(struct net *net,
> > const __be32 daddr, const u16 hnum,
> > const int dif, const int sdif);
> >
> > +struct sock *__inet_lookup_established_lock(struct net *net,
> > + struct inet_hashinfo *hashinfo,
> > + const __be32 saddr, const __be16 sport,
> > + const __be32 daddr, const u16 hnum,
> > + const int dif, const int sdif);
> > +
> > static inline struct sock *
> > inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
> > const __be32 saddr, const __be16 sport,
> > @@ -402,9 +408,14 @@ static inline struct sock *__inet_lookup(struct net *net,
> >
> > sk = __inet_lookup_established(net, hashinfo, saddr, sport,
> > daddr, hnum, dif, sdif);
> > - *refcounted = true;
> > - if (sk)
> > + if (!sk)
> > + sk = __inet_lookup_established_lock(net, hashinfo, saddr, sport,
> > + daddr, hnum, dif, sdif);
> > + if (sk) {
> > + *refcounted = true;
> > return sk;
> > + }
> > +
> > *refcounted = false;
> > return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
> > sport, daddr, hnum, dif, sdif);
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index e7391bf310a7..1eeadaf1c9f9 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -514,6 +514,41 @@ struct sock *__inet_lookup_established(struct net *net,
> > }
> > EXPORT_SYMBOL_GPL(__inet_lookup_established);
> >
> > +struct sock *__inet_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
> > + const __be32 saddr, const __be16 sport,
> > + const __be32 daddr, const u16 hnum,
> > + const int dif, const int sdif)
> > +{
> > + const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
> > + INET_ADDR_COOKIE(acookie, saddr, daddr);
> > + const struct hlist_nulls_node *node;
> > + struct inet_ehash_bucket *head;
> > + unsigned int hash;
> > + spinlock_t *lock;
> > + struct sock *sk;
> > +
> > + hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
> > + head = inet_ehash_bucket(hashinfo, hash);
> > + lock = inet_ehash_lockp(hashinfo, hash);
> > +
> > + spin_lock(lock);
> > + sk_nulls_for_each(sk, node, &head->chain) {
> > + if (sk->sk_hash != hash)
> > + continue;
> > +
> > + if (unlikely(!inet_match(net, sk, acookie, ports, dif, sdif)))
> > + continue;
> > +
> > + sock_hold(sk);
> > + spin_unlock(lock);
> > + return sk;
> > + }
> > + spin_unlock(lock);
>
> Here we need to keep the lock held, and release it later.
>
> > +
> > + return NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(__inet_lookup_established_lock);
> > +
> > /* called with local bh disabled */
> > static int __inet_check_established(struct inet_timewait_death_row *death_row,
> > struct sock *sk, __u16 lport,
> > diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> > index b64b49012655..1b2c971859c0 100644
> > --- a/net/ipv6/inet6_hashtables.c
> > +++ b/net/ipv6/inet6_hashtables.c
> > @@ -89,6 +89,40 @@ struct sock *__inet6_lookup_established(struct net *net,
> > }
> > EXPORT_SYMBOL(__inet6_lookup_established);
> >
> > +struct sock *__inet6_lookup_established_lock(struct net *net, struct inet_hashinfo *hashinfo,
> > + const struct in6_addr *saddr, const __be16 sport,
> > + const struct in6_addr *daddr, const u16 hnum,
> > + const int dif, const int sdif)
> > +{
> > + const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
> > + const struct hlist_nulls_node *node;
> > + struct inet_ehash_bucket *head;
> > + unsigned int hash;
> > + spinlock_t *lock;
> > + struct sock *sk;
> > +
> > + hash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
> > + head = inet_ehash_bucket(hashinfo, hash);
> > + lock = inet_ehash_lockp(hashinfo, hash);
> > +
> > + spin_lock(lock);
> > + sk_nulls_for_each(sk, node, &head->chain) {
> > + if (sk->sk_hash != hash)
> > + continue;
> > +
> > + if (unlikely(!inet6_match(net, sk, saddr, daddr, ports, dif, sdif)))
> > + continue;
> > +
> > + sock_hold(sk);
> > + spin_unlock(lock);
> > + return sk;
> > + }
> > + spin_unlock(lock);
>
> /* Same here. */
>
> > +
> > + return NULL;
> > +}
> > +EXPORT_SYMBOL(__inet6_lookup_established_lock);
> > +
> > static inline int compute_score(struct sock *sk, struct net *net,
> > const unsigned short hnum,
> > const struct in6_addr *daddr,
> > ---8<---
Powered by blists - more mailing lists