Date: Tue, 28 May 2024 16:34:06 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com, 
	davem@...emloft.net, netdev@...r.kernel.org, 
	Jason Xing <kernelxing@...cent.com>, Yongming Liu <yomiliu@...cent.com>, 
	Wangzi Yong <curuwang@...cent.com>
Subject: Re: [PATCH net-next] tcp: introduce a new MIB for CLOSE-WAIT sockets

On Tue, May 28, 2024 at 3:36 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Tue, May 28, 2024 at 8:48 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > Hello Eric,
> >
> > On Tue, May 28, 2024 at 1:13 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Tue, May 28, 2024 at 4:12 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@...cent.com>
> > > >
> > > > CLOSE-WAIT is a relatively special state which "represents waiting for
> > > > a connection termination request from the local user" (RFC 793). An
> > > > unexpected number of CLOSE-WAIT sockets may indicate problems, such as
> > > > a user application mishandling the close() syscall.
> > > >
> > > > We want to track the total number of CLOSE-WAIT sockets quickly and
> > > > frequently instead of resorting to listing them all with:
> > > >
> > > >   netstat -anlp | grep CLOSE_WAIT
> > >
> > > This is horribly expensive.
> >
> > Yes.
> >
> > > Why ask for af_unix sockets and program names?
> > > You want to count some TCP sockets in a given state, right ?
> > > iproute2 interface (inet_diag) can do the filtering in the kernel,
> > > saving a lot of cycles.
> > >
> > > ss -t state close-wait
> >
> > Indeed, it is much better than netstat, but still not fast enough
> > if a lot of sockets have already been created. This command is suitable
> > for debugging, but not for frequent sampling, say, every 10 seconds.
> > Besides, RFC 1213 defines CurrEstab, which should also include
> > close-wait sockets, but we don't have this one.
>
> "we don't have this one."
> You mean we do not have CurrEstab ?
> That might be a user-space decision not to display it from the nstat
> command, in useless_number().
> (Not sure why. If someone thought it was useless, then CLOSE_WAIT
> count is even more useless...)

It has nothing to do with user applications.

Let me give one example: 'ss -s' shows the value of 'estab', which is
derived from the /proc/net/snmp file.

The corresponding CurrEstab counter, as implemented in the kernel in
tcp_set_state(), only counts ESTABLISHED sockets and does not include
close-wait sockets.

The reason why it does not count close-wait sockets as the RFC says is unknown.
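
For reference, the relevant accounting in tcp_set_state() (net/ipv4/tcp.c)
looks roughly like the sketch below; this is paraphrased, not an exact copy:

void tcp_set_state(struct sock *sk, int state)
{
	int oldstate = sk->sk_state;
	...
	switch (state) {
	case TCP_ESTABLISHED:
		/* Entering ESTABLISHED: bump the gauge. */
		if (oldstate != TCP_ESTABLISHED)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;
	...
	default:
		/* Leaving ESTABLISHED for any other state, including
		 * CLOSE-WAIT, drops the gauge right away.
		 */
		if (oldstate == TCP_ESTABLISHED)
			TCP_DEC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
	}
	...
}

So an ESTABLISHED -> CLOSE_WAIT transition falls into the default branch and
decrements CurrEstab immediately, which is why close-wait sockets never show
up in it, unlike what RFC 1213 describes.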

>
> > I have no intention to
> > change the CurrEstab in Linux because it has been used for a really
> > long time. So I chose to introduce a new counter in linux mib
> > definitions.
> >
> > >
> > > >
> > > > or something like this, which hurts performance, especially under
> > > > heavy load. That's the reason why I chose to introduce this new MIB
> > > > counter, like CurrEstab. It does help us diagnose/find issues in production.
> > > >
> > > > Besides, in the group of TCP_MIB_* defined by RFC 1213, TCP_MIB_CURRESTAB
> > > > should include both ESTABLISHED and CLOSE-WAIT sockets in theory:
> > > >
> > > >   "tcpCurrEstab OBJECT-TYPE
> > > >    ...
> > > >    The number of TCP connections for which the current state
> > > >    is either ESTABLISHED or CLOSE-WAIT."
> > > >
> > > > Apparently, at least since 2005, we have not counted CLOSE-WAIT sockets.
> > > > I think there is a need to count them separately to avoid polluting the
> > > > existing TCP_MIB_CURRESTAB counter.
> > > >
> > > > After this patch, we can see the counter by running 'cat /proc/net/netstat'
> > > > or 'nstat -s | grep CloseWait'
> > >
> > > I find this counter not very interesting.
> > > After a few days of uptime, let's say it is 52904523.
> > > What can you make of this value exactly ?
> > > How do you make any correlation ?
> >
> > There are two ways of implementing this counter:
> > 1) like the counters in 'linux mib definitions', we have to 'diff' the
> > counter to know how many close-wait sockets were generated in a
> > certain period.
>
> And what do you make of this raw information ?
>
> if it is 10000 or 20000 in a 10-second period, what conclusion do you get ?

Some buggy user applications fail to finish their sockets properly (by
calling close()), which is a classic and common problem in production.
When it happens, many strange things can follow.

If we cannot reproduce the issue easily, we have to rely on monitoring
history that has been recording the number of close-wait sockets over
time.

With this counter implemented, we can record/observe its normal range
all the time (a rough sampler sketch follows the list below). It can
help us:
1) know in advance if the counter changes drastically;
2) pay particular attention to it if users report issues.
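
Just to illustrate the monitoring side (this is not part of the patch, and
'TCPCurrCloseWait' below is only a placeholder for whatever name the final
patch exports in /proc/net/netstat), a trivial sampler could look like:

/* Minimal sampler sketch: read one TcpExt field from /proc/net/netstat
 * every 10 seconds.  The field name is a placeholder.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long read_counter(const char *name)
{
	char hdr[4096], val[4096];
	FILE *fp = fopen("/proc/net/netstat", "r");
	long ret = -1;

	if (!fp)
		return -1;

	/* The file comes in pairs of lines:
	 *   TcpExt: Name1 Name2 ...
	 *   TcpExt: val1 val2 ...
	 */
	while (fgets(hdr, sizeof(hdr), fp) && fgets(val, sizeof(val), fp)) {
		int idx = -1, i = 0;
		char *tok = strtok(hdr, " \n");	/* skip the "TcpExt:" prefix */

		while ((tok = strtok(NULL, " \n")) != NULL) {
			if (!strcmp(tok, name)) {
				idx = i;
				break;
			}
			i++;
		}
		if (idx < 0)
			continue;

		tok = strtok(val, " \n");	/* skip the prefix again */
		for (i = 0; tok && i <= idx; i++)
			tok = strtok(NULL, " \n");
		if (tok)
			ret = atol(tok);
		break;
	}
	fclose(fp);
	return ret;
}

int main(void)
{
	for (;;) {
		printf("CurrCloseWait: %ld\n", read_counter("TCPCurrCloseWait"));
		sleep(10);
	}
	return 0;
}

Whether the sampled value is a gauge (option 2 above) or needs to be diffed
between samples (option 1) only changes the last step of such a monitor.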

>
> Receiving FIN packets is a fact of life, I see no reason to worry.
>
> > 2) like what CurrEstab does: we would have to introduce a new helper
> > (for example, NET_DEC_STATS) to decrement the counter when a
> > close-wait socket changes state in tcp_set_state().
> >
> > After thinking more about your question, the latter is better because
> > it can easily reflect the current situation, right? What do you think?
>
> I think you are sending untested patches.

No.

Honestly, what I've done in our private kernel is different from this patch:
I added a new counter to the 'tcp mib definitions' (see the diff [1] below).
You know, it is not good to submit such a patch to the kernel community,
because the 'tcp mib definitions' have a long history and have not been
touched for more than 10 years. If I did so, I can imagine you would
question/challenge me. I can picture it. :(

So I decided to re-implement it in the 'linux mib definitions', which is
still being extended these days.

[1]:
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index adf5fd78dd50..27beab1002ce 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -144,6 +144,7 @@ enum
  TCP_MIB_INERRS, /* InErrs */
  TCP_MIB_OUTRSTS, /* OutRsts */
  TCP_MIB_CSUMERRORS, /* InCsumErrors */
+ TCP_MIB_CURRCLOSEWAIT, /* CurrCloseWait */
  __TCP_MIB_MAX
 };

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 6c4664c681ca..5d2a175a6d35 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -157,6 +157,7 @@ static const struct snmp_mib snmp4_tcp_list[] = {
  SNMP_MIB_ITEM("InErrs", TCP_MIB_INERRS),
  SNMP_MIB_ITEM("OutRsts", TCP_MIB_OUTRSTS),
  SNMP_MIB_ITEM("InCsumErrors", TCP_MIB_CSUMERRORS),
+ SNMP_MIB_ITEM("CurrCloseWait", TCP_MIB_CURRCLOSEWAIT),
  SNMP_MIB_SENTINEL
 };

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 681b54e1f3a6..f1bbbd477cda 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2659,6 +2659,10 @@ void tcp_set_state(struct sock *sk, int state)
  default:
  if (oldstate == TCP_ESTABLISHED)
  TCP_DEC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+ if (state == TCP_CLOSE_WAIT)
+ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRCLOSEWAIT);
+ if (oldstate == TCP_CLOSE_WAIT)
+ TCP_DEC_STATS(sock_net(sk), TCP_MIB_CURRCLOSEWAIT);
  }

  /* Change state AFTER socket is unhashed to avoid closed
--
2.37.3
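
For comparison, the 'linux mib definitions' variant (which is the direction
the posted patch takes) would roughly look like the sketch below. The names
are placeholders, not necessarily the ones in the patch, and NET_DEC_STATS()
is the new helper mentioned earlier that would have to be introduced:

/* include/uapi/linux/snmp.h, in the linux mib enum: */
	LINUX_MIB_TCPCURRCLOSEWAIT,		/* TCPCurrCloseWait (placeholder) */

/* net/ipv4/proc.c, snmp4_net_list[]: */
	SNMP_MIB_ITEM("TCPCurrCloseWait", LINUX_MIB_TCPCURRCLOSEWAIT),

/* net/ipv4/tcp.c, tcp_set_state(), in the same default branch as in the
 * diff above:
 */
	if (state == TCP_CLOSE_WAIT)
		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPCURRCLOSEWAIT);
	if (oldstate == TCP_CLOSE_WAIT)
		NET_DEC_STATS(sock_net(sk), LINUX_MIB_TCPCURRCLOSEWAIT);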

>
> I suggest you use your patches first, and send tests so that we can all see
> the intent and how this is supposed to work in the first place.

This patch is not proposed out of thin air.

Normally, we get two kinds of issue/bug reports:
1) issues we can easily reproduce, so they can be easily diagnosed;
2) issues that happened in the past, say, last night, and cannot be
traced easily afterwards. We have to rely on a monitor or an agent to
know what happened. What I'm doing here targets the latter case.

A few years ago, I worked at another company and cooperated with some
people at Google. We all found it hard to get to the root cause of this
kind of issue (like a sudden spike) as described above, so the only
thing we could do was trace/record more useful information to help
narrow the issue down. I'm just saying it's not effortless to deal with
all kinds of possible issues daily.

Finishing the work of counting close-wait sockets according to the RFC
is one more reason, but not the most important one.

Thanks for your patience, review and suggestions :)

Thanks,
Jason
