netdev - Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABWYdi3VJU7HUxzKJBKgX9wF9GRvmA0TKVpjuHvJyz_EdpxZFA@mail.gmail.com>
Date: Thu, 13 Jul 2023 16:17:31 -0700
From: Ivan Babrou <ivan@...udflare.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Yan Zhai <yan@...udflare.com>, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, kernel-team@...udflare.com, 
	Eric Dumazet <edumazet@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	Paolo Abeni <pabeni@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>, David Ahern <dsahern@...nel.org>
Subject: Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue_drop

On Wed, Jul 12, 2023 at 10:42 AM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Wed, 12 Jul 2023 11:42:26 -0500 Yan Zhai wrote:
> >   The issue with kfree_skb is not that it fires too frequently (not in
> > the 6.x kernel now). Rather, it is unable to locate the socket info
> > when a SYN is dropped due to the accept queue being full. The sk is
> > stolen upon inet lookup, e.g. in tcp_v4_rcv. This makes it unable to
> > tell in kfree_skb which socket a SYN skb is targeting (when TPROXY or
> > socket lookup are used). A tracepoint with sk information will be more
> > useful to monitor accurately which service/socket is involved.
>
> No doubt that kfree_skb isn't going to solve all our needs, but I'd
> really like you to clean up the unnecessary callers on your systems
> first, before adding further tracepoints. That way we'll have a clear
> picture of which points can be solved by kfree_skb and where we need
> further work.

The existing UDP tracepoint was there for 12 years and it's a part of
what kernel exposes to userspace, so I don't think it's fair to remove
this and break its consumers. I think "do not break userspace" applies
here. The proposed TCP tracepoint mostly mirrors it, so I think it's
fair to have it.

I don't know why kfree_skb is called so much. I also don't agree with
Yan that it's not actually too much, because it's a lot (especially
compared with near zero for my proposed tracepoint). I can easily see
300-500k calls per second into it:

$ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
#           time             counts unit events
     1.000520165             10,108      skb:kfree_skb
     2.010494526             11,178      skb:kfree_skb
     3.075503743             10,770      skb:kfree_skb
     4.122814843             11,334      skb:kfree_skb
     5.128518432             12,020      skb:kfree_skb
     6.176504094             11,117      skb:kfree_skb
     7.201504214             12,753      skb:kfree_skb
     8.229523643             10,566      skb:kfree_skb
     9.326499044            365,239      skb:kfree_skb
    10.002106098            313,105      skb:kfree_skb
$ perf stat -I 1000 -a -e skb:kfree_skb -- sleep 10
#           time             counts unit events
     1.000767744             52,240      skb:kfree_skb
     2.069762695            508,310      skb:kfree_skb
     3.102763492            417,895      skb:kfree_skb
     4.142757608            385,981      skb:kfree_skb
     5.190759795            430,154      skb:kfree_skb
     6.243765384            405,707      skb:kfree_skb
     7.290818228            362,934      skb:kfree_skb
     8.297764298            336,702      skb:kfree_skb
     9.314287243            353,039      skb:kfree_skb
    10.002288423            251,414      skb:kfree_skb

Most of it is NOT_SPECIFIED (1s data from one CPU during a spike):

$ perf script | sed 's/.*skbaddr=//' | awk '{ print $NF }' | sort |
uniq -c | sort -n | tail
      1 TCP_CLOSE
      2 NO_SOCKET
      4 TCP_INVALID_SEQUENCE
      4 TCP_RESET
     13 TCP_OLD_DATA
     14 NETFILTER_DROP
   4594 NOT_SPECIFIED

We can start a separate discussion to break it down by category if it
would help. Let me know what kind of information you would like us to
provide to help with that. I assume you're interested in kernel stacks
leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
something else.

Even if I was only interested in one specific reason, I would still
have to arm the whole tracepoint and route a ton of skbs I'm not
interested in into my bpf code. This seems like a lot of overhead,
especially if I'm dropping some attack packets.

Perhaps a lot of extra NOT_SPECIFIED stuff can be fixed and removed
from kfree_skb. It's not something I can personally do as it requires
much deeper network code understanding than I possess. For TCP we'll
also have to add some extra reasons for kfree_skb, because currently
it's all NOT_SPECIFIED (no reason set in the accept path):

* https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_input.c#L6499
* https://elixir.bootlin.com/linux/v6.5-rc1/source/net/ipv4/tcp_ipv4.c#L1749

For UDP we already have SKB_DROP_REASON_SOCKET_RCVBUFF, so I tried my
best to implement what I wanted based on that. It's not very
approachable, as you'd have to extract the destination port yourself
from the raw skb. As Yan said, for TCP people often rely on skb->sk,
which is just not present when the incoming SYN is dropped. I failed
to find a good example of extracting a destination port that I could
replicate. So far I have just a per-reason breakdown working:

* https://github.com/cloudflare/ebpf_exporter/pull/233

If you have an ebpf example that would help me extract the destination
port from an skb in kfree_skb, I'd be interested in taking a look and
trying to make it work.

The need to extract the protocol level information in ebpf is only
making kfree_skb more expensive for the needs of catching rare cases
when we run out of buffer space (UDP) or listen queue (TCP). These two
cases are very common failure scenarios that people are interested in
catching with straightforward tracepoints that can give them the
needed information easily and cheaply.

I sympathize with the desire to keep the number of tracepoints in
check, but I also feel like UDP buffer drops and TCP listen drops
tracepoints are very much justified to exist.