netdev - Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CABWYdi010VpHHi6-PLyBB3F_btFggm7XLxstboCRBvBLdoKdmA@mail.gmail.com>
Date: Fri, 14 Jul 2023 16:38:40 -0700
From: Ivan Babrou <ivan@...udflare.com>
To: David Ahern <dsahern@...nel.org>
Cc: Jakub Kicinski <kuba@...nel.org>, Yan Zhai <yan@...udflare.com>, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, kernel-team@...udflare.com, 
	Eric Dumazet <edumazet@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	Paolo Abeni <pabeni@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue_drop

On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@...nel.org> wrote:
> > We can start a separate discussion to break it down by category if it
> > would help. Let me know what kind of information you would like us to
> > provide to help with that. I assume you're interested in kernel stacks
> > leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> > something else.
>
> stack traces would be helpful.

Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/

> > Even if I was only interested in one specific reason, I would still
> > have to arm the whole tracepoint and route a ton of skbs I'm not
> > interested in into my bpf code. This seems like a lot of overhead,
> > especially if I'm dropping some attack packets.
>
> you can add a filter on the tracepoint event to limit what is passed
> (although I have not tried the filter with an ebpf program - e.g.,
> reason != NOT_SPECIFIED).

Absolutely, but isn't there overhead to even do just that for every freed skb?

> > If you have an ebpf example that would help me extract the destination
> > port from an skb in kfree_skb, I'd be interested in taking a look and
> > trying to make it work.
>
> This is from 2020 and I forget which kernel version (pre-BTF), but it
> worked at that time and allowed userspace to summarize drop reasons by
> various network data (mac, L3 address, n-tuple, etc):
>
> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c

It doesn't seem to extract the L4 metadata (local port specifically),
which is what I'm after.

> > The need to extract the protocol level information in ebpf is only
> > making kfree_skb more expensive for the needs of catching rare cases
> > when we run out of buffer space (UDP) or listen queue (TCP). These two
> > cases are very common failure scenarios that people are interested in
> > catching with straightforward tracepoints that can give them the
> > needed information easily and cheaply.
> >
> > I sympathize with the desire to keep the number of tracepoints in
> > check, but I also feel like UDP buffer drops and TCP listen drops
> > tracepoints are very much justified to exist.
>
> sure, kfree_skb is like the raw_syscall tracepoint - it can be more than
> what you need for a specific problem, but it is also give you way more
> than you are thinking about today.

I really like the comparison to raw_syscall tracepoint. There are two flavors:

1. Catch-all: raw_syscalls:sys_enter, which is similar to skb:kfree_skb.
2. Specific tracepoints: syscalls:sys_enter_* for every syscall.

If you are interested in one rare syscall, you wouldn't attach to a
catch-all firehose and the filter for id in post. Understandably, we
probably can't have a separate skb:kfree_skb for every reason.
However, some of them are more useful than others and I believe that
tcp listen drops fall into that category.

We went through a similar exercise with audit subsystem, which in fact
always arms all syscalls even if you audit one of them:

* https://lore.kernel.org/audit/20230523181624.19932-1-ivan@cloudflare.com/T/#u

With pictures, if you're interested:

* https://mastodon.ivan.computer/@mastodon/110426498281603668

Nobody audits futex(), but if you audit execve(), all the rules run
for both. Some rules will run faster, but all of them will run. It's a
lot of overhead with millions of CPUs, which I'm trying to avoid (the
planet is hot as it is).

Ultimately my arguments for a separate tracepoint for tcp listen drops
are at the bottom of my reply to Jakub:

* https://lore.kernel.org/netdev/CABWYdi2BGi=iRCfLhmQCqO=1eaQ1WaCG7F9WsJrz-7==ocZidg@mail.gmail.com/