linux-kernel - Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue

Message-ID: <CABWYdi3L3HddnoVvhwAD0Mm57AfPDF0J2z42OV_hO34Ev0XNSw@mail.gmail.com>
Date:   Tue, 18 Jul 2023 15:10:09 -0700
From:   Ivan Babrou <ivan@...udflare.com>
To:     David Ahern <dsahern@...nel.org>
Cc:     Steven Rostedt <rostedt@...dmis.org>,
        Jakub Kicinski <kuba@...nel.org>,
        Yan Zhai <yan@...udflare.com>, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, kernel-team@...udflare.com,
        Eric Dumazet <edumazet@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        Paolo Abeni <pabeni@...hat.com>,
        Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: [RFC PATCH net-next] tcp: add a tracepoint for tcp_listen_queue_drop

On Fri, Jul 14, 2023 at 6:30 PM David Ahern <dsahern@...nel.org> wrote:
>
> On 7/14/23 5:38 PM, Ivan Babrou wrote:
> > On Fri, Jul 14, 2023 at 8:09 AM David Ahern <dsahern@...nel.org> wrote:
> >>> We can start a separate discussion to break it down by category if it
> >>> would help. Let me know what kind of information you would like us to
> >>> provide to help with that. I assume you're interested in kernel stacks
> >>> leading to kfree_skb with NOT_SPECIFIED reason, but maybe there's
> >>> something else.
> >>
> >> stack traces would be helpful.
> >
> > Here you go: https://lore.kernel.org/netdev/CABWYdi00L+O30Q=Zah28QwZ_5RU-xcxLFUK2Zj08A8MrLk9jzg@mail.gmail.com/
> >
> >>> Even if I was only interested in one specific reason, I would still
> >>> have to arm the whole tracepoint and route a ton of skbs I'm not
> >>> interested in into my bpf code. This seems like a lot of overhead,
> >>> especially if I'm dropping some attack packets.
> >>
> >> you can add a filter on the tracepoint event to limit what is passed
> >> (although I have not tried the filter with an ebpf program - e.g.,
> >> reason != NOT_SPECIFIED).
> >
> > Absolutely, but isn't there overhead to even do just that for every freed skb?
>
> There is some amount of overhead. If filters can be used with ebpf
> programs, then the differential cost is just the cycles for the filter
> which in this case is an integer compare. Should be low - maybe Steven
> has some data on the overhead?

I updated my benchmarks and added two dimensions:

* Empty probe that just returns immediately (simple and complex map
increments were already there)
* Tracepoint probe (fprobe and kprobe were already there)

The results are here:

* https://github.com/cloudflare/ebpf_exporter/tree/master/benchmark

It looks like we can expect an empty tracepoint probe to finish in
~15ns. At least that's what I see on my M1 laptop in a VM running
v6.5-rc1.

15ns x 400k calls/s = 6ms/s, or 0.6% of a single CPU, for the kfree_skb
tracepoint if the probe does nothing (which is the likely outcome).

I guess it's not as terrible as I expected, which is good news.
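
For anyone who wants to plug in their own numbers, the estimate above
can be sketched as a small C helper. The function names are made up for
illustration, and the 15ns and 400k/s inputs are just the figures from
this message, not universal constants:

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of one CPU consumed by a probe that costs probe_ns
 * nanoseconds per call and fires calls_per_sec times per second.
 * (Hypothetical helper, not part of any kernel or bpf API.) */
double probe_cpu_fraction(double probe_ns, double calls_per_sec)
{
    assert(probe_ns >= 0.0 && calls_per_sec >= 0.0);
    return probe_ns * 1e-9 * calls_per_sec;
}

/* Print the same estimate as ms of CPU time per wall-clock second
 * and as a percentage of a single CPU. */
void print_probe_overhead(double probe_ns, double calls_per_sec)
{
    double frac = probe_cpu_fraction(probe_ns, calls_per_sec);

    printf("%.1f ms/s, %.2f%% of one CPU\n", frac * 1e3, frac * 1e2);
}
```

probe_cpu_fraction(15.0, 400000.0) comes out at roughly 0.006, which is
the 6ms/s figure. The cost scales linearly with both the per-call time
and the call rate.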

> >>> If you have an ebpf example that would help me extract the destination
> >>> port from an skb in kfree_skb, I'd be interested in taking a look and
> >>> trying to make it work.
> >>
> >> This is from 2020 and I forget which kernel version (pre-BTF), but it
> >> worked at that time and allowed userspace to summarize drop reasons by
> >> various network data (mac, L3 address, n-tuple, etc):
> >>
> >> https://github.com/dsahern/bpf-progs/blob/master/ksrc/pktdrop.c
> >
> > It doesn't seem to extract the L4 metadata (local port specifically),
> > which is what I'm after.
>
> This program takes the path of copy headers to userspace and does the
> parsing there (there is a netmon program that uses that ebpf program
> which shows drops for varying perspectives). You can just as easily do
> the parsing in ebpf. Once you have the start of packet data, walk the
> protocols of interest -- e.g., see parse_pkt in flow.h.

I see, thanks. I want to do all the aggregation in the kernel, so I
took a stab at that. With a lot of trial and error I was able to come
up with the following:

* https://github.com/cloudflare/ebpf_exporter/pull/235

Some points from my experience doing that:

* It was a lot harder to get it working than the tracepoint I
proposed. There are few examples of parsing an skb in bpf, and none
of them do what I wanted to do.
* It is unclear whether this would work with vlan or other
encapsulation. Your code has special handling for vlan. As a user I
just want the L4 port.
* There's a lot more boilerplate code to get to L4 info. Accessing sk
is a lot easier.
* It's not very useful without adding the reasons that would
correspond to listen drops in TCP. I'm not sure if I'm up to the task,
but I can give it a shot.
* It's unclear how to detect which end of the socket is bound locally.
I want to know which ports that are listened on locally are
experiencing issues, ignoring sockets that connect elsewhere. E.g. I
care about my http server dropping packets, but not as much about curl
doing the same.
* UDP drops seem to work okay in my local testing: I can see
SKB_DROP_REASON_SOCKET_RCVBUFF by port.
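
To illustrate the boilerplate and vlan points above, here is a rough
sketch of the header walk needed to reach the L4 destination port from
the start of packet data. It is plain userspace C over a byte buffer,
not the bpf code from the PR: the function and macro names are made up,
it assumes an Ethernet frame with at most two vlan tags carrying IPv4
with TCP or UDP, and it ignores IPv6 and other encapsulation. A bpf
version would additionally have to read the bytes out of kernel memory
and satisfy the verifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the relevant standards; names are local to this sketch. */
#define SKETCH_ETH_IPV4   0x0800
#define SKETCH_ETH_8021Q  0x8100
#define SKETCH_ETH_8021AD 0x88A8
#define SKETCH_IP_TCP     6
#define SKETCH_IP_UDP     17

/* Read a big-endian (network order) 16-bit value. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return the L4 destination port of an Ethernet frame, or -1 if the
 * frame is truncated or is not unfragmented-or-first-fragment
 * IPv4 TCP/UDP. (Hypothetical helper for illustration only.) */
int l4_dest_port(const uint8_t *pkt, size_t len)
{
    size_t off = 12;            /* skip destination and source MAC */
    size_t ihl;
    uint16_t proto;
    int tags = 0;

    assert(pkt != NULL);
    if (len < off + 2)
        return -1;
    proto = be16(pkt + off);
    off += 2;

    /* Skip up to two vlan tags: 2 bytes TCI + 2 bytes inner ethertype. */
    while ((proto == SKETCH_ETH_8021Q || proto == SKETCH_ETH_8021AD) &&
           tags < 2) {
        if (len < off + 4)
            return -1;
        proto = be16(pkt + off + 2);
        off += 4;
        tags++;
    }
    if (proto != SKETCH_ETH_IPV4 || len < off + 20)
        return -1;
    if ((pkt[off] >> 4) != 4)
        return -1;

    ihl = (size_t)(pkt[off] & 0x0f) * 4;   /* IPv4 header length */
    if (ihl < 20 || len < off + ihl)
        return -1;
    if (be16(pkt + off + 6) & 0x1fff)      /* non-first fragment */
        return -1;
    if (pkt[off + 9] != SKETCH_IP_TCP && pkt[off + 9] != SKETCH_IP_UDP)
        return -1;

    off += ihl;
    if (len < off + 4)
        return -1;
    return be16(pkt + off + 2); /* TCP and UDP both keep dport here */
}
```

Note that this only yields the destination port of the packet; it says
nothing about whether that port belongs to a local listener, which is
the part that is easy with sk and hard with just the headers.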

As a reminder, the code for my tracepoint is here:

* https://github.com/cloudflare/ebpf_exporter/pull/221

It's a lot simpler. I still feel that it's justified to exist.

Hope this helps.