netdev - Performance question: af_packet with bpf filter vs TX path skb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANP3RGfRA3yfom8GOxUBZD4sBxiU2dWn9TKdR50d55WgENrGnQ@mail.gmail.com>
Date: Fri, 21 Jul 2023 19:55:28 +0200
From: Maciej Żenczykowski <maze@...gle.com>
To: Linux NetDev <netdev@...r.kernel.org>, Jesper Dangaard Brouer <brouer@...hat.com>, 
	Eric Dumazet <edumazet@...gle.com>, Pengtao He <hepengtao@...omi.com>, 
	Willem Bruijn <willemb@...gle.com>, Stanislav Fomichev <sdf@...gle.com>, Xiao Ma <xiaom@...gle.com>, 
	Patrick Rohr <prohr@...gle.com>, Alexei Starovoitov <ast@...nel.org>
Subject: Performance question: af_packet with bpf filter vs TX path skb_clone

I've been asked to review:
  https://android-review.googlesource.com/c/platform/packages/modules/NetworkStack/+/2648779

where it comes to light that in Android due to background debugging of
connectivity problems
(of which there are *plenty* due to various types of buggy [primarily]
wifi networks)
we have a permanent AF_PACKET, ETH_P_ALL socket with a cBPF filter:

   arp or (ip and udp port 68) or (icmp6 and ip6[40] >= 133 and ip6[40] <= 136)

ie. it catches ARP, IPv4 DHCP and IPv6 ND (NS/NA/RS/RA)

If I'm reading the kernel code right this appears to cause skb_clone()
to be called on *every* outgoing packet,
even though most packets will not be accepted by the filter.

(In the TX path the filter appears to get called *after* the clone,
I think that's unlike the RX path where the filter is called first)

Unfortunately, I don't think it's possible to eliminate the
functionality this socket provides.
We need to be able to log RX & TX of ARP/DHCP/ND for debugging /
bugreports / etc.
and they *really* should be in order wrt. to each other.
(and yeah, that means last few minutes history when an issue happens,
so not possible to simply enable it on demand)

We could of course split the socket into 3 separate ones:
- ETH_P_ARP
- ETH_P_IP + cbpf udp dport=dhcp
- ETH_P_IPV6 + cbpf icmpv6 type=NS/NA/RS/RA

But I don't think that will help - I believe we'll still get
skb_clone() for every outbound ipv4/ipv6 packet.

I have some ideas for what could be done to avoid the clone (with
existing kernel functionality)... but none of it is pretty...
Anyone have any smart ideas?

Perhaps a way to move the clone past the af_packet packet_rcv run_filter?
Unfortunately packet_rcv() does a little bit of 'setup' before it
calls the filter - so this may be hard.

Or an 'extra' early pre-filter hook [prot_hook.prefilter()] that has
very minimal
functionality... like match 2 bytes at an offset into the packet?
Maybe even not a hook at all, just adding a
prot_hook.prefilter{1,2}_u64_{offset,mask,value}
It doesn't have to be perfect, but if it could discard 99% of the
packets we don't care about...
(and leave filtering of the remaining 1% to the existing cbpf program)
that would already be a huge win?

Thoughts?

Thanks,
Maciej