netdev - Re: Performance question: af_packet with bpf filter vs TX path skb

Message-ID: <CANP3RGfpY28TQmzr=yBAS9qt3Tq9=cmjmj8j_gWzau0nb8VaQQ@mail.gmail.com>
Date: Sat, 5 Aug 2023 12:09:08 +0200
From: Maciej Żenczykowski <maze@...gle.com>
To: Vincent Bernat <vincent@...nat.ch>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, Eric Dumazet <edumazet@...gle.com>, 
	Linux NetDev <netdev@...r.kernel.org>, Pengtao He <hepengtao@...omi.com>, 
	Willem Bruijn <willemb@...gle.com>, Stanislav Fomichev <sdf@...gle.com>, Xiao Ma <xiaom@...gle.com>, 
	Patrick Rohr <prohr@...gle.com>, Alexei Starovoitov <ast@...nel.org>, Dave Tucker <datucker@...hat.com>, 
	Marek Majkowski <marek@...udflare.com>
Subject: Re: Performance question: af_packet with bpf filter vs TX path skb_clone

On Sat, Aug 5, 2023 at 10:55 AM Vincent Bernat <vincent@...nat.ch> wrote:
> On 2023-08-03 10:46, Maciej Żenczykowski wrote:
> > I think a fair number of these can get by with non-ETH_P_ALL (for
> > example ETH_P_LLDP), or can use a different socket for RX (where you can
> > choose to not see your own TX packets) and transmit via ETH_P_NONE (btw.
> > that constant should really exist and be equal to 0)
>
> For lldpd, I was using ETH_P_LLDP in the past, but there was cases where
> packets are not received, notably when an interface is enslaved by an
> Open vSwitch. See:
> https://github.com/lldpd/lldpd/commit/8b50be7f61ad20ebae15372a509f7e778da2cc6f
>
> This may have been fixed, but this kind of differences between ETH_P_ALL
> and ETH_P_LLDP makes it difficult to trust ETH_P_LLDP to do the right
> thing as it will work for most people but a few edge cases may appear.

This *may* be fixed now - or may not - see what I wrote earlier,
as we (Google's host networking team for servers) ran into
(5+ years ago) somewhat similar problems with link local macs and
inactive bonding slaves...

However, it is still very much the case that ETH_P_ALL and ETH_P_X
hook in slightly different spots,
for example wrt. tc ingress bpf packet mangling...

Anyway, this lack of certainty is really making me want to add a:
  int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
  __be16 ethertype = htons(ETH_P_LLDP);
  setsockopt(fd, SOL_PACKET, PACKET_ETHERTYPE_FILTER, &ethertype, sizeof(ethertype));
optimization/hint.

By not being a bpf filter, this would be easy to process prior to the
skb_clone...

You could of course ignore the failure of this setsockopt (thus
supporting older kernels),
and *still* attach a 4-instruction cbpf filter on skb->protocol ==
htons(ETH_P_LLDP).

We could even declare that the api is a hint/optimization and not guaranteed to
fully filter things out...
For example we could use this hint to filter on TX but not on RX...
(ie. you still need the cbpf anyway to do guaranteed filtering just
like you did on older kernels).

This would also potentially fix the Android use case.
Split the socket into 3 sockets, attach PACKET_ETHERTYPE_FILTER of
appropriate type to each.
[though we'd want to go further and also add an extra hint: some u64
mask/value filter at packet/mac/net offset X]

*OR* we could try to not introduce an API for this at all, and instead
try to parse the first few instructions
of the cbpf program, detect some simple patterns, and use that to prefilter...

for example, if the cbpf filter begins with the 3 instructions
(writing from memory):
  LD H ABS SKF_AD_OFF + SKF_AD_PROTOCOL // ie. A := ntohs(skb->protocol)
  if (A == ETH_P_LLDP) jump +1 // ie. skip next instruction
  ret 0 // ie. reject

then we could automatically set this 'extra' hook filter to only grab
skb->protocol == ETH_P_LLDP...
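
To make the idea concrete, here is a small userspace sketch of such a
matcher for just the ethertype pattern. It is not existing kernel
code: the function name is made up, and a real implementation would
live next to the filter attach path and handle more patterns. It
only recognizes the exact 3-instruction prefix described above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>

/* Return true if the cbpf program starts with
 *   A := ntohs(skb->protocol); if (A == K) skip 1; ret 0
 * and store K (host byte order) in *ethertype.
 * Hypothetical helper, for illustration only. */
bool cbpf_ethertype_prefix(const struct sock_filter *f, size_t len,
                           uint16_t *ethertype)
{
    if (len < 3)
        return false;

    /* insn 0: load the protocol through the ancillary-data offset */
    if (f[0].code != (BPF_LD | BPF_H | BPF_ABS) ||
        f[0].k != (uint32_t)(SKF_AD_OFF + SKF_AD_PROTOCOL))
        return false;

    /* insn 1: on equality skip exactly one insn, otherwise fall through */
    if (f[1].code != (BPF_JMP | BPF_JEQ | BPF_K) ||
        f[1].jt != 1 || f[1].jf != 0 || f[1].k > 0xffff)
        return false;

    /* insn 2: the fall-through path must reject the packet */
    if (f[2].code != (BPF_RET | BPF_K) || f[2].k != 0)
        return false;

    *ethertype = (uint16_t)f[1].k;
    return true;
}
```

Anything that does not match simply gets no prefilter, so the full
cbpf program still decides what the socket sees.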

I think the most common cases could probably be fixed by some pattern
matching on the first ~10 cbpf instructions.
I'd envision:
- match on ethertype
- match on ipv4 / ipv6 protocol
- match on udp/udplite/tcp/sctp/dccp source and/or destination port
(the above would I think be enough for Android)
and maybe:
- match on src and/or dst ip address

[note: there are some annoyances wrt. IPv4 options (and potentially
IPv6 extension headers) and matching on ports]
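
The IPv4 part of that annoyance is easy to show in plain C: the port
is not at a fixed offset, you have to go through the IHL field first
(and non-first fragments have no ports at all). A minimal sketch,
with a made-up helper name, that assumes the buffer starts at the
IPv4 header and that the caller has already checked the protocol is
one of udp/udplite/tcp/sctp/dccp (all of which keep the destination
port at bytes 2-3 of their header):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Extract the transport destination port from a buffer that starts at
 * the IPv4 header. Hypothetical helper, for illustration only. */
bool ipv4_dst_port(const uint8_t *pkt, size_t len, uint16_t *dport)
{
    if (len < 20 || (pkt[0] >> 4) != 4)
        return false;

    /* IHL counts 32-bit words; anything above 5 means options follow. */
    size_t ihl = (size_t)(pkt[0] & 0x0f) * 4;
    if (ihl < 20 || len < ihl + 4)
        return false;

    /* A non-zero fragment offset means there is no transport header. */
    if ((((pkt[6] & 0x1f) << 8) | pkt[7]) != 0)
        return false;

    *dport = (uint16_t)((pkt[ihl + 2] << 8) | pkt[ihl + 3]);
    return true;
}
```

A simple mask/value hint at a fixed offset gets this wrong as soon as
options are present, which is why the port match needs special care.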

All of this would of course be pointless if we could get the bpf filter
running prior to the clone,
but that (at least to me) seems a *much* harder and open-ended problem.

But there are miracle workers among us :-)