netdev - Re: Performance question: af_packet with bpf filter vs TX path skb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANP3RGfFAjEDWbLAcmMcz63XUV6=djqZNsMikrqvA-i9K-4pAg@mail.gmail.com>
Date: Fri, 21 Jul 2023 20:24:48 +0200
From: Maciej Żenczykowski <maze@...gle.com>
To: Stanislav Fomichev <sdf@...gle.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Linux NetDev <netdev@...r.kernel.org>, 
	Jesper Dangaard Brouer <brouer@...hat.com>, Pengtao He <hepengtao@...omi.com>, 
	Willem Bruijn <willemb@...gle.com>, Xiao Ma <xiaom@...gle.com>, Patrick Rohr <prohr@...gle.com>, 
	Alexei Starovoitov <ast@...nel.org>
Subject: Re: Performance question: af_packet with bpf filter vs TX path skb_clone

On Fri, Jul 21, 2023 at 8:18 PM Stanislav Fomichev <sdf@...gle.com> wrote:
>
> On Fri, Jul 21, 2023 at 11:14 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Fri, Jul 21, 2023 at 7:55 PM Maciej Żenczykowski <maze@...gle.com> wrote:
> > >
> > > I've been asked to review:
> > >   https://android-review.googlesource.com/c/platform/packages/modules/NetworkStack/+/2648779
> > >
> > > where it comes to light that in Android due to background debugging of
> > > connectivity problems
> > > (of which there are *plenty* due to various types of buggy [primarily]
> > > wifi networks)
> > > we have a permanent AF_PACKET, ETH_P_ALL socket with a cBPF filter:
> > >
> > >    arp or (ip and udp port 68) or (icmp6 and ip6[40] >= 133 and ip6[40] <= 136)
> > >
> > > ie. it catches ARP, IPv4 DHCP and IPv6 ND (NS/NA/RS/RA)
> > >
> > > If I'm reading the kernel code right this appears to cause skb_clone()
> > > to be called on *every* outgoing packet,
> > > even though most packets will not be accepted by the filter.
> > >
> > > (In the TX path the filter appears to get called *after* the clone,
> > > I think that's unlike the RX path where the filter is called first)
> > >
> > > Unfortunately, I don't think it's possible to eliminate the
> > > functionality this socket provides.
> > > We need to be able to log RX & TX of ARP/DHCP/ND for debugging /
> > > bugreports / etc.
> > > and they *really* should be in order wrt. to each other.
> > > (and yeah, that means last few minutes history when an issue happens,
> > > so not possible to simply enable it on demand)
> > >
> > > We could of course split the socket into 3 separate ones:
> > > - ETH_P_ARP
> > > - ETH_P_IP + cbpf udp dport=dhcp
> > > - ETH_P_IPV6 + cbpf icmpv6 type=NS/NA/RS/RA
> > >
> > > But I don't think that will help - I believe we'll still get
> > > skb_clone() for every outbound ipv4/ipv6 packet.
> > >
> > > I have some ideas for what could be done to avoid the clone (with
> > > existing kernel functionality)... but none of it is pretty...
> > > Anyone have any smart ideas?
> > >
> > > Perhaps a way to move the clone past the af_packet packet_rcv run_filter?
> > > Unfortunately packet_rcv() does a little bit of 'setup' before it
> > > calls the filter - so this may be hard.
> >
> >
> > dev_queue_xmit_nit() also does some 'setup':
> >
> > net_timestamp_set(skb2);  (This one could probably be moved into
> > af_packet, if packet is not dropped ?)
> > <sanitize mac, network, transport headers>
> >
> > >
> > > Or an 'extra' early pre-filter hook [prot_hook.prefilter()] that has
> > > very minimal
> > > functionality... like match 2 bytes at an offset into the packet?
> > > Maybe even not a hook at all, just adding a
> > > prot_hook.prefilter{1,2}_u64_{offset,mask,value}
> > > It doesn't have to be perfect, but if it could discard 99% of the
> > > packets we don't care about...
> > > (and leave filtering of the remaining 1% to the existing cbpf program)
> > > that would already be a huge win?
> >
> > Maybe if we can detect a cBPF filter does not access mac, network,
> > transport header,
> > we could run it earlier, before the clone().
> >
> > So we could add
> > prot_hook.filter_can_run_from_dev_queue_xmit_nit_before_the_clone
> >
> > Or maybe we can remove sanitization, because BPF should not do bad
> > things if these headers are garbage ?
>
> eBPF is already doing those sorts of checks, so maybe another option
> is to convert this filter to ebpf tc/egress program?

Yeah, I've considered tc ingress/egress + bpf ring buffer.

This unfortunately is a fair bit of pain to do:

- it requires a new enough kernel (5.~8 ifirc), so we'd have to keep
the old code
around for 4.9 which we still have to support for a few (5?) more years.

- it needs to be done on a per device basis...
(devices have dynamic lifetimes on Android, and we don't necessarily
even know about all of them,
though perhaps it would be ok to not receive packets on those...)