lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 21 Jul 2023 21:12:17 +0200
From: Maciej Żenczykowski <maze@...gle.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Stanislav Fomichev <sdf@...gle.com>, Eric Dumazet <edumazet@...gle.com>, 
	Linux NetDev <netdev@...r.kernel.org>, Jesper Dangaard Brouer <brouer@...hat.com>, 
	Pengtao He <hepengtao@...omi.com>, Willem Bruijn <willemb@...gle.com>, Xiao Ma <xiaom@...gle.com>, 
	Patrick Rohr <prohr@...gle.com>, Alexei Starovoitov <ast@...nel.org>
Subject: Re: Performance question: af_packet with bpf filter vs TX path skb_clone

On Fri, Jul 21, 2023 at 8:56 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> Maciej Żenczykowski wrote:
> > On Fri, Jul 21, 2023 at 8:18 PM Stanislav Fomichev <sdf@...gle.com> wrote:
> > >
> > > On Fri, Jul 21, 2023 at 11:14 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Fri, Jul 21, 2023 at 7:55 PM Maciej Żenczykowski <maze@...gle.com> wrote:
> > > > >
> > > > > I've been asked to review:
> > > > >   https://android-review.googlesource.com/c/platform/packages/modules/NetworkStack/+/2648779
> > > > >
> > > > > where it comes to light that in Android due to background debugging of
> > > > > connectivity problems
> > > > > (of which there are *plenty* due to various types of buggy [primarily]
> > > > > wifi networks)
> > > > > we have a permanent AF_PACKET, ETH_P_ALL socket with a cBPF filter:
> > > > >
> > > > >    arp or (ip and udp port 68) or (icmp6 and ip6[40] >= 133 and ip6[40] <= 136)
> > > > >
> > > > > ie. it catches ARP, IPv4 DHCP and IPv6 ND (NS/NA/RS/RA)
> > > > >
> > > > > If I'm reading the kernel code right this appears to cause skb_clone()
> > > > > to be called on *every* outgoing packet,
> > > > > even though most packets will not be accepted by the filter.
> > > > >
> > > > > (In the TX path the filter appears to get called *after* the clone,
> > > > > I think that's unlike the RX path where the filter is called first)
> > > > >
> > > > > Unfortunately, I don't think it's possible to eliminate the
> > > > > functionality this socket provides.
> > > > > We need to be able to log RX & TX of ARP/DHCP/ND for debugging /
> > > > > bugreports / etc.
> > > > > and they *really* should be in order wrt. to each other.
> > > > > (and yeah, that means last few minutes history when an issue happens,
> > > > > so not possible to simply enable it on demand)
> > > > >
> > > > > We could of course split the socket into 3 separate ones:
> > > > > - ETH_P_ARP
> > > > > - ETH_P_IP + cbpf udp dport=dhcp
> > > > > - ETH_P_IPV6 + cbpf icmpv6 type=NS/NA/RS/RA
> > > > >
> > > > > But I don't think that will help - I believe we'll still get
> > > > > skb_clone() for every outbound ipv4/ipv6 packet.
> > > > >
> > > > > I have some ideas for what could be done to avoid the clone (with
> > > > > existing kernel functionality)... but none of it is pretty...
> > > > > Anyone have any smart ideas?
> > > > >
> > > > > Perhaps a way to move the clone past the af_packet packet_rcv run_filter?
> > > > > Unfortunately packet_rcv() does a little bit of 'setup' before it
> > > > > calls the filter - so this may be hard.
> > > >
> > > >
> > > > dev_queue_xmit_nit() also does some 'setup':
> > > >
> > > > net_timestamp_set(skb2);  (This one could probably be moved into
> > > > af_packet, if packet is not dropped ?)
> > > > <sanitize mac, network, transport headers>
> > > >
> > > > >
> > > > > Or an 'extra' early pre-filter hook [prot_hook.prefilter()] that has
> > > > > very minimal
> > > > > functionality... like match 2 bytes at an offset into the packet?
> > > > > Maybe even not a hook at all, just adding a
> > > > > prot_hook.prefilter{1,2}_u64_{offset,mask,value}
> > > > > It doesn't have to be perfect, but if it could discard 99% of the
> > > > > packets we don't care about...
> > > > > (and leave filtering of the remaining 1% to the existing cbpf program)
> > > > > that would already be a huge win?
> > > >
> > > > Maybe if we can detect a cBPF filter does not access mac, network,
> > > > transport header,
> > > > we could run it earlier, before the clone().
> > > >
> > > > So we could add
> > > > prot_hook.filter_can_run_from_dev_queue_xmit_nit_before_the_clone
> > > >
> > > > Or maybe we can remove sanitization, because BPF should not do bad
> > > > things if these headers are garbage ?
> > >
> > > eBPF is already doing those sorts of checks, so maybe another option
> > > is to convert this filter to ebpf tc/egress program?
> >
> > Yeah, I've considered tc ingress/egress + bpf ring buffer.
> >
> > This unfortunately is a fair bit of pain to do:
> >
> > - it requires a new enough kernel (5.~8 ifirc), so we'd have to keep
> > the old code
> > around for 4.9 which we still have to support for a few (5?) more years.
>
> Wouldn't any kernel patch to net-next have the same issue?
>
> Another hack might be to use tc egress bpf or even u32 plus tc_mirred to
> redirect only interesting packets to an ifb virtual device, and only
> attach the packet socket there.

Not if it's an 'improvement' that would either work automatically
on any devices that take the kernel fix,

Or a feature that we could enable via some setsockopt() and just ignore failures
(for older kernels that don't support it)

Sure older kernels/devices wouldn't get the benefit, but whatever...
they wouldn't regress, just wouldn't improve.

> > - it needs to be done on a per device basis...
> > (devices have dynamic lifetimes on Android, and we don't necessarily
> > even know about all of them,
> > though perhaps it would be ok to not receive packets on those...)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ