netdev - Re: Performance question: af_packet with bpf filter vs TX path skb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d1b1c0cc-c542-e626-9f35-8ad0dabb56b0@kernel.org>
Date: Wed, 2 Aug 2023 16:30:18 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>, Maciej Żenczykowski
 <maze@...gle.com>
Cc: Linux NetDev <netdev@...r.kernel.org>, Pengtao He <hepengtao@...omi.com>,
 Willem Bruijn <willemb@...gle.com>, Stanislav Fomichev <sdf@...gle.com>,
 Xiao Ma <xiaom@...gle.com>, Patrick Rohr <prohr@...gle.com>,
 Alexei Starovoitov <ast@...nel.org>, Dave Tucker <datucker@...hat.com>,
 Vincent Bernat <vincent@...nat.ch>, Marek Majkowski <marek@...udflare.com>
Subject: Re: Performance question: af_packet with bpf filter vs TX path
 skb_clone

Hi Maze,

Great to see you on the netdev list again. I want to kickstart this
thread again, as I think it is a general netstack issue that should be
solved (I was on vacation when thread was active).

On 21/07/2023 20.14, Eric Dumazet wrote:
> On Fri, Jul 21, 2023 at 7:55 PM Maciej Żenczykowski <maze@...gle.com> wrote:
>>
>> I've been asked to review:
>>    https://android-review.googlesource.com/c/platform/packages/modules/NetworkStack/+/2648779
>>

So, this is blocking TCP zero-copy send feature, according to link.

>> where it comes to light that in Android due to background debugging of
>> connectivity problems
>> (of which there are *plenty* due to various types of buggy [primarily]
>> wifi networks)
>> we have a permanent AF_PACKET, ETH_P_ALL socket with a cBPF filter:
>>

Many userspace programs/daemons have a permanent AF_PACKET (sock_raw 
"tcpdump") socket running with a cBPF filter attached.

Examples of programs:
  - DHCP clients and servers.
  - LLDP (Link Layer Discovery Protocol) daemons (Cc. Vincent)
  - Path MTU daemons (https://github.com/cloudflare/pmtud/) (Cc Marek)
  - etc.

>>     arp or (ip and udp port 68) or (icmp6 and ip6[40] >= 133 and ip6[40] <= 136)
>>
>> ie. it catches ARP, IPv4 DHCP and IPv6 ND (NS/NA/RS/RA)
>>
>> If I'm reading the kernel code right this appears to cause skb_clone()
>> to be called on *every* outgoing packet,
>> even though most packets will not be accepted by the filter.
>>

So, you are saying the issue only occurs for TX ?

Would it be an option to change your AF_PACKET socket to ignore outgoing 
traffic?

For some of the daemons (listed above) it might be possible to ignore
outgoing packets, and thus not enable the TX hook and thus avoid the skb
cloning.


>> (In the TX path the filter appears to get called *after* the clone,
>> I think that's unlike the RX path where the filter is called first)
>>

I don't fully understand what you are saying here.
Is the RX path affected or not?


>> Unfortunately, I don't think it's possible to eliminate the
>> functionality this socket provides.
>> We need to be able to log RX & TX of ARP/DHCP/ND for debugging /
>> bugreports / etc.
>> and they *really* should be in order wrt. to each other.
>> (and yeah, that means last few minutes history when an issue happens,
>> so not possible to simply enable it on demand)
>>
>> We could of course split the socket into 3 separate ones:
>> - ETH_P_ARP
>> - ETH_P_IP + cbpf udp dport=dhcp
>> - ETH_P_IPV6 + cbpf icmpv6 type=NS/NA/RS/RA
>>
>> But I don't think that will help - I believe we'll still get
>> skb_clone() for every outbound ipv4/ipv6 packet.
>>

I assume this would not help, as it would travel same code path, to
dev_queue_xmit_nit, right?

>> I have some ideas for what could be done to avoid the clone (with
>> existing kernel functionality)... but none of it is pretty...
>> Anyone have any smart ideas?
>>
>> Perhaps a way to move the clone past the af_packet packet_rcv run_filter?
>> Unfortunately packet_rcv() does a little bit of 'setup' before it
>> calls the filter - so this may be hard.
> 
> 
> dev_queue_xmit_nit() also does some 'setup':
> 
> net_timestamp_set(skb2);  (This one could probably be moved into
> af_packet, if packet is not dropped ?)
> <sanitize mac, network, transport headers>
> 

Regarding AF_PACKET socket to ignore outgoing, I think the
(ptype->ignore_outgoing) in top of dev_queue_xmit_nit() list-loop is
doing that trick and thus avoids the skb_clone().

>>
>> Or an 'extra' early pre-filter hook [prot_hook.prefilter()] that has
>> very minimal
>> functionality... like match 2 bytes at an offset into the packet?
>> Maybe even not a hook at all, just adding a
>> prot_hook.prefilter{1,2}_u64_{offset,mask,value}
>> It doesn't have to be perfect, but if it could discard 99% of the
>> packets we don't care about...
>> (and leave filtering of the remaining 1% to the existing cbpf program)
>> that would already be a huge win?
> 
> Maybe if we can detect a cBPF filter does not access mac, network,
> transport header,
> we could run it earlier, before the clone().
> 
> So we could add
> prot_hook.filter_can_run_from_dev_queue_xmit_nit_before_the_clone
> 
> Or maybe we can remove sanitization, because BPF should not do bad
> things if these headers are garbage ?
>

To Maze, have you looked at PoC coding what Eric suggested?
(Prework that allows us to to move filter)


>>
>> Thoughts?
>>

What are your plans for working on a solution for this?

--Jesper

Thread link[1] to people Cc'ed:
  [1] 
https://lore.kernel.org/all/CANP3RGfRA3yfom8GOxUBZD4sBxiU2dWn9TKdR50d55WgENrGnQ@mail.gmail.com/