Message-ID: <427fbc1d-6936-23ec-553f-21d5351fdbf7@nbd.name>
Date: Tue, 15 Mar 2022 08:25:13 +0100
From: Felix Fietkau <nbd@....name>
To: Jesper Dangaard Brouer <jbrouer@...hat.com>,
Daniel Borkmann <daniel@...earbox.net>,
Toke Høiland-Jørgensen <toke@...hat.com>,
"Jesper D. Brouer" <netdev@...uer.com>, netdev@...r.kernel.org,
bpf <bpf@...r.kernel.org>,
Lorenzo Bianconi <lorenzo.bianconi@...hat.com>
Cc: brouer@...hat.com, John Fastabend <john.fastabend@...il.com>,
Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH] net: xdp: allow user space to request a smaller packet
headroom requirement
On 15.03.22 07:50, Jesper Dangaard Brouer wrote:
> On 14/03/2022 23.43, Felix Fietkau wrote:
>>
>> On 14.03.22 23:20, Daniel Borkmann wrote:
>>> On 3/14/22 11:16 PM, Toke Høiland-Jørgensen wrote:
>>>> Felix Fietkau <nbd@....name> writes:
>>>>> On 14.03.22 21:39, Jesper D. Brouer wrote:
>>>>>> (Cc. BPF list and other XDP maintainers)
>>>>>> On 14/03/2022 11.22, Felix Fietkau wrote:
>>>>>>> Most ethernet drivers allocate a packet headroom of NET_SKB_PAD.
>>>>>>> Since it is rounded up to L1 cache size, it ends up being at
>>>>>>> least 64 bytes on the most common platforms.
>>>>>>> On most ethernet drivers, having a guaranteed headroom of 256
>>>>>>> bytes for XDP adds an extra forced pskb_expand_head call when
>>>>>>> enabling SKB XDP, which can be quite expensive.
>>>>>>> Many XDP programs need only very little headroom, so it can be
>>>>>>> beneficial to have a way to opt out of the 256-byte headroom
>>>>>>> requirement.
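For context, the generic-XDP path in netif_receive_generic_xdp()
(net/core/dev.c) forces the reallocation described above whenever an SKB
is cloned, non-linear, or short on headroom. Paraphrased sketch, not the
verbatim kernel source:

if (skb_cloned(skb) || skb_is_nonlinear(skb) ||
    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
	int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);

	/* reallocate the head and copy the data -- the cost at issue */
	if (pskb_expand_head(skb, hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0,
			     0, GFP_ATOMIC))
		goto do_drop;
	if (skb_linearize(skb))
		goto do_drop;
}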
>>>>>>
>>>>>> IMHO 64 bytes is too small.
>>>>>> We are using this area for struct xdp_frame and also for metadata
>>>>>> (XDP-hints). This will limit us from growing these structures for
>>>>>> the sake of generic-XDP.
>>>>>>
>>>>>> I'm fine with reducing this to 192 bytes, as most Intel drivers
>>>>>> have this headroom and have de facto established that this is
>>>>>> a valid XDP headroom, even for native-XDP.
>>>>>>
>>>>>> We could go as small as two cachelines (128 bytes), since if
>>>>>> xdp_frame and metadata each grow above a cacheline (64 bytes),
>>>>>> then we have done something wrong (performance-wise).
>>>>>
>>>>> Here's some background on why I chose 64 bytes: I'm currently
>>>>> implementing a userspace + XDP program to act as a generic fastpath
>>>>> to speed up network bridging.
>
> Cc'ing Lorenzo, as you mentioned you wanted to accelerate Linux bridging.
> We can avoid a lot of external FIB sync in userspace if we
> add some BPF helpers to do lookups in the bridge FIB table.
I talked to Lorenzo a lot about my approach already. If we do the FIB
lookup from BPF and add all the necessary checks to handle the
forwarding path, I believe the result is going to look a lot like the
existing bridge code, and it would not be significantly faster.
My approach involves creating a map that caches src-mac + dest-mac +
vlan "flows", along with the target port + vlan (see the sketch below).
User space then takes care of flushing those entries based on FDB, port
and config changes.
I already prototyped this with an in-kernel implementation directly in
the bridge code, and it led to a 6-10% CPU usage reduction.
Preliminary tests indicate that with my user space + XDP implementation
and the headroom modification, I'm getting comparable performance.
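To make that concrete, a minimal sketch of what the BPF side of such a
flow-cache map could look like (illustrative names and sizes, not the
actual code):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

/* Illustrative: src-mac + dest-mac + vlan -> target port + vlan */
struct flow_key {
	__u8  src_mac[ETH_ALEN];
	__u8  dst_mac[ETH_ALEN];
	__u16 vlan;
};

struct flow_target {
	__u32 ifindex;	/* target bridge port */
	__u16 vlan;	/* target vlan (0 = untagged) */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, struct flow_key);
	__type(value, struct flow_target);
} flow_cache SEC(".maps");

User space populates and flushes flow_cache; the BPF program only does
the lookup and the redirect.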
>>>> Any reason this can't run in the TC ingress hook instead? Generic XDP is
>>>> a bit of an odd duck, and I'm not a huge fan of special-casing it this
>>>> way...
>>>
>>> +1, would have been fine with a generic reduction down to 192 bytes
>>> (though not less than that), but 64 is a bit too little.
>
> +1
>
>>> Also curious on why not tc ingress instead?
>>
>> I chose XDP because of bpf_redirect_map, which doesn't seem to be
>> available to tc ingress classifier programs.
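For reference, the XDP side of that looks roughly like this, redirecting
through a devmap (minimal sketch, illustrative names):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);	/* ifindex of the target device */
} tx_ports SEC(".maps");

SEC("xdp")
int xdp_fastpath(struct xdp_md *ctx)
{
	/* Redirect via the devmap; frames going through here get the
	 * hidden TX bulking that Jesper describes below. */
	return bpf_redirect_map(&tx_ports, 0, 0);
}

char _license[] SEC("license") = "GPL";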
>
> TC has a "normal" bpf_redirect, which is slower than bpf_redirect_map.
> The secret to the performance boost from bpf_redirect_map is the hidden
> TX bulking layer. I have experimented with TC bulking, which showed a
> 30% performance boost on TC redirect to TX with a fixed 32-frame bulk size.
>
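For comparison, a TC ingress program using the plain helper is just
this (minimal sketch, ifindex illustrative):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tc_fastpath(struct __sk_buff *skb)
{
	/* Per-packet redirect to the egress of ifindex 4; no hidden
	 * TX bulking layer behind this helper. */
	return bpf_redirect(4, 0);
}

char _license[] SEC("license") = "GPL";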
> Notice that newer libbpf makes it easier to attach to the TC hook
> without shelling out to the 'tc' binary. See how to use it in this example:
>
> https://github.com/xdp-project/bpf-examples/blob/master/tc-policy/tc_txq_policy.c
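A minimal userspace sketch of that attach path using libbpf's TC APIs
(libbpf >= 0.6; error handling trimmed, names illustrative):

#include <net/if.h>
#include <bpf/libbpf.h>

static int attach_tc_ingress(const char *ifname, int prog_fd)
{
	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
			    .ifindex = if_nametoindex(ifname),
			    .attach_point = BPF_TC_INGRESS);
	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = prog_fd);
	int err;

	err = bpf_tc_hook_create(&hook);	/* creates clsact if missing */
	if (err && err != -EEXIST)
		return err;
	return bpf_tc_attach(&hook, &opts);
}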
>
>
>> When I started writing the code, I didn't know that generic XDP
>> performance would be bad on pretty much any ethernet/WLAN driver that
>> wasn't updated to support it.
>
> Yes, we kind of kept generic-XDP non-optimized, as the real tool for
> the job is TC-BPF: at this stage the SKB has already been
> allocated, and converting it back to an XDP representation is not
> going to be faster than TC-BPF.
>
> Maybe we should optimize generic-XDP a bit more, because I buy
> the argument that developers want to write one BPF program
> that works across device drivers, instead of having to handle
> both TC-BPF and XDP-BPF.
>
> Notice that generic-XDP doesn't actually do the bulking step, even
> when using bpf_redirect_map (that would be an obvious optimization).
>
> If we extend kernel/bpf/devmap.c with bulking for SKBs, then both
> TC-BPF and generic-XDP could share that code path and obtain
> the TX bulking performance boost via SKB lists + xmit_more.
Thanks for the information. I guess I'll rewrite my code to use TC-BPF now.
- Felix