Message-ID: <427fbc1d-6936-23ec-553f-21d5351fdbf7@nbd.name>
Date: Tue, 15 Mar 2022 08:25:13 +0100
From: Felix Fietkau <nbd@....name>
To: Jesper Dangaard Brouer <jbrouer@...hat.com>,
Daniel Borkmann <daniel@...earbox.net>,
Toke Høiland-Jørgensen <toke@...hat.com>,
"Jesper D. Brouer" <netdev@...uer.com>, netdev@...r.kernel.org,
bpf <bpf@...r.kernel.org>,
Lorenzo Bianconi <lorenzo.bianconi@...hat.com>
Cc: brouer@...hat.com, John Fastabend <john.fastabend@...il.com>,
Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH] net: xdp: allow user space to request a smaller packet
headroom requirement
On 15.03.22 07:50, Jesper Dangaard Brouer wrote:
> On 14/03/2022 23.43, Felix Fietkau wrote:
>>
>> On 14.03.22 23:20, Daniel Borkmann wrote:
>>> On 3/14/22 11:16 PM, Toke Høiland-Jørgensen wrote:
>>>> Felix Fietkau <nbd@....name> writes:
>>>>> On 14.03.22 21:39, Jesper D. Brouer wrote:
>>>>>> (Cc. BPF list and other XDP maintainers)
>>>>>> On 14/03/2022 11.22, Felix Fietkau wrote:
>>>>>>> Most ethernet drivers allocate a packet headroom of NET_SKB_PAD.
>>>>>>> Since it is rounded up to L1 cache size, it ends up being at
>>>>>>> least 64 bytes on the most common platforms.
>>>>>>> On most ethernet drivers, having a guaranteed headroom of 256
>>>>>>> bytes for XDP adds an extra forced pskb_expand_head call when
>>>>>>> enabling SKB XDP, which can be quite expensive.
>>>>>>> Many XDP programs need only very little headroom, so it can be
>>>>>>> beneficial to have a way to opt out of the 256-byte headroom
>>>>>>> requirement.
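For context, the generic-XDP path in netif_receive_generic_xdp()
(net/core/dev.c) forces the reallocation described above whenever an SKB
is cloned, non-linear, or short on headroom. Paraphrased sketch, not the
verbatim kernel source:

if (skb_cloned(skb) || skb_is_nonlinear(skb) ||
    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
	int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);

	/* reallocate the head and copy the data -- the cost at issue */
	if (pskb_expand_head(skb, hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0,
			     0, GFP_ATOMIC))
		goto do_drop;
	if (skb_linearize(skb))
		goto do_drop;
}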
>>>>>>
>>>>>> IMHO 64 bytes is too small.
>>>>>> We are using this area for struct xdp_frame and also for metadata
>>>>>> (XDP-hints). This will limit us from growing these structures for
>>>>>> the sake of generic-XDP.
>>>>>>
>>>>>> I'm fine with reducing this to 192 bytes, as most Intel drivers
>>>>>> have this headroom and have de facto established that this is
>>>>>> a valid XDP headroom, even for native-XDP.
>>>>>>
>>>>>> We could go as small as two cachelines (128 bytes), since if
>>>>>> xdp_frame and metadata each grow above a cacheline (64 bytes),
>>>>>> then we have done something wrong (performance-wise).
>>>>>
>>>>> Here's some background on why I chose 64 bytes: I'm currently
>>>>> implementing a userspace + XDP program to act as a generic fastpath
>>>>> to speed up network bridging.
>
> Cc'ing Lorenzo, as you mentioned you wanted to accelerate Linux bridging.
> We can avoid a lot of external FIB sync in userspace if we
> add some BPF helpers to do lookups in the bridge FIB table.
I talked to Lorenzo a lot about my approach already. If we do the FIB
lookup from BPF and add all the necessary checks to handle the
forwarding path, I believe the result is going to look a lot like the
existing bridge code, and it would not be significantly faster.
My approach involves creating a map that caches src-mac + dest-mac +
vlan "flows", along with the target port + vlan (see the sketch below).
User space then takes care of flushing those entries based on FDB, port
and config changes.
I already prototyped this with an in-kernel implementation directly in
the bridge code, and it led to a 6-10% CPU usage reduction.
Preliminary tests indicate that with my user space + XDP implementation
and the headroom modification, I'm getting comparable performance.
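To make that concrete, a minimal sketch of what the BPF side of such a
flow-cache map could look like (illustrative names and sizes, not the
actual code):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

/* Illustrative: src-mac + dest-mac + vlan -> target port + vlan */
struct flow_key {
	__u8  src_mac[ETH_ALEN];
	__u8  dst_mac[ETH_ALEN];
	__u16 vlan;
};

struct flow_target {
	__u32 ifindex;	/* target bridge port */
	__u16 vlan;	/* target vlan (0 = untagged) */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, struct flow_key);
	__type(value, struct flow_target);
} flow_cache SEC(".maps");

User space populates and flushes flow_cache; the BPF program only does
the lookup and the redirect.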
>>>> Any reason this can't run in the TC ingress hook instead? Generic XDP is
>>>> a bit of an odd duck, and I'm not a huge fan of special-casing it this
>>>> way...
>>>
>>> +1, would have been fine with a generic reduction down to 192 bytes
>>> (though not less than that), but 64 is a bit too little.
>
> +1
>
>>> Also curious on why not tc ingress instead?
>>
>> I chose XDP because of bpf_redirect_map, which doesn't seem to be
>> available to tc ingress classifier programs.
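For reference, the XDP side of that looks roughly like this, redirecting
through a devmap (minimal sketch, illustrative names):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);	/* ifindex of the target device */
} tx_ports SEC(".maps");

SEC("xdp")
int xdp_fastpath(struct xdp_md *ctx)
{
	/* Redirect via the devmap; frames going through here get the
	 * hidden TX bulking that Jesper describes below. */
	return bpf_redirect_map(&tx_ports, 0, 0);
}

char _license[] SEC("license") = "GPL";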
>
> TC has a "normal" bpf_redirect, which is slower than bpf_redirect_map.
> The secret to the performance boost from bpf_redirect_map is the hidden
> TX bulking layer. I have experimented with TC bulking, which showed a
> 30% performance boost on TC redirect to TX with a fixed 32-frame bulk size.
>
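For comparison, a TC ingress program using the plain helper is just
this (minimal sketch, ifindex illustrative):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tc_fastpath(struct __sk_buff *skb)
{
	/* Per-packet redirect to the egress of ifindex 4; no hidden
	 * TX bulking layer behind this helper. */
	return bpf_redirect(4, 0);
}

char _license[] SEC("license") = "GPL";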
> Notice that newer libbpf makes it easier to attach to the TC hook
> without shelling out to the 'tc' binary. See how to use it in this example:
>
> https://github.com/xdp-project/bpf-examples/blob/master/tc-policy/tc_txq_policy.c
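A minimal userspace sketch of that attach path using libbpf's TC APIs
(libbpf >= 0.6; error handling trimmed, names illustrative):

#include <net/if.h>
#include <bpf/libbpf.h>

static int attach_tc_ingress(const char *ifname, int prog_fd)
{
	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
			    .ifindex = if_nametoindex(ifname),
			    .attach_point = BPF_TC_INGRESS);
	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = prog_fd);
	int err;

	err = bpf_tc_hook_create(&hook);	/* creates clsact if missing */
	if (err && err != -EEXIST)
		return err;
	return bpf_tc_attach(&hook, &opts);
}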
>
>
>> When I started writing the code, I didn't know that generic XDP
>> performance would be bad on pretty much any ethernet/WLAN driver that
>> wasn't updated to support it.
>
> Yes, we kind of kept generic-XDP non-optimized, as the real tool for
> the job is TC-BPF: at this stage the SKB has already been
> allocated, and converting it back to an XDP representation is not
> going to be faster than TC-BPF.
>
> Maybe we should optimize generic-XDP a bit more, because I buy
> the argument that developers want to write one BPF program
> that works across device drivers, instead of having to handle
> both TC-BPF and XDP-BPF.
>
> Notice that generic-XDP doesn't actually do the bulking step, even
> when using bpf_redirect_map (that would be an obvious optimization).
>
> If we extend kernel/bpf/devmap.c with bulking for SKBs, then both
> TC-BPF and generic-XDP could share that code path and obtain
> the TX bulking performance boost via SKB lists + xmit_more.
Thanks for the information. I guess I'll rewrite my code to use TC-BPF now.
- Felix