[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f437d2d6-e4a2-4539-bd30-f312bbf0eac8@daynix.com>
Date: Tue, 1 Oct 2024 14:54:29 +0900
From: Akihiko Odaki <akihiko.odaki@...nix.com>
To: Stephen Hemminger <stephen@...workplumber.org>
Cc: Jason Wang <jasowang@...hat.com>, Jonathan Corbet <corbet@....net>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>, Xuan Zhuo
<xuanzhuo@...ux.alibaba.com>, Shuah Khan <shuah@...nel.org>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org, kvm@...r.kernel.org,
virtualization@...ts.linux-foundation.org, linux-kselftest@...r.kernel.org,
Yuri Benditovich <yuri.benditovich@...nix.com>,
Andrew Melnychenko <andrew@...nix.com>, gur.stavi@...wei.com
Subject: Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature
On 2024/09/30 0:33, Stephen Hemminger wrote:
> On Sun, 29 Sep 2024 16:10:47 +0900
> Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>
>> On 2024/09/29 11:07, Jason Wang wrote:
>>> On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>>>>
>>>> On 2024/09/27 13:31, Jason Wang wrote:
>>>>> On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>>>>>>
>>>>>> On 2024/09/25 12:30, Jason Wang wrote:
>>>>>>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>>>>>>>>
>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>> purpose of RSS.
>>>>>>>>
>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>> restrictive nature of eBPF.
>>>>>>>>
>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>> thse challenges.
>>>>>>>>
>>>>>>>> An alternative solution is to extend the eBPF steering program so that it
>>>>>>>> will be able to report to the userspace, but it is based on context
>>>>>>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
>>>>>>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
>>>>>>>> and vhost_net).
>>>>>>>>
>>>>>>>
>>>>>>> I wonder if we could clone the skb and reuse some to store the hash,
>>>>>>> then the steering eBPF program can access these fields without
>>>>>>> introducing full RSS in the kernel?
>>>>>>
>>>>>> I don't get how cloning the skb can solve the issue.
>>>>>>
>>>>>> We can certainly implement Toeplitz function in the kernel or even with
>>>>>> tc-bpf to store a hash value that can be used for eBPF steering program
>>>>>> and virtio hash reporting. However we don't have a means of storing a
>>>>>> hash type, which is specific to virtio hash reporting and lacks a
>>>>>> corresponding skb field.
>>>>>
>>>>> I may miss something but looking at sk_filter_is_valid_access(). It
>>>>> looks to me we can make use of skb->cb[0..4]?
>>>>
>>>> I didn't opt to using cb. Below is the rationale:
>>>>
>>>> cb is for tail call so it means we reuse the field for a different
>>>> purpose. The context rewrite allows adding a field without increasing
>>>> the size of the underlying storage (the real sk_buff) so we should add a
>>>> new field instead of reusing an existing field to avoid confusion.
>>>>
>>>> We are however no longer allowed to add a new field. In my
>>>> understanding, this is because it is an UAPI, and eBPF maintainers found
>>>> it is difficult to maintain its stability.
>>>>
>>>> Reusing cb for hash reporting is a workaround to avoid having a new
>>>> field, but it does not solve the underlying problem (i.e., keeping eBPF
>>>> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
>>>> is a reasonable option to keep the API as stable as other virtualization
>>>> UAPIs while respecting the underlying intention of the context rewrite
>>>> feature freeze.
>>>
>>> Fair enough.
>>>
>>> Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
>>> via cls or other). It might worth to see if anything we miss here.
>>
>> Thanks for the information. I wonder why they used cls instead of
>> steering program. Perhaps it may be due to compatibility with macvtap
>> and ipvtap, which don't steering program.
>>
>> Their RSS implementation looks cleaner so I will improve my RSS
>> implementation accordingly.
>>
>
> DPDK needs to support flow rules. The specific case is where packets
> are classified by a flow, then RSS is done across a subset of the queues.
> The support for flow in TUN driver is more academic than useful,
> I fixed it for current BPF, but doubt anyone is using it really.
>
> A full steering program would be good, but would require much more
> complexity to take a general set of flow rules then communicate that
> to the steering program.
>
It reminded me of RSS context and flow filter. Some physical NICs
support to use a dedicated RSS context for packets matched with flow
filter, and virtio is also gaining corresponding features.
RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179
I considered about the possibility of supporting these features with tc
instead of adding ioctls to tuntap, but it seems not appropriate for
virtualization use case.
In a virtualization use case, tuntap is configured according to requests
of guests, and the code processing these requests need to have minimal
permissions for security. This goal is achieved by passing a file
descriptor that represents a tuntap from a privileged process (e.g.,
libvirt) to the process handling guest requests (e.g., QEMU).
However, tc is configured with rtnetlink, which does not seem to have an
interface to delegate a permission for one particular device to another
process.
For now I'll continue working on the current approach that is based on
ioctl and lacks RSS context and flow filter features. Eventually they
are also likely to require new ioctls if they are to be supported with
vhost_net.
Regards,
Akihiko Odaki
Powered by blists - more mailing lists