netdev - Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <286508a3-3067-456d-8bbf-176b00dcc0c6@daynix.com>
Date: Mon, 9 Oct 2023 05:46:06 +0900
From: Akihiko Odaki <akihiko.odaki@...nix.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Jason Wang <jasowang@...hat.com>, "Michael S. Tsirkin" <mst@...hat.com>,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
 virtualization@...ts.linux-foundation.org, linux-kselftest@...r.kernel.org,
 bpf@...r.kernel.org, davem@...emloft.net, kuba@...nel.org, ast@...nel.org,
 daniel@...earbox.net, andrii@...nel.org, kafai@...com,
 songliubraving@...com, yhs@...com, john.fastabend@...il.com,
 kpsingh@...nel.org, rdunlap@...radead.org, willemb@...gle.com,
 gustavoars@...nel.org, herbert@...dor.apana.org.au,
 steffen.klassert@...unet.com, nogikh@...gle.com, pablo@...filter.org,
 decui@...rosoft.com, jakub@...udflare.com, elver@...gle.com,
 pabeni@...hat.com, Yuri Benditovich <yuri.benditovich@...nix.com>
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 5:08, Willem de Bruijn wrote:
> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>>
>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>>>>
>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>> However, computing the hash after the queue was chosen defeats the
>>>> purpose of RSS.
>>>>
>>>> Another approach is to use eBPF steering program. This approach has
>>>> another downside: it cannot report the calculated hash due to the
>>>> restrictive nature of eBPF.
>>>>
>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>> program so that it will be able to report to the userspace, but it makes
>>>> little sense to allow to implement different hashing algorithms with
>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>> the specification.
>>>>
>>>> The hash value already stored in sk_buff is not used and computed
>>>> independently since it may have been computed in a way not conformant
>>>> with the specification.
>>>>
>>>> Signed-off-by: Akihiko Odaki <akihiko.odaki@...nix.com>
>>>> ---
>>>
>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>> +       .max_indirection_table_length =
>>>> +               TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>> +
>>>> +       .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>> +};
>>>
>>> No need to have explicit capabilities exchange like this? Tun either
>>> supports all or none.
>>
>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>
>> It is because the flow dissector does not support IPv6 extensions. The
>> specification is also vague, and does not tell how many TLVs should be
>> consumed at most when interpreting destination option header so I chose
>> to avoid adding code for these hash types to the flow dissector. I doubt
>> anyone will complain about it since nobody complains for Linux.
>>
>> I'm also adding this so that we can extend it later.
>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>> types may have other bits for new protocols in the future.
>>
>>>
>>>>           case TUNSETSTEERINGEBPF:
>>>> -               ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>> +               bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>> +               if (IS_ERR(bpf_ret))
>>>> +                       ret = PTR_ERR(bpf_ret);
>>>> +               else if (bpf_ret)
>>>> +                       tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>
>>> Don't make one feature disable another.
>>>
>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>> functions. If one is enabled the other call should fail, with EBUSY
>>> for instance.
>>>
>>>> +       case TUNSETVNETHASH:
>>>> +               len = sizeof(vnet_hash);
>>>> +               if (copy_from_user(&vnet_hash, argp, len)) {
>>>> +                       ret = -EFAULT;
>>>> +                       break;
>>>> +               }
>>>> +
>>>> +               if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>> +                    (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>> +                     !tun_is_little_endian(tun))) ||
>>>> +                    vnet_hash.indirection_table_mask >=
>>>> +                    TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>> +                       ret = -EINVAL;
>>>> +                       break;
>>>> +               }
>>>> +
>>>> +               argp = (u8 __user *)argp + len;
>>>> +               len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>> +               if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>> +                       ret = -EFAULT;
>>>> +                       break;
>>>> +               }
>>>> +
>>>> +               argp = (u8 __user *)argp + len;
>>>> +               len = virtio_net_hash_key_length(vnet_hash.types);
>>>> +
>>>> +               if (copy_from_user(vnet_hash_key, argp, len)) {
>>>> +                       ret = -EFAULT;
>>>> +                       break;
>>>> +               }
>>>
>>> Probably easier and less error-prone to define a fixed size control
>>> struct with the max indirection table size.
>>
>> I made its size variable because the indirection table and key may grow
>> in the future as I wrote above.
>>
>>>
>>> Btw: please trim the CC: list considerably on future patches.
>>
>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>> proposed.
> 
> To be clear: please don't just resubmit with that one change.
> 
> The skb and cb issues are quite fundamental issues that need to be resolved.
> 
> I'd like to understand why adjusting the existing BPF feature for this
> exact purpose cannot be amended to return the key it produced.

eBPF steering program is not designed for this particular problem in my 
understanding. It was introduced to derive hash values with an 
understanding of application-specific semantics of packets instead of 
generic IP/TCP/UDP semantics.

This problem is rather different in terms that the hash derivation is 
strictly defined by virtio-net. I don't think it makes sense to 
introduce the complexity of BPF when you always run the same code.

It can utilize the existing flow dissector and also make it easier to 
use for the userspace by implementing this in the kernel.

> 
> As you point out, the C flow dissector is insufficient. The BPF flow
> dissector does not have this problem. The same argument would go for
> the pre-existing BPF steering program.
It is possible to extend the C flow dissector just as it is possible to 
implement a BPF flow dissector. The more serious problem is that 
virtio-net specification (and Microsoft RSS it follows) does not tell 
how to implement IPv6 extension support.