
Message-ID: <6253fb6b-9a53-484a-9be5-8facd46c051e@daynix.com>
Date: Sat, 18 Nov 2023 19:38:58 +0900
From: Akihiko Odaki <akihiko.odaki@...nix.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>,
 Jason Wang <jasowang@...hat.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
 Martin KaFai Lau <martin.lau@...ux.dev>, Song Liu <song@...nel.org>,
 Yonghong Song <yonghong.song@...ux.dev>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, Jonathan Corbet <corbet@....net>,
 Willem de Bruijn <willemdebruijn.kernel@...il.com>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 "Michael S. Tsirkin" <mst@...hat.com>, Xuan Zhuo
 <xuanzhuo@...ux.alibaba.com>, Mykola Lysenko <mykolal@...com>,
 Shuah Khan <shuah@...nel.org>, bpf <bpf@...r.kernel.org>,
 "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>,
 Network Development <netdev@...r.kernel.org>, kvm@...r.kernel.org,
 virtualization@...ts.linux-foundation.org,
 "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>,
 Yuri Benditovich <yuri.benditovich@...nix.com>,
 Andrew Melnychenko <andrew@...nix.com>
Subject: Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH

On 2023/10/18 4:19, Akihiko Odaki wrote:
> On 2023/10/18 4:03, Alexei Starovoitov wrote:
>> On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@...hat.com> wrote:
>>>
>>> On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov
>>> <alexei.starovoitov@...il.com> wrote:
>>>>
>>>> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki 
>>>> <akihiko.odaki@...nix.com> wrote:
>>>>>
>>>>> On 2023/10/16 1:07, Alexei Starovoitov wrote:
>>>>>> On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki 
>>>>>> <akihiko.odaki@...nix.com> wrote:
>>>>>>>
>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>> index 0448700890f7..298634556fab 100644
>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>> @@ -988,6 +988,7 @@ enum bpf_prog_type {
>>>>>>>           BPF_PROG_TYPE_SK_LOOKUP,
>>>>>>>           BPF_PROG_TYPE_SYSCALL, /* a program that can execute 
>>>>>>> syscalls */
>>>>>>>           BPF_PROG_TYPE_NETFILTER,
>>>>>>> +       BPF_PROG_TYPE_VNET_HASH,
>>>>>>
>>>>>> Sorry, we do not add new stable program types anymore.
>>>>>>
>>>>>>> @@ -6111,6 +6112,10 @@ struct __sk_buff {
>>>>>>>           __u8  tstamp_type;
>>>>>>>           __u32 :24;              /* Padding, future use. */
>>>>>>>           __u64 hwtstamp;
>>>>>>> +
>>>>>>> +       __u32 vnet_hash_value;
>>>>>>> +       __u16 vnet_hash_report;
>>>>>>> +       __u16 vnet_rss_queue;
>>>>>>>    };
>>>>>>
>>>>>> we also do not add anything to uapi __sk_buff.
>>>>>>
>>>>>>> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
>>>>>>> +       .get_func_proto         = sk_filter_func_proto,
>>>>>>> +       .is_valid_access        = sk_filter_is_valid_access,
>>>>>>> +       .convert_ctx_access     = bpf_convert_ctx_access,
>>>>>>> +       .gen_ld_abs             = bpf_gen_ld_abs,
>>>>>>> +};
>>>>>>
>>>>>> and we don't do ctx rewrites like this either.
>>>>>>
>>>>>> Please see how hid-bpf and cgroup rstat are hooking up bpf
>>>>>> in _unstable_ way.
>>>>>
>>>>> Can you describe what "stable" and "unstable" mean here? I'm new to 
>>>>> BPF
>>>>> and I'm worried if it may mean the interface stability.
>>>>>
>>>>> Let me describe the context. QEMU bundles an eBPF program that is used
>>>>> for the "eBPF steering program" feature of tun. Now I'm proposing to
>>>>> extend the feature to allow to return some values to the userspace and
>>>>> vhost_net. As such, the extension needs to be done in a way that 
>>>>> ensures
>>>>> interface stability.
>>>>
>>>> bpf is not an option then.
>>>> we do not add stable bpf program types or hooks any more.
>>>
>>> Does this mean eBPF could not be used for any new use cases other than
>>> the existing ones?
>>
>> It means that any new use of bpf has to be unstable for the time being.
> 
> Can you elaborate more about making new use unstable "for the time 
> being?" Is it a temporary situation? What is the rationale for that? 
> Such information will help devise a solution that is best for both of 
> the BPF and network subsystems.
> 
> I would also appreciate if you have some documentation or link to 
> relevant discussions on the mailing list. That will avoid having same 
> discussion you may already have done in the past.

Hi,

The discussion has been stuck for a month, but I'd still like to
continue figuring out the best way for the whole kernel to implement
this feature. Below I summarize the current situation and the question
that needs to be answered before pushing this forward:

The goal of this RFC is to allow reporting hash values calculated with
the eBPF steering program. Essentially, it only needs to report 4 bytes
from the kernel to the userspace.

Unfortunately, however, this is not acceptable to the BPF subsystem
because "stable" BPF is completely frozen these days. "Unstable"/kfunc
BPF is an alternative, but the eBPF program will be shipped with a
portable userspace program (QEMU)[1], so the lack of interface
stability is not tolerable.

Another option is to hardcode in the kernel the algorithm that was
conventionally implemented with the eBPF steering program[2]. This is
possible because the algorithm strictly follows the virtio-net
specification[3]. However, there are proposals to add different
algorithms to the specification[4], and hardcoding the algorithm in the
kernel will require adding more UAPIs and code each time such a
specification change happens, which is not good for tuntap.
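
For reference, here is a minimal userspace sketch of the kind of
algorithm in question. It assumes the hash is the Toeplitz hash commonly
used for RSS; the function name toeplitz_hash and its signature are made
up for illustration and are not taken from QEMU, the kernel, or the
specification text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative Toeplitz hash (hypothetical helper, not QEMU or kernel
 * code). For every set bit of the input, XOR in the 32-bit window of
 * the key that starts at that bit position. The key must be at least
 * 4 bytes longer than the input.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    uint32_t window, hash = 0;
    size_t i;
    int bit;

    assert(key_len >= data_len + 4);

    /* The window starts as the first 32 bits of the key. */
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (i = 0; i < data_len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window by one bit, pulling in the next key bit. */
            window <<= 1;
            if (key[i + 4] & (1u << bit))
                window |= 1;
        }
    }
    return hash;
}
```

In such a scheme the input would typically be the concatenated
addresses and ports of the packet in network byte order, and only the
resulting 32-bit value would have to reach the userspace.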

In short, the proposed feature requires making one of two compromises:

1. Compromise on the BPF side: Relax the "stable" BPF feature freeze
once and allow the eBPF steering program to report 4 more bytes to the
kernel.

2. Compromise on the tuntap side: Implement the algorithm in the kernel,
and abandon the capability to update the algorithm without changing the
kernel.

IMHO, it's better to compromise on the BPF side (option 1). We should
minimize the total UAPI changes in the whole kernel, and option 1 is
far superior in that sense.

Yet I have to note that such a compromise on the BPF side risks making
the "stable" BPF feature freeze fragile and letting other people
complain, e.g. "you allowed a change to stable BPF for this, so why do
you reject [some other request to change stable BPF]?" That is bad for
the BPF maintainers. (I can imagine that introducing and maintaining
widely different BPF interfaces is too much of a burden.) And, of
course, this requires approval from the BPF maintainers.

So I'd like to ask which of these compromises you think is worse.
Please also tell me if you have another idea.

Regards,
Akihiko Odaki

[1] https://qemu.readthedocs.io/en/v8.1.0/devel/ebpf_rss.html
[2] https://lore.kernel.org/all/20231008052101.144422-1-akihiko.odaki@daynix.com/
[3] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2400003
[4] https://lore.kernel.org/all/CACGkMEuBbGKssxNv5AfpaPpWQfk2BHR83rM5AHXN-YVMf2NvpQ@mail.gmail.com/