Message-ID: <3a3695a1-367c-4868-b6e1-1190b927b8e7@soulik.info>
Date: Thu, 8 Aug 2024 02:54:12 +0800
From: Randy Li <ayaka@...lik.info>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: netdev@...r.kernel.org, jasowang@...hat.com, davem@...emloft.net,
edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net: tuntap: add ioctl() TUNGETQUEUEINDX to fetch queue
index
Hello Willem
On 2024/8/2 23:10, Willem de Bruijn wrote:
> Randy Li wrote:
>> On 2024/8/1 22:17, Willem de Bruijn wrote:
>>> Randy Li wrote:
>>>> On 2024/8/1 21:04, Willem de Bruijn wrote:
>>>>> Randy Li wrote:
>>>>>> On 2024/8/1 05:57, Willem de Bruijn wrote:
>>>>>>> nits:
>>>>>>>
>>>>>>> - INDX->INDEX. It's correct in the code
>>>>>>> - prefix networking patches with the target tree: PATCH net-next
>>>>>> I see.
>>>>>>> Randy Li wrote:
>>>>>>>> On 2024/7/31 22:12, Willem de Bruijn wrote:
>>>>>>>>> Randy Li wrote:
>>>>>>>>>> We need the queue index in qdisc mapping rule. There is no way to
>>>>>>>>>> fetch that.
>>>>>>>>> In which command exactly?
>>>>>>>> That is for sch_multiq, here is an example
>>>>>>>>
>>>>>>>> tc qdisc add dev tun0 root handle 1: multiq
>>>>>>>>
>>>>>>>> tc filter add dev tun0 parent 1: protocol ip prio 1 u32 match ip dst
>>>>>>>> 172.16.10.1 action skbedit queue_mapping 0
>>>>>>>> tc filter add dev tun0 parent 1: protocol ip prio 1 u32 match ip dst
>>>>>>>> 172.16.10.20 action skbedit queue_mapping 1
>>>>>>>>
>>>>>>>> tc filter add dev tun0 parent 1: protocol ip prio 1 u32 match ip dst
>>>>>>>> 172.16.10.10 action skbedit queue_mapping 2
>>>>>>> If using an IFF_MULTI_QUEUE tun device, packets are automatically
>>>>>>> load balanced across the multiple queues, in tun_select_queue.
>>>>>>>
>>>>>>> If you want more explicit queue selection than by rxhash, tun
>>>>>>> supports TUNSETSTEERINGEBPF.
>>>>>> I know about this eBPF option. But I am a newbie to eBPF and I haven't
>>>>>> figured out how to configure the eBPF program dynamically.
>>>>> Lack of experience with an existing interface is insufficient reason
>>>>> to introduce another interface, of course.
>>>> tc(8) is an old interface, but it doesn't have sufficient information
>>>> here to complete this job.
>>> tc is maintained.
>>>
>>>> I think eBPF doesn't work on all platforms? JIT doesn't sound like a
>>>> good solution for embedded platforms.
>>>>
>>>> Another problem is that some VPS providers don't offer a kernel new
>>>> enough to support eBPF; it is far easier to just patch an old kernel
>>>> with this.
>>> We don't add duplicative features because they are easier to
>>> cherry-pick to old kernels.
>> I was trying to say that a tc(8) or netlink solution sounds more suitable
>> for general deployment.
>>>> Anyway, I will look into it, though I will still send out a v2 of this
>>>> patch. I will figure out whether eBPF can solve all the problems here.
>>> Most importantly, why do you need a fixed mapping of IP address to
>>> queue? Can you explain why relying on the standard rx_hash based
>>> mapping is not sufficient for your workload?
>> Server
>>   |
>>   |------ tun subnet (e.g. 172.16.10.0/24) ------- peer A (172.16.10.1)
>>   |------ peer B (172.16.10.3)
>>   |------ peer C (172.16.10.20)
>>
>> I am not even sure rx_hash could work here: the server acts as a router
>> or gateway, and I don't know how to filter connections coming from the
>> external interface based on rx_hash. Besides, the VPN application does
>> not operate on the socket() itself.
>>
>> I think the real question is why I do the filtering in the kernel rather
>> than in userspace.
>>
>> It would be much easier to do the dispatch work in the kernel; I only
>> need to watch the established peers with the help of epoll(), and the
>> kernel could drop all the unwanted packets. Besides, if I did the
>> filter/dispatch work in userspace, every packet's data would have to be
>> copied to userspace first, even if its fate is decided by reading a few
>> bytes at a known offset. I think we can avoid that cost.
> A custom mapping function is exactly the purpose of TUNSETSTEERINGEBPF.
>
> Please take a look at that. It's a lot more elegant than going through
> userspace and then inserting individual tc skbedit filters.
I checked how this socket-filter program works; I think we still need this
patch series.
If I am right, the eBPF program here does not work like a regular socket
filter: its return value is the target queue index, not the number of
bytes of the sk_buff to keep.
Besides, according to
https://ebpf-docs.dylanreimerink.nl/linux/program-type/BPF_PROG_TYPE_SOCKET_FILTER/
I think the eBPF program here can modify neither the queue_mapping field
nor the hash field.
> See SKF_AD_QUEUE for classic BPF and __sk_buff queue_mapping for eBPF.
Is that a map of type BPF_MAP_TYPE_QUEUE?
Besides, I think the eBPF program set by TUNSETSTEERINGEBPF would NOT honor
queue_mapping.
If I want to drop packets for unwanted destinations, I think
TUNSETFILTEREBPF is what I need?
That would mean looking up the same mapping table twice; is there a
better way for the CPU cache?
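To make the question concrete, this is roughly the userspace side I have in
mind (the object, program and map names follow the sketch above and are only
an illustration; error handling trimmed). TUNSETSTEERINGEBPF takes a pointer
to the program fd:

/* attach_steer.c -- sketch only */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <linux/if_tun.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int attach_steering(int tun_fd)
{
	struct bpf_object *obj = bpf_object__open_file("steer.bpf.o", NULL);

	if (!obj || bpf_object__load(obj))
		return -1;

	int prog_fd = bpf_program__fd(
		bpf_object__find_program_by_name(obj, "tun_steer"));
	int map_fd = bpf_object__find_map_fd_by_name(obj, "peer_queue");

	/* e.g. steer traffic for peer C (172.16.10.20) to queue 1, as in
	 * the tc example above; the map can be updated at any time while
	 * the program stays attached */
	__u32 daddr = inet_addr("172.16.10.20");
	__u32 queue = 1;
	bpf_map_update_elem(map_fd, &daddr, &queue, BPF_ANY);

	/* hand the program fd to the tun device */
	if (ioctl(tun_fd, TUNSETSTEERINGEBPF, &prog_fd) < 0) {
		perror("TUNSETSTEERINGEBPF");
		return -1;
	}
	return 0;
}

With something like this, the per-peer queue mapping can be changed from
userspace by updating the map, without re-attaching the program.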