Message-ID: <f7tcyaisqu5.fsf@redhat.com>
Date: Wed, 02 Jul 2025 08:46:26 -0400
From: Aaron Conole <aconole@...hat.com>
To: Ilya Maximets <i.maximets@....org>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>, Eric
Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo
Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
dev@...nvswitch.org, linux-kernel@...r.kernel.org, Eelco Chaudron
<echaudro@...hat.com>
Subject: Re: [PATCH net-next] net: openvswitch: allow providing upcall pid
for the 'execute' command
Ilya Maximets <i.maximets@....org> writes:
> On 7/2/25 2:14 PM, Aaron Conole wrote:
>> Ilya Maximets <i.maximets@....org> writes:
>>
>>> When a packet enters the OVS datapath and there is no flow to handle it,
>>> the packet goes to userspace through a MISS upcall. With the per-CPU
>>> upcall dispatch mechanism, we use the current CPU id to select the
>>> Netlink PID on which to send this packet. This allows us to send
>>> packets from the same traffic flow through the same handler.
>>>
>>> The handler will process the packet, install the required flow into the
>>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>>
>>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>>> recirculation action that will pass the (likely modified) packet
>>> through the flow lookup again. And if the flow is not found, the
>>> packet will be sent to userspace again through another MISS upcall.
>>>
>>> However, the handler thread in userspace is likely running on a
>>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>>> in the syscall context of that thread. So, when the time comes to
>>> send the packet through another upcall, the per-CPU dispatch will
>>> choose a different Netlink PID, and this packet will end up processed
>>> by a different handler thread on a different CPU.
>>
>> Just wondering, but why can't we choose the existing core's handler
>> when running packet_cmd_execute? For example, when looking into the
>> per-cpu table we know what the current core is, so can we just queue to
>> that one? I actually thought that's what the PER_CPU dispatch mode was
>> supposed to do.
>
> This is exactly how it works today, and that is the problem: our
> userspace handler is running on a different CPU, so the 'current CPU'
> during packet_cmd_execute is different from the one where the kernel
> was processing the original upcall.
>
>> Or is it that we want to make sure we keep the
>> association between the skbuff for re-injection always?
>
> We want the same packet to be enqueued to the same upcall socket after
> each recirculation, so it gets handled by the same userspace thread.
>
>>
>>> The process continues as long as there are new recirculations, each
>>> time the packet goes to a different handler thread before it is sent
>>> out of the OVS datapath to the destination port. In real setups the
>>> number of recirculations can go up to 4 or 5, sometimes more.
>>
>> Is it because the userspace handler threads are being rescheduled across
>> CPUs?
>
> Yes. Userspace handlers are not pinned to a specific core in most cases,
> so they will be running on different CPUs and will float around.
>
>> Do we still see this behavior if we pinned each handler thread to
>> a specific CPU rather than letting the scheduler make the decision?
>
> If you pin each userspace thread to a core that is specified in the
> PCPU_UPCALL_PIDS for the socket that it is listening on, then the
> problem will go away, as the packet_cmd_execute syscall will be executed
> on the same core where the kernel received the original packet.
> However, that's not possible in many cases (reserved CPUs, different
> CPU affinity for IRQs and userspace applications, etc.) and not
> desirable, as it may impact the performance of the system, because the
> kernel and userspace will compete for the same core.
Just wanted to make sure I was understanding the problem. Okay, this
makes sense.
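For completeness, pinning a handler thread to the CPU that backs its
upcall socket would look roughly like the sketch below (hypothetical
helper; OVS does not do this by default, and as noted above it is often
not possible or desirable):

    /*
     * Hypothetical sketch: pin a handler thread to a given CPU so that
     * packet_cmd_execute runs on the same core that received the
     * original upcall.  The cost is that kernel and userspace then
     * compete for that core.
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_handler_to_cpu(pthread_t thread, int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }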
>>
>>> There is always a chance to re-order packets while processing upcalls,
>>> because userspace will first install the flow and then re-inject the
>>> original packet. So, there is a race window when the flow is already
>>> installed and the second packet can match it and be forwarded to the
>>> destination before the first packet is re-injected. But the fact that
>>> packets are going through multiple upcalls handled by different
>>> userspace threads makes the reordering noticeably more likely, because
>>> we not only have a race between the kernel and a userspace handler
>>> (which is hard to avoid), but also between multiple userspace handlers.
>>>
>>> For example, let's assume that 10 packets got enqueued through a MISS
>>> upcall for handler-1. It will start processing them, install the flow
>>> into the kernel and start re-injecting the packets, from where they
>>> will go through another MISS to handler-2. Handler-2 will install the
>>> flow into the kernel and start re-injecting the packets. Meanwhile, as
>>> handler-1 continues to re-inject the last of the 10 packets, they will
>>> hit the flow installed by handler-2 and be forwarded without going to
>>> handler-2, while handler-2 is still re-injecting the first of these 10
>>> packets. Given multiple recirculations and misses, these 10 packets
>>> may end up completely mixed up on the output from the datapath.
>>>
>>> Let's allow userspace to specify on which Netlink PID the packets
>>> should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
>>> This makes it possible to ensure that all the packets are processed
>>> by the same handler thread in userspace even when they are upcalled
>>> multiple times in the process. Packets will remain in order since
>>> they will be enqueued to the same socket and re-injected in the same
>>> order. This doesn't eliminate re-ordering as stated above, since we
>>> still have a race between the kernel and the userspace thread, but it
>>> does eliminate races between multiple userspace threads.
>>>
>>> Userspace knows the PID of the socket on which the original upcall is
>>> received, so there is no need to send it up from the kernel.
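As a sketch of the userspace side under this proposal, re-injecting a
packet and passing the original upcall PID along might look like the
following. This assumes the new attribute is exposed as
OVS_PACKET_ATTR_UPCALL_PID (the exact name is this patch's choice) and
uses libnl-genl for brevity; everything else is illustrative:

    /*
     * Illustrative only: build an OVS_PACKET_CMD_EXECUTE request and
     * tell the kernel which Netlink PID to use for further MISS upcalls
     * caused by recirculation.  Error handling is omitted.
     */
    #include <stdint.h>
    #include <stddef.h>
    #include <linux/openvswitch.h>
    #include <netlink/netlink.h>
    #include <netlink/msg.h>
    #include <netlink/attr.h>
    #include <netlink/genl/genl.h>

    static int execute_with_upcall_pid(struct nl_sock *sk, int packet_family,
                                       int dp_ifindex,
                                       const void *pkt, size_t pkt_len,
                                       const void *key, size_t key_len,
                                       const void *acts, size_t acts_len,
                                       uint32_t upcall_pid)
    {
        struct nl_msg *msg = nlmsg_alloc();
        struct ovs_header *hdr;
        int err;

        hdr = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, packet_family,
                          sizeof(*hdr), 0, OVS_PACKET_CMD_EXECUTE,
                          OVS_PACKET_VERSION);
        hdr->dp_ifindex = dp_ifindex;

        nla_put(msg, OVS_PACKET_ATTR_PACKET, pkt_len, pkt);
        nla_put(msg, OVS_PACKET_ATTR_KEY, key_len, key);
        nla_put(msg, OVS_PACKET_ATTR_ACTIONS, acts_len, acts);
        /* New attribute from this patch (name assumed): the PID of the
         * socket the original upcall arrived on. */
        nla_put_u32(msg, OVS_PACKET_ATTR_UPCALL_PID, upcall_pid);

        err = nl_send_auto(sk, msg);
        nlmsg_free(msg);
        return err;
    }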
>>>
>>> The solution requires storing the value somewhere for the duration of
>>> the packet processing. There are two potential places for this: our
>>> skb extension or the per-CPU storage. It's not clear which is better,
>>> so just follow the currently used scheme of storing this kind of
>>> thing along with the skb.
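Concretely, "along with the skb" here means the OVS control block that
lives in skb->cb. A rough sketch of what that might look like (existing
fields paraphrased from memory, the new field name is illustrative):

    /*
     * Simplified view of the per-packet OVS control block stored in
     * skb->cb.  The new field holds the Netlink PID to use for upcalls
     * triggered while processing an OVS_PACKET_CMD_EXECUTE request.
     */
    struct ovs_skb_cb {
        struct vport *input_vport;  /* port the packet entered on */
        u16 mru;                    /* maximum received unit */
        u16 acts_origlen;           /* original length of the actions */
        u32 cutlen;                 /* bytes to trim before output */
        u32 probability;            /* sampling probability */
        u32 upcall_pid;             /* new: PID for recirculation upcalls */
    };
    #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)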
>>
>> With this change we're almost full on the OVS sk_buff control block.
>> Might be good to mention it in the commit message if you're respinning.
>
> Are we full? The skb->cb size is 48 bytes and we're only using 24
> with this change, unless I'm missing something.
Hrrm... I guess I miscounted. Yes, I agree, with this change we're at 24
bytes in use on a 64-bit platform. Okay, no worries.
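For reference, the kernel guards that limit at build time with a check
along these lines, so overflowing the 48-byte cb area fails to compile
rather than corrupting the skb:

    /* Paraphrased compile-time guard: the OVS control block must fit
     * within the skb->cb scratch area. */
    BUILD_BUG_ON(sizeof(struct ovs_skb_cb) > sizeof_field(struct sk_buff, cb));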
> Best regards, Ilya Maximets.