Message-ID: <f7tcyaisqu5.fsf@redhat.com>
Date: Wed, 02 Jul 2025 08:46:26 -0400
From: Aaron Conole <aconole@...hat.com>
To: Ilya Maximets <i.maximets@....org>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>, Eric
Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo
Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
dev@...nvswitch.org, linux-kernel@...r.kernel.org, Eelco Chaudron
<echaudro@...hat.com>
Subject: Re: [PATCH net-next] net: openvswitch: allow providing upcall pid
for the 'execute' command
Ilya Maximets <i.maximets@....org> writes:
> On 7/2/25 2:14 PM, Aaron Conole wrote:
>> Ilya Maximets <i.maximets@....org> writes:
>>
>>> When a packet enters the OVS datapath and there is no flow to handle it,
>>> the packet goes to userspace through a MISS upcall. With the per-CPU
>>> upcall dispatch mechanism, we use the current CPU id to select the
>>> Netlink PID on which to send this packet. This allows us to send
>>> packets from the same traffic flow through the same handler.
>>>
>>> The handler will process the packet, install the required flow into the
>>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>>
>>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>>> recirculation action that will pass the (likely modified) packet
>>> through the flow lookup again. And if the flow is not found, the
>>> packet will be sent to userspace again through another MISS upcall.
>>>
>>> However, the handler thread in userspace is likely running on a
>>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>>> in the syscall context of that thread. So, when the time comes to
>>> send the packet through another upcall, the per-CPU dispatch will
>>> choose a different Netlink PID, and this packet will end up processed
>>> by a different handler thread on a different CPU.
>>
>> Just wondering, but why can't we choose the existing core's handler
>> when running packet_cmd_execute? For example, when looking into the
>> per-cpu table we know what the current core is, so can we just queue to
>> that one? I actually thought that's what the PER_CPU dispatch mode was
>> supposed to do.
>
> This is exactly how it works today, and that is the problem: our
> userspace handler is running on a different CPU, so the 'current CPU'
> during packet_cmd_execute is different from the one where the kernel
> was processing the original upcall.
>
>> Or is it that we want to make sure we keep the
>> association between the skbuff for re-injection always?
>
> We want the same packet to be enqueued to the same upcall socket after
> each recirculation, so it gets handled by the same userspace thread.
>
>>
>>> The process continues as long as there are new recirculations, each
>>> time the packet goes to a different handler thread before it is sent
>>> out of the OVS datapath to the destination port. In real setups the
>>> number of recirculations can go up to 4 or 5, sometimes more.
>>
>> Is it because the userspace handler threads are being rescheduled across
>> CPUs?
>
> Yes. Userspace handlers are not pinned to a specific core in most cases,
> so they will be running on different CPUs and will float around.
>
>> Do we still see this behavior if we pinned each handler thread to
>> a specific CPU rather than letting the scheduler make the decision?
>
> If you pin each userspace thread to a core that is specified in the
> PCPU_UPCALL_PIDS for the socket that it is listening on, then the
> problem will go away, as the packet_cmd_execute syscall will be executed
> on the same core where the kernel received the original packet.
> However, that's not possible in many cases (reserved CPUs, different
> CPU affinity for IRQs and userspace applications, etc.) and not
> desirable, as it may impact the performance of the system, because the
> kernel and userspace will compete for the same core.
Just wanted to make sure I was understanding the problem. Okay, this
makes sense.
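For completeness, pinning a handler thread to the CPU that backs its
upcall socket would look roughly like the sketch below (hypothetical
helper; OVS does not do this by default, and as noted above it is often
not possible or desirable):

    /*
     * Hypothetical sketch: pin a handler thread to a given CPU so that
     * packet_cmd_execute runs on the same core that received the
     * original upcall.  The cost is that kernel and userspace then
     * compete for that core.
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_handler_to_cpu(pthread_t thread, int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }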
>>
>>> There is always a chance to re-order packets while processing upcalls,
>>> because userspace will first install the flow and then re-inject the
>>> original packet. So, there is a race window when the flow is already
>>> installed and the second packet can match it and be forwarded to the
>>> destination before the first packet is re-injected. But the fact that
>>> packets are going through multiple upcalls handled by different
>>> userspace threads makes the reordering noticeably more likely, because
>>> we not only have a race between the kernel and a userspace handler
>>> (which is hard to avoid), but also between multiple userspace handlers.
>>>
>>> For example, let's assume that 10 packets got enqueued through a MISS
>>> upcall for handler-1. It will start processing them, install the flow
>>> into the kernel and start re-injecting the packets, from where they
>>> will go through another MISS to handler-2. Handler-2 will install the
>>> flow into the kernel and start re-injecting the packets. Meanwhile, as
>>> handler-1 continues to re-inject the last of the 10 packets, they will
>>> hit the flow installed by handler-2 and be forwarded without going to
>>> handler-2, while handler-2 is still re-injecting the first of these 10
>>> packets. Given multiple recirculations and misses, these 10 packets
>>> may end up completely mixed up on the output from the datapath.
>>>
>>> Let's allow userspace to specify on which Netlink PID the packets
>>> should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
>>> This makes it possible to ensure that all the packets are processed
>>> by the same handler thread in userspace even when they are upcalled
>>> multiple times in the process. Packets will remain in order since
>>> they will be enqueued to the same socket and re-injected in the same
>>> order. This doesn't eliminate re-ordering as stated above, since we
>>> still have a race between the kernel and the userspace thread, but it
>>> does eliminate races between multiple userspace threads.
>>>
>>> Userspace knows the PID of the socket on which the original upcall is
>>> received, so there is no need to send it up from the kernel.
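As a sketch of the userspace side under this proposal, re-injecting a
packet and passing the original upcall PID along might look like the
following. This assumes the new attribute is exposed as
OVS_PACKET_ATTR_UPCALL_PID (the exact name is this patch's choice) and
uses libnl-genl for brevity; everything else is illustrative:

    /*
     * Illustrative only: build an OVS_PACKET_CMD_EXECUTE request and
     * tell the kernel which Netlink PID to use for further MISS upcalls
     * caused by recirculation.  Error handling is omitted.
     */
    #include <stdint.h>
    #include <stddef.h>
    #include <linux/openvswitch.h>
    #include <netlink/netlink.h>
    #include <netlink/msg.h>
    #include <netlink/attr.h>
    #include <netlink/genl/genl.h>

    static int execute_with_upcall_pid(struct nl_sock *sk, int packet_family,
                                       int dp_ifindex,
                                       const void *pkt, size_t pkt_len,
                                       const void *key, size_t key_len,
                                       const void *acts, size_t acts_len,
                                       uint32_t upcall_pid)
    {
        struct nl_msg *msg = nlmsg_alloc();
        struct ovs_header *hdr;
        int err;

        hdr = genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, packet_family,
                          sizeof(*hdr), 0, OVS_PACKET_CMD_EXECUTE,
                          OVS_PACKET_VERSION);
        hdr->dp_ifindex = dp_ifindex;

        nla_put(msg, OVS_PACKET_ATTR_PACKET, pkt_len, pkt);
        nla_put(msg, OVS_PACKET_ATTR_KEY, key_len, key);
        nla_put(msg, OVS_PACKET_ATTR_ACTIONS, acts_len, acts);
        /* New attribute from this patch (name assumed): the PID of the
         * socket the original upcall arrived on. */
        nla_put_u32(msg, OVS_PACKET_ATTR_UPCALL_PID, upcall_pid);

        err = nl_send_auto(sk, msg);
        nlmsg_free(msg);
        return err;
    }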
>>>
>>> The solution requires storing the value somewhere for the duration of
>>> the packet processing. There are two potential places for this: our
>>> skb extension or the per-CPU storage. It's not clear which is better,
>>> so just follow the currently used scheme of storing this kind of
>>> thing along with the skb.
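Concretely, "along with the skb" here means the OVS control block that
lives in skb->cb. A rough sketch of what that might look like (existing
fields paraphrased from memory, the new field name is illustrative):

    /*
     * Simplified view of the per-packet OVS control block stored in
     * skb->cb.  The new field holds the Netlink PID to use for upcalls
     * triggered while processing an OVS_PACKET_CMD_EXECUTE request.
     */
    struct ovs_skb_cb {
        struct vport *input_vport;  /* port the packet entered on */
        u16 mru;                    /* maximum received unit */
        u16 acts_origlen;           /* original length of the actions */
        u32 cutlen;                 /* bytes to trim before output */
        u32 probability;            /* sampling probability */
        u32 upcall_pid;             /* new: PID for recirculation upcalls */
    };
    #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)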
>>
>> With this change we're almost full on the OVS sk_buff control block.
>> Might be good to mention it in the commit message if you're respinning.
>
> Are we full? The skb->cb size is 48 bytes and we're only using 24
> with this change, unless I'm missing something.
Hrrm... I guess I miscounted. Yes, I agree, with this change we're at 24
bytes in use on a 64-bit platform. Okay, no worries.
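For reference, the kernel guards that limit at build time with a check
along these lines, so overflowing the 48-byte cb area fails to compile
rather than corrupting the skb:

    /* Paraphrased compile-time guard: the OVS control block must fit
     * within the skb->cb scratch area. */
    BUILD_BUG_ON(sizeof(struct ovs_skb_cb) > sizeof_field(struct sk_buff, cb));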
> Best regards, Ilya Maximets.