Message-ID: <31b76f9f-d115-45d4-80d3-af7a8c626b58@nvidia.com>
Date: Mon, 1 Sep 2025 09:30:23 +0300
From: Shay Drori <shayd@...dia.com>
To: Jiri Pirko <jiri@...nulli.us>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>,
<pabeni@...hat.com>, <horms@...nel.org>, <netdev@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <ozsh@...dia.com>, <mbloch@...dia.com>,
<tariqt@...dia.com>, <saeedm@...dia.com>
Subject: Re: [RFC net-next] net: devlink: add port function attr for vport ↔ eswitch metadata forwarding
On 28/08/2025 12:03, Jiri Pirko wrote:
> Thu, Aug 28, 2025 at 08:52:29AM +0200, shayd@...dia.com wrote:
>> In some product architectures, the eswitch manager and the exception
>> handler run as separate user space processes. The eswitch manager uses
>> the physical uplink device, while the slow path handler uses a virtual
>> device.
>>
>> In these architectures, the eswitch manager application programs the HW
>> to send the exception packets to a specific vport, and the exception
>> application runs on top of this vport's virtual device and handles these
>> packets.
>>
>> Currently, when packets are forwarded between the eswitch and a vport,
>> no per-packet metadata is preserved. As a result, the slow path handler
>> cannot implement features that require visibility into the packet's
>> hardware context.
>
> A vendor-specific slow path. Basically you provide a possibility for
> user to pass a binary blob to hw along with every TX'ed packet and
> vice versa. That looks quite odd tbh. I mean, isn't this horribly
> breaking the socket abstraction? Also, isn't this horribly breaking the
> forwarding offloading model when HW should just mimic the behaviour of
> the kernel?
This feature is targeted at kernel-bypass applications, which already
operate outside the Linux kernel’s traditional networking stack.
These applications need access to hardware-specific metadata to make
forwarding decisions or to drive offload acceleration, and they do not use
“representor” devices in the same way as kernel-driven virtual functions.
The devlink interface configures the hardware switch, and these knobs
control whether metadata is preserved between the eswitch manager and the
exception handler.
>
>
>
>>
>> This RFC introduces two optional devlink port-function attributes. When
>> these two capabilities are enabled for a function of the port, the device
>> makes the necessary preparations for the function to exchange
>> metadata with the eswitch.
>>
>> rx_metadata
>> When enabled, packets received by the vport from the eswitch will be
>> prepended with a device-specific metadata header. This allows the slow
>> path application to receive the full context of the packet as seen by
>> the hardware.
>>
>> tx_metadata
>> When enabled, the vport can send a packet prepended with a metadata
>> header. The eswitch hardware consumes this metadata to steer the packet.
>>
>> Together they allow the slow-path application to process slow-path
>> events in user-space at line rate while still leaving the common
>> fast-path in hardware.
>>
>> User-space interface
>> Enable / disable is done with existing devlink port-function syntax:
>>
>> $ devlink port function set pci/0000:06:00.0/3 rx_metadata enable
>> $ devlink port function set pci/0000:06:00.0/3 tx_metadata enable
>>
>> Querying the state shows the new knobs:
>>
>> $ devlink port function show pci/0000:06:00.0/3
>> pci/0000:06:00.0/3:
>> roce enabled rx_metadata enabled tx_metadata enabled
>>
>> Disabling is symmetrical:
>>
>> $ devlink port function set pci/0000:06:00.0/3 rx_metadata disable
>> $ devlink port function set pci/0000:06:00.0/3 tx_metadata disable
>>
>> Signed-off-by: Shay Drory <shayd@...dia.com>
>>
>>
>> --
>> 2.38.1
>>
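
To make the rx_metadata / tx_metadata description above more concrete, here
is a rough user-space sketch of how a slow-path application might strip a
prepended header on receive and prepend one on transmit. The RFC only says
the header is "device-specific" and does not define it, so the 8-byte layout,
the field names (flow_tag, vport), and the helper names split_rx / build_tx
are all hypothetical and chosen purely for illustration.

```python
import struct

# HYPOTHETICAL layout, for illustration only. The RFC calls the header
# "device-specific" and does not define its fields or size.
#   u32 flow_tag | u16 vport | u16 reserved   (network byte order, 8 bytes)
META_FMT = "!IHH"
META_LEN = struct.calcsize(META_FMT)


def split_rx(frame: bytes):
    """Split a frame received with rx_metadata enabled into (meta, packet)."""
    if len(frame) < META_LEN:
        raise ValueError("frame shorter than the assumed metadata header")
    flow_tag, vport, _reserved = struct.unpack_from(META_FMT, frame)
    return {"flow_tag": flow_tag, "vport": vport}, frame[META_LEN:]


def build_tx(packet: bytes, flow_tag: int, vport: int) -> bytes:
    """Prepend the assumed metadata header for a tx_metadata-enabled vport."""
    return struct.pack(META_FMT, flow_tag, vport, 0) + packet
```

A real application would take the header layout from the device
documentation rather than from a fixed format string like the one assumed
here.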