Message-ID: <bee1e240-cc6a-4c30-a2ae-6f7974627053@nvidia.com>
Date: Thu, 8 May 2025 12:04:22 +0300
From: Mark Bloch <mbloch@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Moshe Shemesh <moshe@...dia.com>, netdev@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Donald Hunter <donald.hunter@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Jonathan Corbet <corbet@....net>, Andrew Lunn <andrew+netdev@...n.ch>,
Tariq Toukan <tariqt@...dia.com>
Subject: Re: [RFC net-next 0/5] devlink: Add unique identifier to devlink port
function
On 08/05/2025 3:43, Jakub Kicinski wrote:
> On Tue, 6 May 2025 18:34:22 +0300 Mark Bloch wrote:
>>>> Flow:
>>>> 1. A user requests a container with networking connectivity.
>>>> 2. Kubernetes allocates a VF on host X. An agent on the host handles VF
>>>> configuration and sends the PF number and VF index to the central
>>>> management software.
>>>
>>> What is "central management software" here? Deployment specific or
>>> some part of k8s?
>>
>> It's the k8s API server.
>>
>>>
>>>> 3. An agent on the DPU side detects the changes made on host X. Using
>>>> the PF number and VF index, it identifies the corresponding
>>>> representor, attaches it to an OVS bridge, and allows OVN to program
>>>> the relevant steering rules.
>>>
>>> What does it mean that DPU "detects it", what's the source and
>>> mechanism of the notification?
>>> Is it communicating with the central SW during the process?
>>
>> The agent (running in the ARM/DPU) listens for events from the k8s API server.
>
> Interesting. So a deployment with no security boundaries. The internals
> of the IPU and the k8s on the host are in the same domain of control.
The VF is created on host X, but the corresponding representor appears
on a different host, the IPU. Naturally, they need to be able to
synchronize and exchange information for everything to work correctly.
>
> So how does the user remotely power cycle the hosts?
Why should a user be able to power cycle the hosts?
Are you asking about the administrator?
>
> What I'm getting at is that your mental model seems to be missing any
> sort of HW inventory database, which lists all the hosts and how they
> plug into the DC. The administrator of the system must already know
> where each machine is exactly in the chassis for basic DC ops. And
> that HW DB is normally queried in what you describe. If there is any
> security domain crossing in the picture it will require cross checking
> against that HW DB.
You're assuming that external host numbering and PCI enumeration are
stable. Also, users can determine the mapping only after creating
VFs. But even then, the mapping is indirect, e.g.: “I created a VF on
this PF, and I see a single representor appear on the IPU, so they
must be linked.” That approach is fragile and error-prone.
Also, keep in mind: the external hosts and their kernels shouldn’t
be aware they’re part of a multi-host system. With our current
approach, you just need to provide a host-to-IPU mapping
upfront, no guesswork involved.
Just thinking out loud: once this feature is in place, we might
not even need a static mapping between external hosts and IPU hosts.
If VUID and FUID are globally unique, the following workflow
becomes possible:
- A user requests a container with network connectivity.
- k8s allocates and configures a VF on one of the hosts.
It then sends the VUID, PF number, and VF index for the new VF
to the k8s API server.
- Somewhere in the network, a representor appears. An agent detects
this and notifies the k8s API server, including its FUID,
PF number, and VF index.
- The API server matches the VF and representor data based on the
globally unique identifiers and sends the relevant information
back to the agent that reported the representor creation.
- The agent attaches the representor to the OVS bridge and lets
OVN program the appropriate steering rules.
This would remove the need for predefined host-to-IPU mappings
and allow for a more dynamic and flexible setup.
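To make the matching step concrete, here is a rough sketch of what the
API server side could do. It is only an illustration: the Report and
Matcher names are made up, and it assumes the identifier reported from
the host (VUID) and the one reported alongside the representor (FUID)
compare equal for the same function, which is not spelled out above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Report:
    """One report sent to the API server (hypothetical structure)."""
    uid: str       # VUID on the host side, FUID on the representor side
    pf: int        # PF number as reported by the sender
    vf: int        # VF index as reported by the sender
    reporter: str  # made-up name of the host or agent that sent it


class Matcher:
    """Pairs a VF report with the representor report carrying the same uid."""

    def __init__(self) -> None:
        self._vfs: Dict[str, Report] = {}
        self._reps: Dict[str, Report] = {}

    def report_vf(self, report: Report) -> Optional[Tuple[Report, Report]]:
        self._vfs[report.uid] = report
        return self._pair(report.uid)

    def report_representor(self, report: Report) -> Optional[Tuple[Report, Report]]:
        self._reps[report.uid] = report
        return self._pair(report.uid)

    def _pair(self, uid: str) -> Optional[Tuple[Report, Report]]:
        # Assumption: equal identifiers mean "same function". The two
        # reports may arrive in either order, so pair only once both exist.
        vf = self._vfs.get(uid)
        rep = self._reps.get(uid)
        if vf is None or rep is None:
            return None
        return vf, rep


matcher = Matcher()
first = matcher.report_vf(Report("uid-a", pf=0, vf=3, reporter="host-x"))
pair = matcher.report_representor(Report("uid-a", pf=0, vf=3, reporter="dpu-agent-1"))
```

Once a pair exists, the server knows which agent to send the VF details
to (pair[1].reporter here) without consulting any static host-to-IPU table.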
>
> I don't think this is sufficiently well established to warrant new uAPI.
> You can use a UUID and pass it via ndo_get_phys_port_id.
phys_port_id only applies to netdev interfaces, whereas this use case is
broader and more aligned with devlink. We believe devlink is a more
appropriate place for this functionality.
Mark