Message-ID: <0f3cd937-8a10-47c3-8016-4ff7f31e506b@intel.com>
Date: Mon, 24 Feb 2025 11:15:33 +0100
From: Przemek Kitszel <przemyslaw.kitszel@...el.com>
To: Jacob Keller <jacob.e.keller@...el.com>, Jakub Kicinski <kuba@...nel.org>,
Jiri Pirko <jiri@...nulli.us>
CC: <intel-wired-lan@...ts.osuosl.org>, Tony Nguyen
<anthony.l.nguyen@...el.com>, Cosmin Ratiu <cratiu@...dia.com>, Tariq Toukan
<tariqt@...dia.com>, <netdev@...r.kernel.org>, Konrad Knitter
<konrad.knitter@...el.com>, <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Andrew Lunn
<andrew@...n.ch>, <linux-kernel@...r.kernel.org>, ITP Upstream
<nxne.cnse.osdt.itp.upstreaming@...el.com>, Carolina Jubran
<cjubran@...dia.com>
Subject: Re: [RFC net-next v2 1/2] devlink: add whole device devlink instance
On 2/21/25 23:50, Jacob Keller wrote:
>
>
> On 2/20/2025 5:45 PM, Jakub Kicinski wrote:
>> On Wed, 19 Feb 2025 17:32:54 +0100 Przemek Kitszel wrote:
>>> Add support for a whole-device devlink instance, intended as an entity
>>> over all PF devices on a given physical device.
>>>
>>> In the case of the ice driver we have multiple PF devices (each with its
>>> own devlink dev representation) that have separate driver instances
>>> loaded, yet they still share a lot of resources by virtue of being on the
>>> same HW. Examples include the PTP clock and the RSS LUT. Historically such
>>> things were assigned to PF0, but that was neither clear nor working well.
>>> Now they are moved under struct ice_adapter, of which there is just one
>>> instance per HW.
>>>
>>> This patch adds a devlink instance that corresponds to that ice_adapter,
>>> to allow arbitration over resources (such as the RSS LUT) via it (further
>>> in the series (RFC NOTE: stripped out so far)).
>>>
>>> Thanks to Wojciech Drewek for the very nice naming of the devlink instance:
>>> PF0: pci/0000:00:18.0
>>> whole-dev: pci/0000:00:18
>>> But I made this a param for now (the driver is free to pass just "whole-dev").
>>
>> Which only works nicely if you're talking about functions not full
>> separate links :) When I was thinking about it a while back my
That's why I made the name a param, instead of letting devlink infer it.
I admit that with ice/e800 we could have done a better job of splitting
things into devlink and devlink port parts at the beginning, but with the
growing complexity of devices we are going to hit the "we need something
above" issue anyway.
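
To make "name as a param" a bit more concrete, here is a rough sketch of
what I have in mind; the helper below is hypothetical (not the RFC code),
and the driver could just as well pass a fixed string like "whole-dev":

#include <linux/pci.h>

/* Hypothetical helper: the driver decides what the whole-device instance
 * is called, instead of devlink inferring it by stripping the PCI
 * function number (which only works when all functions share one link).
 */
static void example_whole_dev_name(struct pci_dev *pdev, char *buf, size_t len)
{
	/* e.g. "pci/0000:00:18" */
	snprintf(buf, len, "pci/%04x:%02x:%02x",
		 pci_domain_nr(pdev->bus), pdev->bus->number,
		 PCI_SLOT(pdev->devfn));
}
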
>> intuition was that we should have a single instance, just accessible
>> under multiple names. But I'm not married to that direction if there
>> are problems with it.
>>
>
> I would also prefer to see a single devlink instance + one port for each
> function. I think that's the most natural fit for how devlink works, and
> it gives us a natural entry point for "whole device" configuration. It
> also limits the amount of duplicate data; for example, "devlink dev info"
> currently reports the same data once for each function.
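
As a rough illustration (my sketch, not anything from the RFC) of the
"one devlink + one devlink_port per PF" shape, each PF would register a
port on the shared instance instead of allocating its own devlink:

#include <net/devlink.h>

/* Sketch only; assumes the caller holds the devlink instance lock,
 * as required by devl_port_register().
 */
static int example_register_pf_port(struct devlink *whole_dev,
				    struct devlink_port *dl_port,
				    unsigned int pf_id)
{
	struct devlink_port_attrs attrs = {
		.flavour = DEVLINK_PORT_FLAVOUR_PHYSICAL,
		.phys.port_number = pf_id,
	};

	devlink_port_attrs_set(dl_port, &attrs);
	return devl_port_register(whole_dev, dl_port, pf_id);
}
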
>
>
> The main things I think this causes problems for are:
>
> 1) PCIe direct assignment with IOV
>
> This could be an issue in cases where someone assigns only one function
> to a VM. The VM would only see one function and the functions outside
> the VM would not interact with it. IMHO this is not a big deal, as I
> think simply assigning the entire device to the VM is preferable.
>
> We also already have this issue with ice_adapter, and we've seen that we
> need to do this in order to make the device and software function
> properly. Assigning single functions does not make much sense to me. In
> addition, there is SR-IOV if you want to assign a portion of the device
> to a VM.
>
> 2) locking may get complicated
>
If any driver were to go the "plain devlink" + "port devlink" route,
taking devl_lock() on both the whole-device devlink and the per-function
devlink should be enough.
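
A minimal sketch of what I mean (my illustration, not code from the
series); lock ordering and lockdep details are glossed over here:

#include <net/devlink.h>

static void example_whole_dev_op(struct devlink *whole_dev,
				 struct devlink *pf_devlink)
{
	/* whole-device instance first, then the per-function instance */
	devl_lock(whole_dev);
	devl_lock(pf_devlink);

	/* ... touch shared (ice_adapter level) and per-PF state ... */

	devl_unlock(pf_devlink);
	devl_unlock(whole_dev);
}
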
> If we have an entry point which needs to interact with ice_pf data, the
> locking could get a little complicated, but I think this is also an
> issue we can solve with ice_adapter, as a natural place to put
> whole-device functionality.
>
> I have also investigated in the past whether it was possible to make the
> PCI bus subsystem wrap the functions together somehow, representing them
> to the host as a sort of pseudo "single-function" device even though the
> hardware is multi-function. This seemed like a natural way to prevent
> direct assignment of anything less than the whole device... but I was
> never able to figure out how to even start on such a path.
I get that the general sentiment is "leave the complexities to the
driver/other layers", though that impression is based on reading only a
limited number of internal (non-networking) mailing lists.
It will be great when devlink is finally used by non-networking
drivers :)
>
>>> $ devlink dev # (Interesting part of output only)
>>> pci/0000:af:00:
>>> nested_devlink:
>>> pci/0000:af:00.0
>>> pci/0000:af:00.1
BTW, I have a local version that adds SR-IOV VFs' devlink instances
as nested ones under the PF ones (a rough driver-side sketch follows
below, after the quoted output):
pci/0000:af:00.1:
nested_devlink:
pci/0000:af:05.0
>>> pci/0000:af:00.2
>>> pci/0000:af:00.3
>>> pci/0000:af:00.4
>>> pci/0000:af:00.5
>>> pci/0000:af:00.6
>>> pci/0000:af:00.7
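
For the VF nesting mentioned above, a sketch of the driver-side wiring;
this assumes the existing devl_nested_devlink_set() helper and that the
caller holds whatever locks it expects -- my local patch may differ:

#include <net/devlink.h>

static int example_link_vf_devlink(struct devlink *pf_devlink,
				   struct devlink *vf_devlink)
{
	/* make the VF instance appear under nested_devlink of the PF */
	return devl_nested_devlink_set(pf_devlink, vf_devlink);
}
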
>>
>> Could you go into more details on what stays on the "nested" instances
>> and what moves to the "whole-dev"? Jiri recently pointed out to y'all
>> cases where stuff that should be a port attribute was an instance
>> attribute.
My initial assumption was that everything stays as-is.
>
> I suspect this is a case of "we have separate devlink instances per
> function, so we just put it in the devlink".
It's true that our 1:1 mapping of PF:devlink made this the naively obvious
choice.
Moving almost all of the stuff we now have under the "PF" devlink into
devlink port instances is likely too much of a uAPI change, but let's ask
the question out loud: should we?