netdev - Re: [Intel-wired-lan] [RFC net-next (what uAPI?) ice: add support for more than 16 RSS queues for VF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <171a8f73-5dd4-4d0f-b871-a341539a2fca@intel.com>
Date: Wed, 19 Jun 2024 18:10:16 +0200
From: Przemek Kitszel <przemyslaw.kitszel@...el.com>
To: Jakub Kicinski <kuba@...nel.org>, Jiri Pirko <jiri@...nulli.us>
CC: Pawel Chmielewski <pawel.chmielewski@...el.com>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, "Knitter, Konrad" <konrad.knitter@...el.com>,
	"Ahmed Zaki" <ahmed.zaki@...el.com>, "Samudrala, Sridhar"
	<sridhar.samudrala@...el.com>, "intel-wired-lan@...ts.osuosl.org"
	<intel-wired-lan@...ts.osuosl.org>, Simon Horman <horms@...nel.org>, "Mateusz
 Polchlopek" <mateusz.polchlopek@...el.com>, "Nguyen, Anthony L"
	<anthony.l.nguyen@...el.com>, Paolo Abeni <pabeni@...hat.com>
Subject: Re: [Intel-wired-lan] [RFC net-next (what uAPI?) ice: add support for
 more than 16 RSS queues for VF

On 5/6/24 16:34, Przemek Kitszel wrote:
> On 4/30/24 03:59, Jakub Kicinski wrote:
>> On Fri, 26 Apr 2024 15:22:02 +0200 Przemek Kitszel wrote:
>>> ## devlink resources (with current API)
>>> `devlink resource` is compelling, partially given the name sounds like a
>>> perfect match. But when we dig just a little bit, the current Path+sizes
>>> (min,max,step) is totally off to what is the most elegant picture of the
>>> situation. In order to fit into existing uAPI, I would need to register
>>> VFs as PF's resource, then GLOBAL LUT and PF LUT as a sub resource to
>>> that (each VF gets two entries under it; plus two additional ones for
>>> PF) I don't like it, I also feel like there is not that much use of
>>> current resources API (it's not natural to use it for distribution, only
>>> for limitation).
>>
>> Can you share more on how that would look like?
> 
> something like below (though I have added one more layer to illustrate
> broader idea, it could be chopped down)
> 
> [1]
> 
>>
>>  From the description it does not sound so bad. Maybe with some CLI / UI
>> changes it will be fine?
>>
>>> ## devlink resources (with extended API)
>>> It is possible to extend current `devlink resource` so instead of only
>>> Path+size, there would be also Path+Owner option to use.
>>> The default state for ice driver would be that PFs owns PF LUTs, GLOBAL
>>> LUTs are all free.
>>>
>>> example proposed flow to assign a GLOBAL LUT to VF0 and PF LUT to VF1:
>>> pf=0000:03:00.0  # likely more meaningful than VSI idx, but open for
>>> vf0=0000:03:00.1 #                                       suggestions
>>> vf1=0000:03:00.2
>>> devlink resource set pci/$pf path /lut/lut_table_512 owner $pf
>>> devlink resource set pci/$pf path /lut/lut_table_2048 owner free
>>> devlink resource set pci/$pf path /lut/lut_table_512 owner $vf0
>>> devlink resource set pci/$pf path /lut/lut_table_2048 owner $vf1
>>
>> Don't we want some level of over-subscription to be allowed?
> 
> In theory we could reuse/share tables, especially with the case of auto
> filled ones, not sure if I would want to implement that with the very
> first series, but good point to design interface with that in mind.
> To move in this direction we could make numbered LUTs an entity that is
> set/show'ed (digression: this would feel more like mmap() than malloc())
> (The imaginary output below does not do that).
> 
> The main problem with the [1] above for "current API" for me is lack of
> aggregate device [2] in the structures represented by devlink. Let me
> show what it would be with one more layer (so I put PFs under that, and
> VFs one layer down).
> 
> Here is an example with 2 PFs, one of with with 3 VFs, each of them with
> different LUT
> 
> $ devlink resource show $device
> $device:
>    name lut size 4 unit entry
>      resources:
>        name lut_table_2048 size 2 unit entry size_min 0 size_max 8 
> size_gran 1
>        name lut_table_512 size 2 unit entry size_min 0 size_max 16 
> size_gran 1
>    name functions size 5 unit entry
>      resources:
>        name pf0
>          resources:
>            name lut_table_2048 size 0 size_max 1 ...
>            name lut_table_512 size 1 size_max 1 ...
>            name vf0
>              resources:
>                name lut_table_2048 size 1 size_max 1 ...
>                name lut_table_512 size 0 size_max 1 ...
>            name vf1
>              resources:
>                name lut_table_2048 size 0 size_max 1 ...
>                name lut_table_512 size 1 size_max 1 ...
>            name vf2
>              resources:
>                name lut_table_2048 size 0 size_max 1 ...
>                name lut_table_512 size 0 size_max 1 ...
>        name pf1
>          resources:
>            name lut_table_2048 size 1 ...
>            name lut_table_512 size 0 ...
> 
> where $device stands for the aggregate device (covers 8 PFs in case of
> 8-port split used)


As of now I started playing with that so I have an aggregate devlink
instance, but it is registered for the very same struct device as PF0.

So that presents an issue for user trying to show only PF0, but output
shows all the data (otherwise there would be no option to show
all of it)
Assigning resources will be easier as aggregate device could simply
disallow it, but it is similar.

Perhaps some userspace devlink tweak would resolve that, but would it be
nice to have different devlink handle (without creating a dummy PCI
device).

Any advice how to make it distinguished?
(IOW: give my aggregate/whole device a different name that any PF)

Please find below what I have now in terms of nesting (doubled device
addresses are easy to spot):
iproute2 $ ./devlink/devlink dev
pci/0000:18:00.0 # PF0, as I devlink_register() both PFs and whole-dev
pci/0000:18:00.0: # whole-dev one
   nested_devlink:
     pci/0000:18:00.0 # PF0
pci/0000:af:00.0
pci/0000:af:00.0:
   nested_devlink:
     pci/0000:af:00.0
     pci/0000:af:00.1
     pci/0000:af:00.2
     pci/0000:af:00.3
     pci/0000:af:00.4
     pci/0000:af:00.5
     pci/0000:af:00.6
     pci/0000:af:00.7
pci/0000:af:00.1
pci/0000:af:00.2
pci/0000:af:00.3
pci/0000:af:00.4
pci/0000:af:00.5
pci/0000:af:00.6
pci/0000:af:00.7


> and all named PF/VFs in the output would need "dummy" size calculations
> (I would like to remove this need though)
> 
> When all resources are assigned, all "size 0" entries should have also
> "size_max 0" to indicate no room for growth.
> 
> [2] aggregate device discussion:
> https://lore.kernel.org/intel-wired-lan/cfe84890-f999-4b97-a012-6f9fd16da8e3@intel.com/