Message-ID: <acbe612f-faaa-4c70-802f-87504ee7c274@intel.com>
Date: Thu, 18 Apr 2024 14:23:03 -0700
From: "Nambiar, Amritha" <amritha.nambiar@...el.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: <netdev@...r.kernel.org>, <davem@...emloft.net>, <edumazet@...gle.com>,
	<pabeni@...hat.com>, <ast@...nel.org>, <sdf@...gle.com>,
	<lorenzo@...nel.org>, <tariqt@...dia.com>, <daniel@...earbox.net>,
	<anthony.l.nguyen@...el.com>, <lucien.xin@...il.com>, <hawk@...nel.org>,
	<sridhar.samudrala@...el.com>
Subject: Re: [net-next,RFC PATCH 0/5] Configuring NAPI instance for a queue

On 4/11/2024 6:47 PM, Jakub Kicinski wrote:
> On Thu, 11 Apr 2024 15:46:45 -0700 Nambiar, Amritha wrote:
>> On 4/9/2024 4:21 PM, Jakub Kicinski wrote:
>>> On Fri, 05 Apr 2024 13:09:28 -0700 Amritha Nambiar wrote:
>>>> $ ./cli.py --spec netdev.yaml --do queue-set  --json='{"ifindex": 12, "id": 0, "type": 0, "napi-id": 595}'
>>>> {'id': 0, 'ifindex': 12, 'napi-id': 595, 'type': 'rx'}
>>>
>>> NAPI ID is not stable. What happens if you set the ID, bring the
>>> device down and up again? I think we need to make NAPI IDs stable.
>>
>> I tried this (device down/up and checking the NAPIs) on both bnxt and
>> intel/ice.
>> On bnxt: new NAPI IDs are created sequentially once the device comes
>> back up after being brought down.
>> On ice: the NAPI IDs are stable and remain the same once the device
>> comes back up after being brought down.
>>
>> In the case of ice, device down/up executes napi_disable/napi_enable.
>> The NAPI IDs are not lost as netif_napi_del is not called at IFF_DOWN.
>> On IFF_DOWN, the IRQ associations with the OS are freed, but the
>> resources allocated for the vectors, and hence the NAPIs for those
>> vectors, persist (unless there is an unload/reconfig).
> 
> SG! So let's just make sure we cover that in tests.
> 
>>> What happens if you change the channel count? Do we lose the config?
>>> We try never to lose explicit user config. I think for simplicity
>>> we should store the config in the core / common code.
>>
>> Yes, we lose the config in the case of re-configuring channels. The
>> reconfig path involves freeing the vectors and reallocating based on
>> the new channel config, so, for the NAPIs associated with the vectors,
>> netif_napi_del and netif_napi_add execute, creating new NAPI IDs
>> sequentially.
>>
>> Wouldn't losing the explicit user config make sense in this case? By
>> changing the channel count, the user has updated the queue layout, the
>> queue<>vector mappings etc., so I think the previous configs from set
>> queue<>NAPI should be overwritten with the new config from set-channel.
> 
> We do prevent indirection table from being reset on channel count
> change. I think same logic applies here..
> 

Okay. I tried this on bnxt (this may be outside the scope and secondary, 
but hoping the additional information helps).
It looks like bnxt differentiates whether the indirection table was based 
on driver defaults or on explicit user configuration. If the indirection 
table came from driver defaults, changing the channel count to fewer 
queues is allowed. If it was based on explicit user configuration, 
changing the channel count to fewer queues is not allowed, as the 
indirection table might then point to inactive queues. So the RSS user 
configuration is preserved by blocking new channel configurations that 
do not align with it.
Applying the same logic here would mean that changing the channel count 
to fewer queues than the 'default queue for the last user-configured 
NAPI ID' would have to be prevented. This becomes difficult to track 
unless the pre-set default queue <> NAPI configs are maintained.
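
Roughly, the kind of check that bnxt-style logic implies would look like 
the sketch below (illustrative only; the function and parameter names 
are made up, not the actual bnxt code): scan the user-set indirection 
table for the highest queue index it references and refuse a new RX 
channel count at or below it.

/* Hedged sketch, not actual driver code: if the user has set the RSS
 * indirection table explicitly, find the highest queue index it
 * references and refuse a new RX channel count at or below it.
 */
static int check_channels_vs_rss(const u32 *indir, u32 indir_size,
				 u32 new_rx_count, bool user_set_indir)
{
	u32 i, max_ring = 0;

	if (!user_set_indir)
		return 0;	/* driver-default table can simply be rebuilt */

	for (i = 0; i < indir_size; i++)
		if (indir[i] > max_ring)
			max_ring = indir[i];

	return max_ring >= new_rx_count ? -EINVAL : 0;
}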

>>> How does the user know whether queue <> NAPI association is based
>>> on driver defaults or explicit configuration?
>>
>> I am not sure of this. ethtool shows pre-set defaults and current
>> settings, but in this case, it is tricky :(
> 
> Can you say more about the use case for moving the queues around?
> If you just want to have fewer NAPI vectors and more queues, but
> don't care about exact mapping - we could probably come up with
> a simpler API, no? Are the queues stack queues or also AF_XDP?
> 

I'll try to explain. The goal is to have fewer NAPI pollers. The number 
of NAPI pollers is the same as the number of active NAPIs (one kthread 
per NAPI). It is possible to limit the number of pollers by mapping 
multiple queues to an interrupt vector (fewer vectors, more queues) 
implicitly in the driver. But we are looking for a more granular 
approach: in our case, the queues are grouped into 
queue-groups/rss-contexts, and we would like to reduce the number of 
pollers within certain selected queue-groups/rss-contexts (not all of 
the queue-groups), hence the need for configurability.
This would benefit our hyper-threading use case, where a single physical 
core can be used for both network and application processing. If the 
NAPI-to-queue association is known, we can pin the NAPI thread to a 
logical core and the application thread to the corresponding sibling 
logical core.
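
Just to illustrate the pinning half of that, a minimal userspace sketch 
(assuming the NAPI kthread's PID has already been looked up, e.g. by 
matching the napi/<dev>-<id> kthread name): plain sched_setaffinity() is 
enough to tie the poller to one logical CPU, and the application thread 
would be pinned to the sibling CPU the same way. The pid and cpu values 
here are placeholders.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	pid_t napi_pid;
	int cpu;
	cpu_set_t set;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <napi-kthread-pid> <cpu>\n", argv[0]);
		return 1;
	}
	napi_pid = atoi(argv[1]);	/* PID of the NAPI kthread */
	cpu = atoi(argv[2]);		/* logical core chosen for the poller */

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(napi_pid, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}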

The queues are stack queues, not AF_XDP.

>>> I think I mentioned
>>> this in earlier discussions but the configuration may need to be
>>> detached from the existing objects (for one thing they may not exist
>>> at all when the device is down).
>>
>> Yes, we did have that discussion about detaching queues from NAPI. But,
>> I am not sure how to accomplish that. Any thoughts on what other
>> possible object can be used for the configuration?
> 
> We could stick to the queue as the object perhaps. The "target NAPI"
> would just be part of the config passed to the alloc/start callbacks.
> 

Okay.
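
To make the idea concrete, a rough sketch (the names below are 
illustrative only, not an existing or agreed-upon interface): the core 
would keep a small per-queue config object carrying the user-requested 
NAPI and pass it to the driver's queue alloc/start callbacks, so the 
association can be reapplied across a reconfig.

/* Illustrative only -- these names do not exist in the kernel today. */
struct netdev_queue_user_cfg {
	u32	napi_id;	/* 0 == no explicit user setting */
	/* room for future per-queue knobs (page pool params, ...) */
};

struct netdev_queue_mgmt_ops_sketch {
	void *(*queue_mem_alloc)(struct net_device *dev, int idx,
				 const struct netdev_queue_user_cfg *cfg);
	int   (*queue_start)(struct net_device *dev, void *mem, int idx,
			     const struct netdev_queue_user_cfg *cfg);
	int   (*queue_stop)(struct net_device *dev, void *mem, int idx);
};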

>> WRT ice, when the device is down, the queues are still listed and
>> exist as inactive queues, the NAPI IDs exist, and the IRQ associations
>> with the OS are freed.
>>
>>> Last but not least your driver patch implements the start/stop steps
>>> of the "queue API" I think we should pull that out into the core.
>>
>> Agree, it would be good to have these steps in the core, but I think the
>> challenge is that we would still end up with a lot of code in the driver
>> as well, due to all the hardware-centric bits in it.
> 
> For one feature I think adding code in the core is not beneficial.
> But we have multiple adjacent needs, so when we add up your work,
> zero copy, page pool config, maybe queue alloc.. hopefully the code
> in the core will be net positive.
> 
>>> Also the tests now exist - take a look at the sample one in
>>> tools/testing/selftests/drivers/net/stats.py
>>> Would be great to have all future netdev family extensions accompanied
>>> by tests which can run both on real HW and netdevsim.
>>
>> Okay, I will write tests for the new extensions here.
