netdev - Re: [PATCH net 2/2] pds_core: Add a retry mechanism when the adminq is full

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <fb796fb9-90d9-4618-a228-e4a8aee8bde6@amd.com>
Date: Fri, 31 Jan 2025 11:25:24 -0800
From: Brett Creeley <bcreeley@....com>
To: Jakub Kicinski <kuba@...nel.org>, Shannon Nelson <shannon.nelson@....com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, andrew+netdev@...n.ch,
 edumazet@...gle.com, pabeni@...hat.com, brett.creeley@....com
Subject: Re: [PATCH net 2/2] pds_core: Add a retry mechanism when the adminq
 is full

On 1/29/2025 7:03 PM, Jakub Kicinski wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> On Tue, 28 Jan 2025 16:43:37 -0800 Shannon Nelson wrote:
>> If the adminq is full, the driver reports failure when trying to post
>> new adminq commands. This is a bit aggressive and unexpected because
>> technically the adminq post didn't fail in this case, it was just full.
>> To harden this path add support for a bounded retry mechanism.
>>
>> It's possible some commands take longer than expected, maybe hundreds
>> of milliseconds or seconds due to other processing on the device side,
>> so to further reduce the chance of failure due to adminq full increase
>> the PDS_CORE_DEVCMD_TIMEOUT from 5 to 10 seconds.
>>
>> The caller of pdsc_adminq_post() may still see -EAGAIN reported if the
>> space in the adminq never freed up. In this case they can choose to
>> call the function again or fail. For now, no callers will retry.
> 
> How about a semaphore? You can initialize it to the number of slots
> in the queue, and use down_timeout() if you want the 10 sec timeout?

After spending time digging into it a bit more, I think that this is 
probably the best long term solution.

It seems like we could refactor and replace the pds_core's adminq_refcnt 
that was originally introduced to resolve race conditions related to the 
adminq use/teardown between various client drivers (i.e. vfio/vdpa) with 
a semaphore(). This would solve both the race condition issues mentioned 
above and also the adminq overflow issue in this series.

However, I'm hoping you can accept this v1 solution as the fix for net 
because it does solve a problem and is a simple solution.
In the meantime we can commit to working on a refactor to use a 
semaphore/down_timeout() instead of the adminq_refcnt and this 
sleep/-EAGAIN mechanism for the long term solution that gets pushed to 
net-next. Ideally we do this right now, but it's a bit more of a 
refactor than we feel comfortable with for net.

Thanks,

Brett