lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <fb796fb9-90d9-4618-a228-e4a8aee8bde6@amd.com>
Date: Fri, 31 Jan 2025 11:25:24 -0800
From: Brett Creeley <bcreeley@....com>
To: Jakub Kicinski <kuba@...nel.org>, Shannon Nelson <shannon.nelson@....com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, andrew+netdev@...n.ch,
 edumazet@...gle.com, pabeni@...hat.com, brett.creeley@....com
Subject: Re: [PATCH net 2/2] pds_core: Add a retry mechanism when the adminq
 is full



On 1/29/2025 7:03 PM, Jakub Kicinski wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> On Tue, 28 Jan 2025 16:43:37 -0800 Shannon Nelson wrote:
>> If the adminq is full, the driver reports failure when trying to post
>> new adminq commands. This is a bit aggressive and unexpected because
>> technically the adminq post didn't fail in this case, it was just full.
>> To harden this path add support for a bounded retry mechanism.
>>
>> It's possible some commands take longer than expected, maybe hundreds
>> of milliseconds or seconds due to other processing on the device side,
>> so to further reduce the chance of failure due to adminq full increase
>> the PDS_CORE_DEVCMD_TIMEOUT from 5 to 10 seconds.
>>
>> The caller of pdsc_adminq_post() may still see -EAGAIN reported if the
>> space in the adminq never freed up. In this case they can choose to
>> call the function again or fail. For now, no callers will retry.
> 
> How about a semaphore? You can initialize it to the number of slots
> in the queue, and use down_timeout() if you want the 10 sec timeout?

After spending time digging into it a bit more, I think that this is 
probably the best long term solution.

It seems like we could refactor and replace the pds_core's adminq_refcnt 
that was originally introduced to resolve race conditions related to the 
adminq use/teardown between various client drivers (i.e. vfio/vdpa) with 
a semaphore(). This would solve both the race condition issues mentioned 
above and also the adminq overflow issue in this series.

However, I'm hoping you can accept this v1 solution as the fix for net 
because it does solve a problem and is a simple solution.
In the meantime we can commit to working on a refactor to use a 
semaphore/down_timeout() instead of the adminq_refcnt and this 
sleep/-EAGAIN mechanism for the long term solution that gets pushed to 
net-next. Ideally we do this right now, but it's a bit more of a 
refactor than we feel comfortable with for net.

Thanks,

Brett


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ