[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <fb796fb9-90d9-4618-a228-e4a8aee8bde6@amd.com>
Date: Fri, 31 Jan 2025 11:25:24 -0800
From: Brett Creeley <bcreeley@....com>
To: Jakub Kicinski <kuba@...nel.org>, Shannon Nelson <shannon.nelson@....com>
Cc: netdev@...r.kernel.org, davem@...emloft.net, andrew+netdev@...n.ch,
edumazet@...gle.com, pabeni@...hat.com, brett.creeley@....com
Subject: Re: [PATCH net 2/2] pds_core: Add a retry mechanism when the adminq
is full
On 1/29/2025 7:03 PM, Jakub Kicinski wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Tue, 28 Jan 2025 16:43:37 -0800 Shannon Nelson wrote:
>> If the adminq is full, the driver reports failure when trying to post
>> new adminq commands. This is a bit aggressive and unexpected because
>> technically the adminq post didn't fail in this case, it was just full.
>> To harden this path add support for a bounded retry mechanism.
>>
>> It's possible some commands take longer than expected, maybe hundreds
>> of milliseconds or seconds due to other processing on the device side,
>> so to further reduce the chance of failure due to adminq full increase
>> the PDS_CORE_DEVCMD_TIMEOUT from 5 to 10 seconds.
>>
>> The caller of pdsc_adminq_post() may still see -EAGAIN reported if the
>> space in the adminq never freed up. In this case they can choose to
>> call the function again or fail. For now, no callers will retry.
>
> How about a semaphore? You can initialize it to the number of slots
> in the queue, and use down_timeout() if you want the 10 sec timeout?
After spending time digging into it a bit more, I think that this is
probably the best long term solution.
It seems like we could refactor and replace the pds_core's adminq_refcnt
that was originally introduced to resolve race conditions related to the
adminq use/teardown between various client drivers (i.e. vfio/vdpa) with
a semaphore(). This would solve both the race condition issues mentioned
above and also the adminq overflow issue in this series.
However, I'm hoping you can accept this v1 solution as the fix for net
because it does solve a problem and is a simple solution.
In the meantime we can commit to working on a refactor to use a
semaphore/down_timeout() instead of the adminq_refcnt and this
sleep/-EAGAIN mechanism for the long term solution that gets pushed to
net-next. Ideally we do this right now, but it's a bit more of a
refactor than we feel comfortable with for net.
Thanks,
Brett
Powered by blists - more mailing lists