Message-ID: <d7ab5505-92e5-888c-a230-77bce3540261@oracle.com>
Date: Tue, 2 Jul 2019 09:49:06 -0700
From: santosh.shilimkar@...cle.com
To: Gerd Rausch <gerd.rausch@...cle.com>, netdev@...r.kernel.org
Cc: David Miller <davem@...emloft.net>
Subject: Re: [PATCH net-next 3/7] net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after posting IB_WR_LOCAL_INV
On 7/1/19 10:11 PM, Gerd Rausch wrote:
> Hi Santosh,
>
> On 01/07/2019 19.28, santosh.shilimkar@...cle.com wrote:
>>>
>> Below. All command timeouts are 60 seconds.
>>
>> enum {
>> MLX4_CMD_TIME_CLASS_A = 60000,
>> MLX4_CMD_TIME_CLASS_B = 60000,
>> MLX4_CMD_TIME_CLASS_C = 60000,
>> };
>>
>
> Thank you for the pointer.
>
>> But having said that, I re-looked at the code you are patching,
>> and that's actually only FRWR code, which is purely work-request
>> based, so this command timeout shouldn't matter.
>>
>
> Which brings us back full circle to the question of
> what the timeout ought to be?
>
> Please keep in mind that prior to this fix,
> the RDS code didn't wait at all:
>
> It simply posted those registration (IB_WR_REG_MR)
> and invalidation (IB_WR_LOCAL_INV)
> work-requests, with no regard to when the firmware
> would execute them.
>
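To spell out what "posting" means here: ib_post_send() only queues the
work-request on the QP and returns immediately; the HCA reports the
actual invalidation later via a completion on the send CQ. A minimal
sketch against the ib_verbs API (illustrative helper name, not the
exact RDS code):

#include <rdma/ib_verbs.h>

/* Sketch: queue an IB_WR_LOCAL_INV on the QP. */
static int post_local_inv_sketch(struct ib_qp *qp, u32 rkey)
{
	struct ib_send_wr wr = { };
	const struct ib_send_wr *bad_wr;

	wr.opcode = IB_WR_LOCAL_INV;
	wr.send_flags = IB_SEND_SIGNALED;	/* ask for a completion */
	wr.ex.invalidate_rkey = rkey;

	/* Returns once the request is queued, long before the HCA
	 * has actually invalidated anything.
	 */
	return ib_post_send(qp, &wr, &bad_wr);
}
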
> Arguably, waiting any amount of time greater than zero
> for the operation to complete is better than not waiting at all.
>
> We can change the timeout to a high value, or even make it infinite
> by using "wait_event" instead of "wait_event_timeout".
>
> For the registration work-requests there is a benefit to waiting only a
> short amount of time (the trade-off described in patch #1 of this series).
>
Actually, we should just switch this code to what Avinash has
finally done in the downstream code. That keeps the RDS_GET_MR
semantics and makes sure the MR is really valid before handing over
the key to userland. There is no need for any timeout
in the registration case.
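
For the invalidation side, the bounded wait under discussion would look
roughly like the sketch below. The fr_inv_done wait-queue, the fr_state
field and the timeout value are assumptions based on this series, not
necessarily the final code:

/* Sketch: wait for the LOCAL_INV completion handler to move the
 * MR out of FRMR_IS_INUSE before the caller proceeds.
 */
static int rds_ib_wait_inv_done_sketch(struct rds_ib_frmr *frmr)
{
	/* Bounded wait: returns 0 on timeout, otherwise the number of
	 * jiffies left.  An unbounded wait_event() would guarantee the
	 * transition but could hang the caller if the completion never
	 * arrives (e.g. a wedged HCA).
	 */
	if (!wait_event_timeout(frmr->fr_inv_done,
				frmr->fr_state != FRMR_IS_INUSE,
				msecs_to_jiffies(10000)))
		return -ETIMEDOUT;

	return frmr->fr_state == FRMR_IS_FREE ? 0 : -EBUSY;
}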
> For de-registration work-requests, it is beneficial to wait
> until they are truly done.
> But: function "rds_ib_unreg_frmr", both prior to and after this change,
> simply moves on after a failed de-registration attempt
> and releases the pages owned by the memory region.
>
> This patch does _not_ change that behavior.
>
>> If the work request fails, then it will lead to flush errors and
>> the MRs will be marked as STALE, so this wait may not be necessary.
>>
>
> This wait is necessary to avoid the 2 scenarios described
> in the commit-log message:
>
> #1) Memory regions bouncing between "drop_list" and "clean_list"
> as items on the "clean_list" aren't really clean until
> their state transitions to "FRMR_IS_FREE".
>
> #2) Prevent an access error as "rds_ib_post_inv" is called
> just prior to de-referencing pages via "__rds_ib_teardown_mr".
> And you certainly don't want those pages populated in the
> HCA's memory-translation-table with full access, while
> the Linux kernel 'thinks' you gave them back already
> and starts re-purposing them.
>
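To make scenario #2 concrete, the ordering the wait is meant to enforce
in the teardown path is roughly the following. rds_ib_post_inv() and
__rds_ib_teardown_mr() are the functions named above; the wait helper,
the function name and the struct layout are assumptions carried over
from the earlier sketch:

/* Sketch: without a wait between posting the invalidation and
 * tearing down the MR, the pages can be handed back to the kernel
 * while the HCA still has them mapped with full access.
 */
static void rds_ib_free_frmr_sketch(struct rds_ib_mr *ibmr)
{
	struct rds_ib_frmr *frmr = &ibmr->u.frmr;

	/* Queue IB_WR_LOCAL_INV; returns before the HCA acts on it. */
	rds_ib_post_inv(ibmr);

	/* The wait this patch adds (see the earlier sketch).  Like
	 * rds_ib_unreg_frmr, teardown proceeds even if the wait fails.
	 */
	rds_ib_wait_inv_done_sketch(frmr);

	/* Only now is it safe to unpin and release the pages. */
	__rds_ib_teardown_mr(ibmr);
}
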
>> The RDS_GET_MR case is what is actually showing the issue you saw,
>> and Avinash has the fix for that in the production kernel.
>
> Actually, no:
> Socket option RDS_GET_MR wasn't even in the code-path of the
> tests I performed:
>
> It was the RDS_CMSG_RDMA_MAP / RDS_CMSG_RDMA_DEST control
> messages that ended up calling "__rds_rdma_map".
>
What option did you use? The default option with rds-stress is
RDS_GET_MR, hence the question.
>>
>> I believe that with that change, the registration issue already
>> becomes a non-issue.
>>
>
> Please explain how that is related to this suggested fix.
>
> I submitted this patch #3 and the others in this series in order
> to fix bugs in the RDS code that is currently shipped with Linux.
>
> It may very well be the case that there are other changes
> that Avinash put into production kernels that would be better
> suited to fix this and other problems.
>
> But that should not eliminate the need to fix what is currently broken.
>
> Fixing what's broken does not preclude replacing the fixed code
> with newer or better versions of the same.
>
>> And as far as invalidation with the proxy QP is concerned, it no longer
>> races with the data path QP.
>>
>
> I don't understand; please elaborate.
>
>> Maybe you can try those changes, if you haven't already, to see if they
>> address the couple of cases where you ended up adding
>> timeouts.
>>
>
> I don't understand; please elaborate:
> a) Are you saying this issue should not be fixed?
> b) Or are you suggesting replacing this fix with a different fix?
> If it's the latter, please point out what you have in mind.
> c) ???
>
All I am saying is that the code was changed for a good reason, and that
changed code possibly makes some of these race conditions no longer
applicable. So instead of these timeout fixes, I am suggesting using
that code as the fix. At least test with those changes and see what
the behavior is.
Regards,
Santosh