[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1759bca6-4511-6cd9-ab5d-8c9c30e5db67@oracle.com>
Date: Tue, 2 Jul 2019 14:18:14 -0700
From: santosh.shilimkar@...cle.com
To: Gerd Rausch <gerd.rausch@...cle.com>, netdev@...r.kernel.org
Cc: David Miller <davem@...emloft.net>
Subject: Re: [PATCH net-next 3/7] net/rds: Wait for the FRMR_IS_FREE (or
FRMR_IS_STALE) transition after posting IB_WR_LOCAL_INV
On 7/2/19 2:05 PM, Gerd Rausch wrote:
> On 02/07/2019 09.49, santosh.shilimkar@...cle.com wrote:
>> On 7/1/19 10:11 PM, Gerd Rausch wrote:
>>> For the registration work-requests there is a benefit to wait a short
>>> amount of time only (the trade-off described in patch #1 of this series).
>>>
>> Actually we should just switch this code to what Avinash has
>> finally made in downstream code. That keeps the RDS_GET_MR
>> semantics and makes sure MR is really valid before handing over
>> the key to userland. There is no need for any timeout
>> for registration case.
>>
>
> What do you call "RDS_GET_MR" semantics?
>
Its a blocking socket call. Meaning after this call return to the
user, the key must be valid. With async registration that can't be
guaranteed.
> The purpose of waiting for a IB_WR_REG_MR request to complete
> (inside rds_ib_post_reg_frmr) is in fact to make sure
> the memory region is valid.
>
> Regardless of this being true after a specific time-out,
> or an infinite timeout.
>
> For the non-infinite time-out case, there is a check if the request
> was handled by the firmware.
>
> And if a time-out occurred and the firmware didn't handle the request,
> function "rds_ib_post_reg_frmr" will return -EBUSY.
>
>>> Actually, no:
>>> Socket option RDS_GET_MR wasn't even in the code-path of the
>>> tests I performed:
>>>
>>> It were there RDS_CMSG_RDMA_MAP / RDS_CMSG_RDMA_DEST control
>>> messages that ended up calling '__rds_rdma_map".
>>>
>> What option did you use ? Default option with rds-stress is
>> RDS_GET_MR and hence the question.
>>
>
> Not true!:
Its other way round. Thanks for info so default its using inline
registration instead of explicit call.
> How is socket option RDS_GET_MR special with regards to this proposed fix?
>
>>> I don't understand, please elaborate:
>>> a) Are you saying this issue should not be fixed?
>>> b) Or are you suggesting to replace this fix with a different fix?
>>> If it's the later, please point out what you have in mind.
>>> c) ???
>>>
>> All am saying is the code got changed for good reason and that changed
>> code makes some of these race conditions possibly not applicable.
>
> I don't understand this. Please elaborate.
>
>> So instead of these timeout fixes, am suggesting to use that
>> code as fix. At least test it with those changes and see whats
>> the behavior.
>>
>
> Are you suggesting to
> a) Not fix this bug right now and wait until some later point in time
When did I say that ? I said have you explored alternate approach to
fix the issue and if not could you try it out.
> b) Use a different fix. If you've got a different fix, please share.
>
I don't but its a review of the fix and possible alternate needs to
be discussed. It is not like take my fix or provide an alternate fix.
> And besides these options, is there anything wrong with this fix
> (other than the discussion of what the timeout value ought to be,
> which we can address)?
>
That timeout is a problem because it doesn't guarantee the failure
of operation since its an asyn operation for registration. Instead
of timing out if you poll the CQ for that operation completion, it
makes it full proof. That is the change Avinash has done iirc and
am requesting to look at that fix.
Other 5 fixes from the series looks fine.
Regards,
Santosh
Regards,
Santosh
Powered by blists - more mailing lists