[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a751109c-ffbc-c939-c4d8-ffff8d85b017@talpey.com>
Date: Tue, 4 Jan 2022 11:02:45 -0500
From: Tom Talpey <tom@...pey.com>
To: "lizhijian@...itsu.com" <lizhijian@...itsu.com>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"zyjzyj2000@...il.com" <zyjzyj2000@...il.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
"aharonl@...dia.com" <aharonl@...dia.com>,
"leon@...nel.org" <leon@...nel.org>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mbloch@...dia.com" <mbloch@...dia.com>,
"liangwenpeng@...wei.com" <liangwenpeng@...wei.com>,
"yangx.jy@...itsu.com" <yangx.jy@...itsu.com>,
"rpearsonhpe@...il.com" <rpearsonhpe@...il.com>,
"y-goto@...itsu.com" <y-goto@...itsu.com>
Subject: Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution
in responder side
On 1/4/2022 3:51 AM, lizhijian@...itsu.com wrote:
>
>
> On 31/12/2021 10:32, Tom Talpey wrote:
>>
>> On 12/30/2021 8:37 PM, lizhijian@...itsu.com wrote
>>>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>>>> +{
>>>>> + int err;
>>>>> + int bytes;
>>>>> + u8 *va;
>>>>> + struct rxe_map **map;
>>>>> + struct rxe_phys_buf *buf;
>>>>> + int m;
>>>>> + int i;
>>>>> + size_t offset;
>>>>> +
>>>>> + if (length == 0)
>>>>> + return 0;
>>>>
>>>> The length is only relevant when the flush type is "Memory Region
>>>> Range".
>>>>
>>>> When the flush type is "Memory Region", the entire region must be
>>>> flushed successfully before completing the operation.
>>>
>>> Yes, currently, the length has been expanded to the MR's length in such case.
>>
>> I'll dig deeper, but this is not immediately clear in this context.
>> A zero-length operation is however valid, even though it's a no-op,
>
> If it's no-op, what shall we do in this case.
It's a no-op only in the case that the flush has the MRR (range)
flag set. When the Region flag is set, the length is ignored. As
you point out, the length is determined in process_flush().
I think the confusion arises because the routine is being used
for both cases, and the zero-length case is only valid for one.
To me, it would make more sense to make this test before calling
it. Something like:
+static enum resp_states process_flush(struct rxe_qp *qp,
+ struct rxe_pkt_info *pkt)
+{
+ u64 length = 0, start = qp->resp.va;
+ u32 sel = feth_sel(pkt);
+ u32 plt = feth_plt(pkt);
+ struct rxe_mr *mr = qp->resp.mr;
+
+ if (sel == IB_EXT_SEL_MR_RANGE)
+ length = qp->resp.length;
+ else if (sel == IB_EXT_SEL_MR_WHOLE)
{
+ length = mr->cur_map_set->length;
WARN_ON(length == 0);
}
if (length > 0) {
// there may be optimizations possible here...
if (plt & IB_EXT_PLT_PERSIST)
(flush to persistence)
if (plt & IB_EXT_PLT_GLB_VIS)
(flush to global visibility)
}
+ /* set RDMA READ response of zero */
+ qp->resp.resid = 0;
+
+ return RESPST_READ_REPLY;
+}
Tom.
>
>
>> the application may use it to ensure certain ordering constraints.
>> Will comment later if I have a specific concern.
>
> kindly welcome :)
>
>
>
>
>>
>>>>> +
>>>>> + if (mr->type == IB_MR_TYPE_DMA) {
>>>>> + arch_wb_cache_pmem((void *)iova, length);
>>>>> + return 0;
>>>>> + }
>>>>
>>>> Are dmamr's supported for remote access? I thought that was
>>>> prevented on first principles now. I might suggest not allowing
>>>> them to be flushed in any event. There is no length restriction,
>>>> and it's a VERY costly operation. At a minimum, protect this
>>>> closely.
>>> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
>>> Thanks for the suggestion.
>>
>> Well, be careful when following semantics from local behaviors. When you
>> are processing a command from the wire, extreme caution is needed.
>> ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!
>>
>> Sorry for shouting. :)
>
> Never mind :)
>
>
>
>
>
>>
>>>>> +
>>>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>>>> + struct rxe_pkt_info *pkt)
>>>>> +{
>>>>> + u64 length = 0, start = qp->resp.va;
>>>>> + u32 sel = feth_sel(pkt);
>>>>> + u32 plt = feth_plt(pkt);
>>>>> + struct rxe_mr *mr = qp->resp.mr;
>>>>> +
>>>>> + if (sel == IB_EXT_SEL_MR_RANGE)
>>>>> + length = qp->resp.length;
>>>>> + else if (sel == IB_EXT_SEL_MR_WHOLE)
>>>>> + length = mr->cur_map_set->length;
>>>>
>>>> I'm going to have to think about these
>>>
>>> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
>>
>> I'll review again when you decide.
>>>>> +
>>>>> + if (plt == IB_EXT_PLT_PERSIST) {
>>>>> + nvdimm_flush_iova(mr, start, length);
>>>>> + wmb(); // clwb follows by a sfence
>>>>> + } else if (plt == IB_EXT_PLT_GLB_VIS)
>>>>> + wmb(); // sfence is enough
>>>>
>>>> The persistence and global visibility bits are not mutually
>>>> exclusive,
>>> My bad, it ever appeared in my mind. o(╯□╰)o
>>>
>>>
>>>
>>>
>>>> and in fact persistence does not imply global
>>>> visibility in some platforms.
>>> If so, and per the SPEC, why not
>>> if (plt & IB_EXT_PLT_PERSIST)
>>> do_somethingA();
>>> if (plt & IB_EXT_PLT_GLB_VIS)
>>> do_somethingB();
>>
>> At the simplest, yes. But there are many subtle other possibilities.
>>
>> The global visibility is oriented toward supporting distributed
>> shared memory workloads, or publish/subscribe on high scale systems.
>> It's mainly about ensuring that all devices and CPUs see the data,
>> something ordinary RDMA Write does not guarantee. This often means
>> flushing write pipelines, possibly followed by invalidating caches.
>>
>> The persistence, of course, is about ensuring the data is stored. This
>> is entirely different from making it visible.
>>
>> In fact, you often want to optimize the two scenarios together, since
>> these operations are all about optimizing latency. The choice is going
>> to depend greatly on the behavior of the platform itself. For example,
>> certain caches provide persistence when batteries are present. So,
>> providing persistence and providing visibility are one and the same.
>> No need to do that twice.
>>
>> On the other hand, some systems may provide persistence only after a
>> significant cost, such as writing into flash from a DRAM buffer, or
>> when an Optane DIMM becomes overloaded from continuous writes and
>> begins to throttle them. The flush may need some rather intricate
>> processing, to avoid deadlock.
>>
>> Your code does not appear to envision these, so the simple way might
>> be best for now. But you should consider other situations.
>
> Thanks a lot for your detailed explanation.
>
>
>
>>
>>>> Second, the "clwb" and "sfence" comments are completely
>>>> Intel-specific.
>>> good catch.
>>>
>>>
>>>> What processing will be done on other
>>>> processor platforms???
>>>
>>> I didn't dig other ARCH yet but INTEL.
>>> In this function, i think i just need to call the higher level wrapper, like wmb() and
>>> arch_wb_cache_pmem are enough, right ?
>>
>> Well, higher level wrappers may signal errors, for example if they're
>> not available or unreliable, and you will need to handle them. Also,
>> they may block. Is that ok in this context?
>
> Good question.
>
> My bad, i forgot to check to return value of nvdimm_flush_iova() previously.
>
> But if you mean we should guarantee arch_wb_cache_pmem() and wmb(), i have not
> idea yet.
>
> arch_wb_cache_pmem() and wmb(), What i'm currently using are just to hide
> the assembly instructions on different architectures. And they are void return.
>
> I wonder if we can always assume they work always When the code reaches them.
> Since all current local flushing to pmem do something like that AFAIK(Feel free
> to correct me if i'm wrong)
>
> Thanks
> Zhijian
>
>>
>> You need to get this right on all platforms which will run this code.
>> It is not acceptable to guess at whether the data is in the required
>> state before completing the RDMA_FLUSH. If this is not guaranteed,
>> the operation must raise the required error. To do anything else will
>> violate the protocol contract, and the remote application may fail.
>
>
>>
>> Tom.
Powered by blists - more mailing lists