[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <17122432-1d05-7a88-5d06-840cd69e57e9@talpey.com>
Date: Thu, 30 Dec 2021 21:32:06 -0500
From: Tom Talpey <tom@...pey.com>
To: "lizhijian@...itsu.com" <lizhijian@...itsu.com>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"zyjzyj2000@...il.com" <zyjzyj2000@...il.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
"aharonl@...dia.com" <aharonl@...dia.com>,
"leon@...nel.org" <leon@...nel.org>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mbloch@...dia.com" <mbloch@...dia.com>,
"liweihang@...wei.com" <liweihang@...wei.com>,
"liangwenpeng@...wei.com" <liangwenpeng@...wei.com>,
"yangx.jy@...itsu.com" <yangx.jy@...itsu.com>,
"rpearsonhpe@...il.com" <rpearsonhpe@...il.com>,
"y-goto@...itsu.com" <y-goto@...itsu.com>
Subject: Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution
in responder side
On 12/30/2021 8:37 PM, lizhijian@...itsu.com wrote
>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>> +{
>>> + int err;
>>> + int bytes;
>>> + u8 *va;
>>> + struct rxe_map **map;
>>> + struct rxe_phys_buf *buf;
>>> + int m;
>>> + int i;
>>> + size_t offset;
>>> +
>>> + if (length == 0)
>>> + return 0;
>>
>> The length is only relevant when the flush type is "Memory Region
>> Range".
>>
>> When the flush type is "Memory Region", the entire region must be
>> flushed successfully before completing the operation.
>
> Yes, currently, the length has been expanded to the MR's length in such case.
I'll dig deeper, but this is not immediately clear in this context.
A zero-length operation is however valid, even though it's a no-op,
the application may use it to ensure certain ordering constraints.
Will comment later if I have a specific concern.
>>> +
>>> + if (mr->type == IB_MR_TYPE_DMA) {
>>> + arch_wb_cache_pmem((void *)iova, length);
>>> + return 0;
>>> + }
>>
>> Are dmamr's supported for remote access? I thought that was
>> prevented on first principles now. I might suggest not allowing
>> them to be flushed in any event. There is no length restriction,
>> and it's a VERY costly operation. At a minimum, protect this
>> closely.
> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
> Thanks for the suggestion.
Well, be careful when following semantics from local behaviors. When you
are processing a command from the wire, extreme caution is needed.
ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!
Sorry for shouting. :)
>>> +
>>> + WARN_ON_ONCE(!mr->cur_map_set);
>>
>> The WARN is rather pointless because the code will crash just
>> seven lines below.
>>
>>> +
>>> + err = mr_check_range(mr, iova, length);
>>> + if (err) {
>>> + err = -EFAULT;
>>> + goto err1;
>>> + }
>>> +
>>> + lookup_iova(mr, iova, &m, &i, &offset);
>>> +
>>> + map = mr->cur_map_set->map + m;
>>> + buf = map[0]->buf + i;
>>> +
>>> + while (length > 0) {
>>> + va = (u8 *)(uintptr_t)buf->addr + offset;
>>> + bytes = buf->size - offset;
>>> +
>>> + if (bytes > length)
>>> + bytes = length;
>>> +
>>> + arch_wb_cache_pmem(va, bytes);
>>> +
>>> + length -= bytes;
>>> +
>>> + offset = 0;
>>> + buf++;
>>> + i++;
>>> +
>>> + if (i == RXE_BUF_PER_MAP) {
>>> + i = 0;
>>> + map++;
>>> + buf = map[0]->buf;
>>> + }
>>> + }
>>> +
>>> + return 0;
>>> +
>>> +err1:
>>> + return err;
>>> +}
>>> +
>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>> + struct rxe_pkt_info *pkt)
>>> +{
>>> + u64 length = 0, start = qp->resp.va;
>>> + u32 sel = feth_sel(pkt);
>>> + u32 plt = feth_plt(pkt);
>>> + struct rxe_mr *mr = qp->resp.mr;
>>> +
>>> + if (sel == IB_EXT_SEL_MR_RANGE)
>>> + length = qp->resp.length;
>>> + else if (sel == IB_EXT_SEL_MR_WHOLE)
>>> + length = mr->cur_map_set->length;
>>
>> I'm going to have to think about these
>
> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
I'll review again when you decide.
>>> +
>>> + if (plt == IB_EXT_PLT_PERSIST) {
>>> + nvdimm_flush_iova(mr, start, length);
>>> + wmb(); // clwb follows by a sfence
>>> + } else if (plt == IB_EXT_PLT_GLB_VIS)
>>> + wmb(); // sfence is enough
>>
>> The persistence and global visibility bits are not mutually
>> exclusive,
> My bad, it ever appeared in my mind. o(╯□╰)o
>
>
>
>
>> and in fact persistence does not imply global
>> visibility in some platforms.
> If so, and per the SPEC, why not
> if (plt & IB_EXT_PLT_PERSIST)
> do_somethingA();
> if (plt & IB_EXT_PLT_GLB_VIS)
> do_somethingB();
At the simplest, yes. But there are many subtle other possibilities.
The global visibility is oriented toward supporting distributed
shared memory workloads, or publish/subscribe on high scale systems.
It's mainly about ensuring that all devices and CPUs see the data,
something ordinary RDMA Write does not guarantee. This often means
flushing write pipelines, possibly followed by invalidating caches.
The persistence, of course, is about ensuring the data is stored. This
is entirely different from making it visible.
In fact, you often want to optimize the two scenarios together, since
these operations are all about optimizing latency. The choice is going
to depend greatly on the behavior of the platform itself. For example,
certain caches provide persistence when batteries are present. So,
providing persistence and providing visibility are one and the same.
No need to do that twice.
On the other hand, some systems may provide persistence only after a
significant cost, such as writing into flash from a DRAM buffer, or
when an Optane DIMM becomes overloaded from continuous writes and
begins to throttle them. The flush may need some rather intricate
processing, to avoid deadlock.
Your code does not appear to envision these, so the simple way might
be best for now. But you should consider other situations.
>> Second, the "clwb" and "sfence" comments are completely
>> Intel-specific.
> good catch.
>
>
>> What processing will be done on other
>> processor platforms???
>
> I didn't dig other ARCH yet but INTEL.
> In this function, i think i just need to call the higher level wrapper, like wmb() and
> arch_wb_cache_pmem are enough, right ?
Well, higher level wrappers may signal errors, for example if they're
not available or unreliable, and you will need to handle them. Also,
they may block. Is that ok in this context?
You need to get this right on all platforms which will run this code.
It is not acceptable to guess at whether the data is in the required
state before completing the RDMA_FLUSH. If this is not guaranteed,
the operation must raise the required error. To do anything else will
violate the protocol contract, and the remote application may fail.
Tom.
Powered by blists - more mailing lists