[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fd561077-358e-e38d-a7d0-5c61593eff6a@fujitsu.com>
Date: Fri, 31 Dec 2021 01:37:45 +0000
From: "lizhijian@...itsu.com" <lizhijian@...itsu.com>
To: Tom Talpey <tom@...pey.com>,
"lizhijian@...itsu.com" <lizhijian@...itsu.com>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"zyjzyj2000@...il.com" <zyjzyj2000@...il.com>,
"jgg@...pe.ca" <jgg@...pe.ca>,
"aharonl@...dia.com" <aharonl@...dia.com>,
"leon@...nel.org" <leon@...nel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mbloch@...dia.com" <mbloch@...dia.com>,
"liweihang@...wei.com" <liweihang@...wei.com>,
"liangwenpeng@...wei.com" <liangwenpeng@...wei.com>,
"yangx.jy@...itsu.com" <yangx.jy@...itsu.com>,
"rpearsonhpe@...il.com" <rpearsonhpe@...il.com>,
"y-goto@...itsu.com" <y-goto@...itsu.com>
Subject: Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution
in responder side
On 31/12/2021 06:18, Tom Talpey wrote:
> On 12/28/2021 3:07 AM, Li Zhijian wrote:
>> In contrast to other opcodes, after a series of sanity checking, FLUSH
>> opcode will do a Placement Type checking before it really do the FLUSH
>> operation. Responder will also reply NAK "Remote Access Error" if it
>> found a placement type violation.
>>
>> We will persist data via arch_wb_cache_pmem(), which could be
>> architecture specific.
>>
>> After the execution, responder would reply a responded successfully by
>> RDMA READ response of zero size.
>>
>> Signed-off-by: Li Zhijian <lizhijian@...fujitsu.com>
>> ---
>> drivers/infiniband/sw/rxe/rxe_hdr.h | 28 ++++++
>> drivers/infiniband/sw/rxe/rxe_loc.h | 2 +
>> drivers/infiniband/sw/rxe/rxe_mr.c | 4 +-
>> drivers/infiniband/sw/rxe/rxe_resp.c | 131 ++++++++++++++++++++++++++-
>> include/uapi/rdma/ib_user_verbs.h | 10 ++
>> 5 files changed, 169 insertions(+), 6 deletions(-)
>>
>
> <snip>
>
>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>> +{
>> + int err;
>> + int bytes;
>> + u8 *va;
>> + struct rxe_map **map;
>> + struct rxe_phys_buf *buf;
>> + int m;
>> + int i;
>> + size_t offset;
>> +
>> + if (length == 0)
>> + return 0;
>
> The length is only relevant when the flush type is "Memory Region
> Range".
>
> When the flush type is "Memory Region", the entire region must be
> flushed successfully before completing the operation.
Yes, currently, the length has been expanded to the MR's length in such case.
>
>> +
>> + if (mr->type == IB_MR_TYPE_DMA) {
>> + arch_wb_cache_pmem((void *)iova, length);
>> + return 0;
>> + }
>
> Are dmamr's supported for remote access? I thought that was
> prevented on first principles now. I might suggest not allowing
> them to be flushed in any event. There is no length restriction,
> and it's a VERY costly operation. At a minimum, protect this
> closely.
Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
Thanks for the suggestion.
>
>> +
>> + WARN_ON_ONCE(!mr->cur_map_set);
>
> The WARN is rather pointless because the code will crash just
> seven lines below.
>
>> +
>> + err = mr_check_range(mr, iova, length);
>> + if (err) {
>> + err = -EFAULT;
>> + goto err1;
>> + }
>> +
>> + lookup_iova(mr, iova, &m, &i, &offset);
>> +
>> + map = mr->cur_map_set->map + m;
>> + buf = map[0]->buf + i;
>> +
>> + while (length > 0) {
>> + va = (u8 *)(uintptr_t)buf->addr + offset;
>> + bytes = buf->size - offset;
>> +
>> + if (bytes > length)
>> + bytes = length;
>> +
>> + arch_wb_cache_pmem(va, bytes);
>> +
>> + length -= bytes;
>> +
>> + offset = 0;
>> + buf++;
>> + i++;
>> +
>> + if (i == RXE_BUF_PER_MAP) {
>> + i = 0;
>> + map++;
>> + buf = map[0]->buf;
>> + }
>> + }
>> +
>> + return 0;
>> +
>> +err1:
>> + return err;
>> +}
>> +
>> +static enum resp_states process_flush(struct rxe_qp *qp,
>> + struct rxe_pkt_info *pkt)
>> +{
>> + u64 length = 0, start = qp->resp.va;
>> + u32 sel = feth_sel(pkt);
>> + u32 plt = feth_plt(pkt);
>> + struct rxe_mr *mr = qp->resp.mr;
>> +
>> + if (sel == IB_EXT_SEL_MR_RANGE)
>> + length = qp->resp.length;
>> + else if (sel == IB_EXT_SEL_MR_WHOLE)
>> + length = mr->cur_map_set->length;
>
> I'm going to have to think about these
Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
>> +
>> + if (plt == IB_EXT_PLT_PERSIST) {
>> + nvdimm_flush_iova(mr, start, length);
>> + wmb(); // clwb follows by a sfence
>> + } else if (plt == IB_EXT_PLT_GLB_VIS)
>> + wmb(); // sfence is enough
>
> The persistence and global visibility bits are not mutually
> exclusive,
My bad, it ever appeared in my mind. o(╯□╰)o
> and in fact persistence does not imply global
> visibility in some platforms.
If so, and per the SPEC, why not
if (plt & IB_EXT_PLT_PERSIST)
do_somethingA();
if (plt & IB_EXT_PLT_GLB_VIS)
do_somethingB();
> They must be tested and
> processed individually.
>
> if (plt & IB_EXT_PLT_PERSIST)
> ...
> else if (plt & IB_EXT_PLT_GLB_VIS)
> ..
>
> Second, the "clwb" and "sfence" comments are completely
> Intel-specific.
good catch.
> What processing will be done on other
> processor platforms???
I didn't dig other ARCH yet but INTEL.
In this function, i think i just need to call the higher level wrapper, like wmb() and
arch_wb_cache_pmem are enough, right ?
Again, thank you.
Thanks.
>
> Tom.
Powered by blists - more mailing lists