linux-kernel - Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side

Message-ID: <fd561077-358e-e38d-a7d0-5c61593eff6a@fujitsu.com>
Date:   Fri, 31 Dec 2021 01:37:45 +0000
From:   "lizhijian@...itsu.com" <lizhijian@...itsu.com>
To:     Tom Talpey <tom@...pey.com>,
        "lizhijian@...itsu.com" <lizhijian@...itsu.com>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        "zyjzyj2000@...il.com" <zyjzyj2000@...il.com>,
        "jgg@...pe.ca" <jgg@...pe.ca>,
        "aharonl@...dia.com" <aharonl@...dia.com>,
        "leon@...nel.org" <leon@...nel.org>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "mbloch@...dia.com" <mbloch@...dia.com>,
        "liweihang@...wei.com" <liweihang@...wei.com>,
        "liangwenpeng@...wei.com" <liangwenpeng@...wei.com>,
        "yangx.jy@...itsu.com" <yangx.jy@...itsu.com>,
        "rpearsonhpe@...il.com" <rpearsonhpe@...il.com>,
        "y-goto@...itsu.com" <y-goto@...itsu.com>
Subject: Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution
 in responder side



On 31/12/2021 06:18, Tom Talpey wrote:
> On 12/28/2021 3:07 AM, Li Zhijian wrote:
>> In contrast to other opcodes, after a series of sanity checks, the FLUSH
>> opcode performs a Placement Type check before it actually does the FLUSH
>> operation. The responder also replies with NAK "Remote Access Error" if it
>> finds a placement type violation.
>>
>> We will persist data via arch_wb_cache_pmem(), which could be
>> architecture specific.
>>
>> After execution, the responder reports success by replying with an
>> RDMA READ response of zero size.
>>
>> Signed-off-by: Li Zhijian <lizhijian@...fujitsu.com>
>> ---
>>   drivers/infiniband/sw/rxe/rxe_hdr.h  |  28 ++++++
>>   drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
>>   drivers/infiniband/sw/rxe/rxe_mr.c   |   4 +-
>>   drivers/infiniband/sw/rxe/rxe_resp.c | 131 ++++++++++++++++++++++++++-
>>   include/uapi/rdma/ib_user_verbs.h    |  10 ++
>>   5 files changed, 169 insertions(+), 6 deletions(-)
>>
>
> <snip>
>
>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>> +{
>> +    int            err;
>> +    int            bytes;
>> +    u8            *va;
>> +    struct rxe_map        **map;
>> +    struct rxe_phys_buf    *buf;
>> +    int            m;
>> +    int            i;
>> +    size_t            offset;
>> +
>> +    if (length == 0)
>> +        return 0;
>
> The length is only relevant when the flush type is "Memory Region
> Range".
>
> When the flush type is "Memory Region", the entire region must be
> flushed successfully before completing the operation.

Yes. Currently the length is expanded to the MR's length in that case.


>
>> +
>> +    if (mr->type == IB_MR_TYPE_DMA) {
>> +        arch_wb_cache_pmem((void *)iova, length);
>> +        return 0;
>> +    }
>
> Are dmamr's supported for remote access? I thought that was
> prevented on first principles now. I might suggest not allowing
> them to be flushed in any event. There is no length restriction,
> and it's a VERY costly operation. At a minimum, protect this
> closely.
Indeed, I wasn't confident about this; the main logic comes from rxe_mr_copy().
Thanks for the suggestion.



>
>> +
>> +    WARN_ON_ONCE(!mr->cur_map_set);
>
> The WARN is rather pointless because the code will crash just
> seven lines below.
>
>> +
>> +    err = mr_check_range(mr, iova, length);
>> +    if (err) {
>> +        err = -EFAULT;
>> +        goto err1;
>> +    }
>> +
>> +    lookup_iova(mr, iova, &m, &i, &offset);
>> +
>> +    map = mr->cur_map_set->map + m;
>> +    buf    = map[0]->buf + i;
>> +
>> +    while (length > 0) {
>> +        va    = (u8 *)(uintptr_t)buf->addr + offset;
>> +        bytes    = buf->size - offset;
>> +
>> +        if (bytes > length)
>> +            bytes = length;
>> +
>> +        arch_wb_cache_pmem(va, bytes);
>> +
>> +        length    -= bytes;
>> +
>> +        offset    = 0;
>> +        buf++;
>> +        i++;
>> +
>> +        if (i == RXE_BUF_PER_MAP) {
>> +            i = 0;
>> +            map++;
>> +            buf = map[0]->buf;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +
>> +err1:
>> +    return err;
>> +}
>> +
>> +static enum resp_states process_flush(struct rxe_qp *qp,
>> +                       struct rxe_pkt_info *pkt)
>> +{
>> +    u64 length = 0, start = qp->resp.va;
>> +    u32 sel = feth_sel(pkt);
>> +    u32 plt = feth_plt(pkt);
>> +    struct rxe_mr *mr = qp->resp.mr;
>> +
>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>> +        length = qp->resp.length;
>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>> +        length = mr->cur_map_set->length;
>
> I'm going to have to think about these

Yes, this reminds me that we should also consider adjusting the start of the iova to the MR's start.
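
Roughly, the range selection could look like the standalone userspace sketch below. This is only an illustration: `struct mr_info`, `struct flush_range`, `enum flush_sel` and `flush_range_for()` are made-up names for this example, not the rxe definitions, and the real code would take these values from the QP and MR.

```c
#include <stdint.h>

/* Hypothetical stand-ins for this sketch only; not the rxe definitions. */
enum flush_sel { SEL_MR_RANGE, SEL_MR_WHOLE };

struct mr_info {
	uint64_t iova;   /* start of the registered region */
	uint64_t length; /* length of the registered region */
};

struct flush_range {
	uint64_t start;
	uint64_t length;
};

/* Pick the span to flush from the selector carried in the request. */
static struct flush_range flush_range_for(enum flush_sel sel,
					  const struct mr_info *mr,
					  uint64_t req_va, uint64_t req_len)
{
	struct flush_range r;

	if (sel == SEL_MR_WHOLE) {
		/* Whole region: ignore the requested va/length entirely. */
		r.start = mr->iova;
		r.length = mr->length;
	} else {
		/* Range: use exactly what the requester asked for. */
		r.start = req_va;
		r.length = req_len;
	}
	return r;
}
```

With both start and length taken from the MR in the whole-region case, a requester-supplied va can no longer shift the flushed span.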


>> +
>> +    if (plt == IB_EXT_PLT_PERSIST) {
>> +        nvdimm_flush_iova(mr, start, length);
>> +        wmb(); // clwb follows by a sfence
>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>> +        wmb(); // sfence is enough
>
> The persistence and global visibility bits are not mutually
> exclusive,
My bad, that never occurred to me. o(╯□╰)o




> and in fact persistence does not imply global
> visibility in some platforms. 
If so, and per the spec, why not:
    if (plt & IB_EXT_PLT_PERSIST)
       do_somethingA();
    if (plt & IB_EXT_PLT_GLB_VIS)
       do_somethingB();
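
To make the difference from an if/else-if chain concrete, here is a small userspace sketch. It is not kernel code: the bit values are placeholders chosen for the example (the real ones come from the patch's uapi header), and `do_persist()` / `do_visibility()` are hypothetical stubs that only count calls.

```c
#include <stdint.h>

/* Placeholder bit values for this sketch; not the values from the patch. */
#define PLT_GLB_VIS  (1u << 0)
#define PLT_PERSIST  (1u << 1)

/* Counters standing in for the real work so the behaviour can be checked. */
static int persist_calls;
static int visibility_calls;

/* Stub for the write-back of the range (assumed, e.g. arch_wb_cache_pmem()). */
static void do_persist(void)
{
	persist_calls++;
}

/* Stub for whatever makes the data globally visible (assumed, e.g. a barrier). */
static void do_visibility(void)
{
	visibility_calls++;
}

/* Test each placement bit on its own, so both can fire for one request. */
static void handle_placement(uint32_t plt)
{
	if (plt & PLT_PERSIST)
		do_persist();
	if (plt & PLT_GLB_VIS)
		do_visibility();
}
```

A request with both bits set runs both handlers here, whereas an else-if chain would skip the second one.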



> They must be tested and
> processed individually.
>
>     if (plt & IB_EXT_PLT_PERSIST)
>         ...
>     else if (plt & IB_EXT_PLT_GLB_VIS)
>         ..
>
> Second, the "clwb" and "sfence" comments are completely
> Intel-specific. 
Good catch.


> What processing will be done on other
> processor platforms???

I haven't looked into architectures other than Intel yet.
In this function, I think calling the higher-level wrappers, such as wmb() and
arch_wb_cache_pmem(), should be enough, right?

Again, thank you.




>
> Tom.