lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e6580d85-2c57-4ca1-9846-7af831bfceb7@linux.dev>
Date: Mon, 28 Oct 2024 21:19:10 +0100
From: Zhu Yanjun <yanjun.zhu@...ux.dev>
To: "Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@...itsu.com>,
 "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
 "leon@...nel.org" <leon@...nel.org>, "jgg@...pe.ca" <jgg@...pe.ca>,
 "zyjzyj2000@...il.com" <zyjzyj2000@...il.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "rpearsonhpe@...il.com" <rpearsonhpe@...il.com>,
 "Zhijian Li (Fujitsu)" <lizhijian@...itsu.com>
Subject: Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE

在 2024/10/28 8:59, Daisuke Matsuda (Fujitsu) 写道:
> On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote:
>> 在 2024/10/9 3:58, Daisuke Matsuda 写道:
>>> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
>>> driver, which has been available only in mlx5 driver[1] so far.
>>>
>>> This series has been blocked because of the hang issue of srp 002 test[2],
>>> which was believed to be caused after applying the commit 9b4b7c1f9f54
>>> ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent
>>> on the commit because the ODP feature requires sleeping in kernel space,
>>> and it is impossible with the former tasklet implementation.
>>>
>>> According to the original reporter[3], the hang issue is already gone in
>>> v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
>>> driver is ready to accept this series since there is no longer any reason
>>> to consider reverting back to the old tasklet.
>>>
>>> I omitted some contents like the motive behind this series from the cover-
>>> letter. Please see the cover letter of v3 for more details[5].
>>>
>>> [Overview]
>>> When applications register a memory region(MR), RDMA drivers normally pin
>>> pages in the MR so that physical addresses are never changed during RDMA
>>> communication. This requires the MR to fit in physical memory and
>>> inevitably leads to memory pressure. On the other hand, On-Demand Paging
>>> (ODP) allows applications to register MRs without pinning pages. They are
>>> paged-in when the driver requires and paged-out when the OS reclaims. As a
>>> result, it is possible to register a large MR that does not fit in physical
>>> memory without taking up so much physical memory.
>>>
>>> [How does ODP work?]
>>> "struct ib_umem_odp" is used to manage pages. It is created for each
>>> ODP-enabled MR on its registration. This struct holds a pair of arrays
>>> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
>>> PFNs are stored in the driver page table. They are updated on page-in and
>>> page-out, both of which use the common interfaces in the ib_uverbs layer.
>>>
>>> Page-in can occur when requester, responder or completer access an MR in
>>> order to process RDMA operations. If they find that the pages being
>>> accessed are not present on physical memory or requisite permissions are
>>> not set on the pages, they provoke page fault to make the pages present
>>> with proper permissions and at the same time update the driver page table.
>>> After confirming the presence of the pages, they execute memory access such
>>> as read, write or atomic operations.
>>>
>>> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
>>> update of a file that is being used as an MR). When creating an ODP-enabled
>>> MR, the driver registers an MMU notifier callback. When the kernel issues a
>>> page invalidation notification, the callback is provoked to unmap DMA
>>> addresses and update the driver page table. After that, the kernel releases
>>> the pages.
>>>
>>> [Supported operations]
>>> All traditional operations are supported on RC connection. The new Atomic
>>> write[6] and RDMA Flush[7] operations are not included in this patchset. I
>>> will post them later after this patchset is merged. On UD connection, Send,
>>> Recv, and SRQ-Recv are supported.
>>>
>>> [How to test ODP?]
>>> There are only a few resources available for testing. pyverbs testcases in
>>> rdma-core and perftest[8] are recommendable ones. Other than them, the
>>> ibv_rc_pingpong command can also be used for testing. Note that you may
>>> have to build perftest from upstream because old versions do not handle ODP
>>> capabilities correctly.
>>
>> Thanks a lot. I have tested these patches with perftest. Because ODP (On
>> Demand Paging) is a feature, can you also add some testcases into rdma
>> core? So we can use rdma-core to make tests with this feature of rxe.
> 
> I added Read/Write/Atomics tests two years ago.
> Cf. https://github.com/linux-rdma/rdma-core/pull/1229
> 
> Each of ODP testcases causes page invalidation so that RDMA traffic
> access triggers ODP page-in flow.
> 
> Currently, 7 testcases below can pass on rxe ODP v8 implementation.
>    test_odp_rc_atomic_cmp_and_swp
>    test_odp_rc_atomic_fetch_and_add
>    test_odp_rc_mixed_mr
>    test_odp_rc_rdma_read
>    test_odp_rc_rdma_write
>    test_odp_rc_traffic
>    test_odp_ud_traffic
> The rest 11 tests are just skipped because of lack of capabilities.

Thanks. Run rdma-core, the above tests can also work successfully in my 
test environment.
I am fine with this.

Zhu Yanjun

> 
> Please let me know if you have any suggestions for improvement.
> 
> Thanks,
> Daisuke Matsuda
> 
>>
>> That is, add some testcases in run_tests.py, so use run_tests.py to
>> verify this (ODP) feature on rxe.
>>
>> Thanks,
>> Zhu Yanjun
>>
>>>
>>> The latest ODP tree is available from github:
>>> https://github.com/ddmatsu/linux/tree/odp_v8
>>>
>>> [Future work]
>>> My next work is to enable the new Atomic write[6] and RDMA Flush[7]
>>> operations with ODP. After that, I am going to implement the prefetch
>>> feature. It allows applications to trigger page fault using
>>> ibv_advise_mr(3) to optimize performance. Some existing software like
>>> librpma[9] use this feature. Additionally, I think we can also add the
>>> implicit ODP feature in the future.
>>>
>>> [1] Understanding On Demand Paging (ODP)
>>> https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x
>>>
>>> [2] [bug report] blktests srp/002 hang
>>> https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/
>>>
>>> [3] blktests failures with v6.10-rc1 kernel
>>> https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/
>>>
>>> [4] [00/15] ethernet: Convert from tasklet to BH workqueue
>>> https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/
>>>
>>> [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
>>> https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/
>>>
>>> [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
>>> https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/
>>>
>>> [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
>>> https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/
>>>
>>> [8] linux-rdma/perftest: Infiniband Verbs Performance Tests
>>> https://github.com/linux-rdma/perftest
>>>
>>> [9] librpma: Remote Persistent Memory Access Library
>>> https://github.com/pmem/rpma
>>>
>>> v7->v8:
>>>    1) Dropped the first patch because the same change was made by Bob Pearson.
>>>    cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
>>>    2) Rebased to 6.12.1-rc2
>>>
>>> v6->v7:
>>>    1) Rebased to 6.6.0
>>>    2) Disabled using hugepages with ODP
>>>    3) Addressed comments on v6 from Jason and Zhu
>>>      cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/
>>>
>>> v5->v6:
>>>    Fixed the implementation according to Jason's suggestions
>>>      cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
>>>      cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/
>>>
>>> v4->v5:
>>>    1) Rebased to 6.4.0-rc2+
>>>    2) Changed to schedule all works on responder and completer to workqueue
>>>
>>> v3->v4:
>>>    1) Re-designed functions that access MRs to use the MR xarray.
>>>    2) Rebased onto the latest jgg-for-next tree.
>>>
>>> v2->v3:
>>>    1) Removed a patch that changes the common ib_uverbs layer.
>>>    2) Re-implemented patches for conversion to workqueue.
>>>    3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
>>>    4) Fixed some functions that returned incorrect errors.
>>>    5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
>>>
>>> v1->v2:
>>>    1) Fixed a crash issue reported by Haris Iqbal.
>>>    2) Tried to make lock patters clearer as pointed out by Romanovsky.
>>>    3) Minor clean ups and fixes.
>>>
>>> Daisuke Matsuda (6):
>>>     RDMA/rxe: Make MR functions accessible from other rxe source code
>>>     RDMA/rxe: Move resp_states definition to rxe_verbs.h
>>>     RDMA/rxe: Add page invalidation support
>>>     RDMA/rxe: Allow registering MRs for On-Demand Paging
>>>     RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
>>>     RDMA/rxe: Add support for the traditional Atomic operations with ODP
>>>
>>>    drivers/infiniband/sw/rxe/Makefile    |   2 +
>>>    drivers/infiniband/sw/rxe/rxe.c       |  18 ++
>>>    drivers/infiniband/sw/rxe/rxe.h       |  37 ----
>>>    drivers/infiniband/sw/rxe/rxe_loc.h   |  39 ++++
>>>    drivers/infiniband/sw/rxe/rxe_mr.c    |  34 +++-
>>>    drivers/infiniband/sw/rxe/rxe_odp.c   | 282 ++++++++++++++++++++++++++
>>>    drivers/infiniband/sw/rxe/rxe_resp.c  |  18 +-
>>>    drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
>>>    drivers/infiniband/sw/rxe/rxe_verbs.h |  37 ++++
>>>    9 files changed, 419 insertions(+), 53 deletions(-)
>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>>>
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ