Message-ID: <e6580d85-2c57-4ca1-9846-7af831bfceb7@linux.dev>
Date: Mon, 28 Oct 2024 21:19:10 +0100
From: Zhu Yanjun <yanjun.zhu@...ux.dev>
To: "Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@...itsu.com>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"leon@...nel.org" <leon@...nel.org>, "jgg@...pe.ca" <jgg@...pe.ca>,
"zyjzyj2000@...il.com" <zyjzyj2000@...il.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"rpearsonhpe@...il.com" <rpearsonhpe@...il.com>,
"Zhijian Li (Fujitsu)" <lizhijian@...itsu.com>
Subject: Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
On 2024/10/28 8:59, Daisuke Matsuda (Fujitsu) wrote:
> On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote:
>> On 2024/10/9 3:58, Daisuke Matsuda wrote:
>>> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
>>> driver, which has been available only in mlx5 driver[1] so far.
>>>
>>> This series had been blocked by the hang in the srp/002 test[2], which was
>>> believed to have been introduced by commit 9b4b7c1f9f54 ("RDMA/rxe: Add
>>> workqueue support for rxe tasks"). My patches depend on that commit because
>>> the ODP feature requires sleeping in kernel space, which is impossible with
>>> the former tasklet implementation.
>>>
>>> According to the original reporter[3], the hang issue is already gone in
>>> v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
>>> driver is ready to accept this series since there is no longer any reason
>>> to consider reverting back to the old tasklet.
>>>
>>> I omitted some content, such as the motivation behind this series, from the
>>> cover letter. Please see the cover letter of v3 for more details[5].
>>>
>>> [Overview]
>>> When applications register a memory region(MR), RDMA drivers normally pin
>>> pages in the MR so that physical addresses are never changed during RDMA
>>> communication. This requires the MR to fit in physical memory and
>>> inevitably leads to memory pressure. On the other hand, On-Demand Paging
>>> (ODP) allows applications to register MRs without pinning pages. They are
>>> paged-in when the driver requires and paged-out when the OS reclaims. As a
>>> result, it is possible to register a large MR that does not fit in physical
>>> memory without taking up so much physical memory.
>>>
>>> [How does ODP work?]
>>> "struct ib_umem_odp" is used to manage pages. It is created for each
>>> ODP-enabled MR on its registration. This struct holds a pair of arrays
>>> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
>>> PFNs are stored in the driver page table. They are updated on page-in and
>>> page-out, both of which use the common interfaces in the ib_uverbs layer.
>>>
>>> Page-in can occur when the requester, responder or completer accesses an MR
>>> in order to process RDMA operations. If the pages being accessed are not
>>> present in physical memory, or the requisite permissions are not set on
>>> them, a page fault is triggered to make the pages present with proper
>>> permissions and, at the same time, to update the driver page table. After
>>> confirming the presence of the pages, they execute memory accesses such as
>>> read, write or atomic operations.
>>>
>>> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
>>> update of a file that is being used as an MR). When creating an ODP-enabled
>>> MR, the driver registers an MMU notifier callback. When the kernel issues a
>>> page invalidation notification, the callback is invoked to unmap DMA
>>> addresses and update the driver page table. After that, the kernel releases
>>> the pages.
>>>
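As an illustration of the page-in and page-out flow described above, here is a toy user-space model in Python. It is only a sketch and is not code from this patch series or from the kernel: the names ToyOdpTable, page_in, invalidate and access are hypothetical, the frame numbers are made up, and a single pfn_list plus a per-page writable flag stands in for the real dma_list/pfn_list pair.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PAGE_SIZE = 4096  # assumed page size for this toy model


@dataclass
class ToyOdpTable:
    """Toy stand-in for a per-MR driver page table (hypothetical, not kernel code)."""

    npages: int
    # One slot per page; None means "not mapped, must fault before access".
    pfn_list: List[Optional[int]] = field(default_factory=list)
    writable: List[bool] = field(default_factory=list)
    faults: int = 0

    def __post_init__(self) -> None:
        self.pfn_list = [None] * self.npages
        self.writable = [False] * self.npages

    def page_in(self, idx: int, need_write: bool) -> None:
        """Simulate a page fault: make the page present with the needed permission."""
        self.faults += 1
        self.pfn_list[idx] = 0x1000 + idx  # made-up frame number
        self.writable[idx] = self.writable[idx] or need_write

    def invalidate(self, start: int, end: int) -> None:
        """Simulate the MMU notifier callback for the byte range [start, end)."""
        first = start // PAGE_SIZE
        last = min(self.npages, (end + PAGE_SIZE - 1) // PAGE_SIZE)
        for idx in range(first, last):
            self.pfn_list[idx] = None
            self.writable[idx] = False

    def access(self, offset: int, length: int, write: bool) -> List[int]:
        """Fault in missing or under-privileged pages, then return their frames."""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        frames = []
        for idx in range(first, last + 1):
            missing = self.pfn_list[idx] is None
            if missing or (write and not self.writable[idx]):
                self.page_in(idx, write)
            frames.append(self.pfn_list[idx])
        return frames


if __name__ == "__main__":
    table = ToyOdpTable(npages=4)
    table.access(0, 8192, write=False)  # pages 0 and 1 are faulted in
    table.access(0, 4096, write=True)   # page 0 is present but read-only: faults again
    table.invalidate(4096, 8192)        # page 1 is paged out
    table.access(4096, 1, write=False)  # page 1 is faulted in again
    print(table.faults)                 # prints 4
```

The point of the model is that every access first checks the table and faults only when a page is absent or lacks the required permission, while invalidation simply clears the affected slots so the next access faults again.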
>>> [Supported operations]
>>> All traditional operations are supported on RC connections. The new Atomic
>>> write[6] and RDMA Flush[7] operations are not included in this patchset. I
>>> will post them later after this patchset is merged. On UD connections, Send,
>>> Recv, and SRQ-Recv are supported.
>>>
>>> [How to test ODP?]
>>> There are only a few resources available for testing. The pyverbs testcases
>>> in rdma-core and perftest[8] are the recommended ones. Besides these, the
>>> ibv_rc_pingpong command can also be used for testing. Note that you may
>>> have to build perftest from upstream because old versions do not handle ODP
>>> capabilities correctly.
>>
>> Thanks a lot. I have tested these patches with perftest. Since ODP (On
>> Demand Paging) is a new feature for rxe, can you also add some testcases to
>> rdma-core? Then we can use rdma-core to test this feature on rxe.
>
> I added Read/Write/Atomics tests two years ago.
> Cf. https://github.com/linux-rdma/rdma-core/pull/1229
>
> Each of the ODP testcases causes a page invalidation so that the RDMA
> traffic that follows triggers the ODP page-in flow.
>
> Currently, the 7 testcases below pass on the rxe ODP v8 implementation.
> test_odp_rc_atomic_cmp_and_swp
> test_odp_rc_atomic_fetch_and_add
> test_odp_rc_mixed_mr
> test_odp_rc_rdma_read
> test_odp_rc_rdma_write
> test_odp_rc_traffic
> test_odp_ud_traffic
> The remaining 11 tests are skipped because of missing capabilities.
Thanks. When I run the rdma-core tests, the above tests also pass in my
test environment.
I am fine with this.
Zhu Yanjun
>
> Please let me know if you have any suggestions for improvement.
>
> Thanks,
> Daisuke Matsuda
>
>>
>> That is, add some testcases to run_tests.py, so that run_tests.py can be
>> used to verify the ODP feature on rxe.
>>
>> Thanks,
>> Zhu Yanjun
>>
>>>
>>> The latest ODP tree is available from github:
>>> https://github.com/ddmatsu/linux/tree/odp_v8
>>>
>>> [Future work]
>>> My next work is to enable the new Atomic write[6] and RDMA Flush[7]
>>> operations with ODP. After that, I am going to implement the prefetch
>>> feature. It allows applications to trigger page faults using
>>> ibv_advise_mr(3) to optimize performance. Some existing software like
>>> librpma[9] uses this feature. Additionally, I think we can also add the
>>> implicit ODP feature in the future.
>>>
>>> [1] Understanding On Demand Paging (ODP)
>>> https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x
>>>
>>> [2] [bug report] blktests srp/002 hang
>>> https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/
>>>
>>> [3] blktests failures with v6.10-rc1 kernel
>>> https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/
>>>
>>> [4] [00/15] ethernet: Convert from tasklet to BH workqueue
>>> https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/
>>>
>>> [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
>>> https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/
>>>
>>> [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
>>> https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/
>>>
>>> [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
>>> https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/
>>>
>>> [8] linux-rdma/perftest: Infiniband Verbs Performance Tests
>>> https://github.com/linux-rdma/perftest
>>>
>>> [9] librpma: Remote Persistent Memory Access Library
>>> https://github.com/pmem/rpma
>>>
>>> v7->v8:
>>> 1) Dropped the first patch because the same change was made by Bob Pearson.
>>> cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
>>> 2) Rebased to 6.12.1-rc2
>>>
>>> v6->v7:
>>> 1) Rebased to 6.6.0
>>> 2) Disabled using hugepages with ODP
>>> 3) Addressed comments on v6 from Jason and Zhu
>>> cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/
>>>
>>> v5->v6:
>>> Fixed the implementation according to Jason's suggestions
>>> cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
>>> cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/
>>>
>>> v4->v5:
>>> 1) Rebased to 6.4.0-rc2+
>>> 2) Changed to schedule all works on responder and completer to workqueue
>>>
>>> v3->v4:
>>> 1) Re-designed functions that access MRs to use the MR xarray.
>>> 2) Rebased onto the latest jgg-for-next tree.
>>>
>>> v2->v3:
>>> 1) Removed a patch that changes the common ib_uverbs layer.
>>> 2) Re-implemented patches for conversion to workqueue.
>>> 3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
>>> 4) Fixed some functions that returned incorrect errors.
>>> 5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
>>>
>>> v1->v2:
>>> 1) Fixed a crash issue reported by Haris Iqbal.
>>> 2) Tried to make lock patterns clearer as pointed out by Romanovsky.
>>> 3) Minor clean ups and fixes.
>>>
>>> Daisuke Matsuda (6):
>>> RDMA/rxe: Make MR functions accessible from other rxe source code
>>> RDMA/rxe: Move resp_states definition to rxe_verbs.h
>>> RDMA/rxe: Add page invalidation support
>>> RDMA/rxe: Allow registering MRs for On-Demand Paging
>>> RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
>>> RDMA/rxe: Add support for the traditional Atomic operations with ODP
>>>
>>> drivers/infiniband/sw/rxe/Makefile | 2 +
>>> drivers/infiniband/sw/rxe/rxe.c | 18 ++
>>> drivers/infiniband/sw/rxe/rxe.h | 37 ----
>>> drivers/infiniband/sw/rxe/rxe_loc.h | 39 ++++
>>> drivers/infiniband/sw/rxe/rxe_mr.c | 34 +++-
>>> drivers/infiniband/sw/rxe/rxe_odp.c | 282 ++++++++++++++++++++++++++
>>> drivers/infiniband/sw/rxe/rxe_resp.c | 18 +-
>>> drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
>>> drivers/infiniband/sw/rxe/rxe_verbs.h | 37 ++++
>>> 9 files changed, 419 insertions(+), 53 deletions(-)
>>> create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>>>
>