Message-ID:
<OS3PR01MB9865DCDAEDDA8187267429AFE54A2@OS3PR01MB9865.jpnprd01.prod.outlook.com>
Date: Mon, 28 Oct 2024 07:59:38 +0000
From: "Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@...itsu.com>
To: 'Zhu Yanjun' <yanjun.zhu@...ux.dev>, "linux-rdma@...r.kernel.org"
<linux-rdma@...r.kernel.org>, "leon@...nel.org" <leon@...nel.org>,
"jgg@...pe.ca" <jgg@...pe.ca>, "zyjzyj2000@...il.com" <zyjzyj2000@...il.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"rpearsonhpe@...il.com" <rpearsonhpe@...il.com>, "Zhijian Li (Fujitsu)"
<lizhijian@...itsu.com>
Subject: RE: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote:
> On 2024/10/9 3:58, Daisuke Matsuda wrote:
> > This patch series implements the On-Demand Paging feature in the SoftRoCE
> > (rxe) driver. So far, the feature has been available only in the mlx5
> > driver[1].
> >
> > This series had been blocked by a hang in the srp/002 test[2], which was
> > believed to have been introduced by commit 9b4b7c1f9f54
> > ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches depend on
> > that commit because the ODP feature requires sleeping in kernel space,
> > which is impossible with the former tasklet implementation.
> >
> > According to the original reporter[3], the hang issue is already gone in
> > v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
> > driver is ready to accept this series since there is no longer any reason
> > to consider reverting back to the old tasklet.
> >
> > I omitted some content, such as the motivation behind this series, from
> > the cover letter. Please see the cover letter of v3 for more details[5].
> >
> > [Overview]
> > When applications register a memory region(MR), RDMA drivers normally pin
> > pages in the MR so that physical addresses are never changed during RDMA
> > communication. This requires the MR to fit in physical memory and
> > inevitably leads to memory pressure. On the other hand, On-Demand Paging
> > (ODP) allows applications to register MRs without pinning pages. Pages are
> > paged in when the driver needs them and paged out when the OS reclaims
> > them. As a result, it is possible to register a large MR that does not fit
> > in physical memory without consuming that much physical memory.
> >
> > [How does ODP work?]
> > "struct ib_umem_odp" is used to manage pages. It is created for each
> > ODP-enabled MR on its registration. This struct holds a pair of arrays
> > (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> > PFNs are stored in the driver page table. They are updated on page-in and
> > page-out, both of which use the common interfaces in the ib_uverbs layer.
> >
> > Page-in can occur when the requester, responder, or completer accesses an
> > MR in order to process RDMA operations. If the pages being accessed are
> > not present in physical memory, or the requisite permissions are not set
> > on them, a page fault is triggered to make the pages present with proper
> > permissions and, at the same time, to update the driver page table.
> > After confirming the presence of the pages, they perform the memory
> > access, such as a read, write, or atomic operation.
> >
> > Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> > update of a file that is being used as an MR). When creating an ODP-enabled
> > MR, the driver registers an MMU notifier callback. When the kernel issues a
> > page invalidation notification, the callback is invoked to unmap DMA
> > addresses and update the driver page table. After that, the kernel releases
> > the pages.
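
The page-in and page-out flow described above can be illustrated with a
small user-space toy model. This is only a sketch under stated assumptions,
not the rxe or ib_umem_odp code: the class name ToyOdpMr, its methods, the
4 KiB page size, and the made-up PFN values are all hypothetical. It shows
the pair of arrays acting as a driver page table, entries being filled on a
fault, and entries being dropped by an invalidation.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ToyOdpMr:
    """Toy model of an ODP-enabled MR (hypothetical, user-space only)."""

    def __init__(self, length):
        npages = (length + PAGE_SIZE - 1) // PAGE_SIZE
        # Nothing is pinned at registration: every entry starts absent.
        self.pfn_list = [None] * npages
        self.dma_list = [None] * npages
        self.faults = 0

    @staticmethod
    def _page_range(offset, length):
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        return range(first, last + 1)

    def page_in(self, offset, length):
        """Fault in missing pages before an access; return the fault count."""
        faulted = 0
        for idx in self._page_range(offset, length):
            if self.pfn_list[idx] is None:
                pfn = 0x1000 + idx  # made-up PFN, for illustration only
                self.pfn_list[idx] = pfn
                self.dma_list[idx] = pfn * PAGE_SIZE
                faulted += 1
        self.faults += faulted
        return faulted

    def invalidate(self, offset, length):
        """Stand-in for the MMU notifier callback: drop mappings in a range."""
        for idx in self._page_range(offset, length):
            self.pfn_list[idx] = None
            self.dma_list[idx] = None

    def present_pages(self):
        return sum(pfn is not None for pfn in self.pfn_list)


mr = ToyOdpMr(10 * PAGE_SIZE)   # registration pins nothing
mr.page_in(0, 2 * PAGE_SIZE)    # first access faults in two pages
mr.page_in(0, 2 * PAGE_SIZE)    # second access finds them present
mr.invalidate(0, PAGE_SIZE)     # "page-out" of the first page
mr.page_in(0, 1)                # next access faults it in again
```

The point of the sketch is that an access only pays for a fault when the
entry is absent, and that an invalidation simply returns the entry to the
absent state.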
> >
> > [Supported operations]
> > All traditional operations are supported on RC connection. The new Atomic
> > write[6] and RDMA Flush[7] operations are not included in this patchset. I
> > will post them later after this patchset is merged. On UD connection, Send,
> > Recv, and SRQ-Recv are supported.
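
As a rough illustration of the supported operations listed above, the
per-transport ODP capabilities can be written as bitmasks. This is a sketch,
not the code in the series: the bit values are assumed to mirror enum
ibv_odp_transport_cap_bits in libibverbs, the constant and function names
are local to this example, and including SRQ-Recv in the RC mask is my
reading of "all traditional operations".

```python
# Bit values assumed to mirror enum ibv_odp_transport_cap_bits (libibverbs).
ODP_SUPPORT_SEND = 1 << 0
ODP_SUPPORT_RECV = 1 << 1
ODP_SUPPORT_WRITE = 1 << 2
ODP_SUPPORT_READ = 1 << 3
ODP_SUPPORT_ATOMIC = 1 << 4
ODP_SUPPORT_SRQ_RECV = 1 << 5

CAP_NAMES = {
    ODP_SUPPORT_SEND: "SEND",
    ODP_SUPPORT_RECV: "RECV",
    ODP_SUPPORT_WRITE: "WRITE",
    ODP_SUPPORT_READ: "READ",
    ODP_SUPPORT_ATOMIC: "ATOMIC",
    ODP_SUPPORT_SRQ_RECV: "SRQ_RECV",
}

# RC: all traditional operations (assumption: this includes SRQ-Recv).
RC_ODP_CAPS = (ODP_SUPPORT_SEND | ODP_SUPPORT_RECV | ODP_SUPPORT_WRITE |
               ODP_SUPPORT_READ | ODP_SUPPORT_ATOMIC | ODP_SUPPORT_SRQ_RECV)

# UD: Send, Recv, and SRQ-Recv only.
UD_ODP_CAPS = ODP_SUPPORT_SEND | ODP_SUPPORT_RECV | ODP_SUPPORT_SRQ_RECV


def supported_ops(caps):
    """Return the names of the operations whose bits are set in caps."""
    return [name for bit, name in sorted(CAP_NAMES.items()) if caps & bit]


print("RC:", supported_ops(RC_ODP_CAPS))
print("UD:", supported_ops(UD_ODP_CAPS))
```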
> >
> > [How to test ODP?]
> > There are only a few resources available for testing. The pyverbs
> > testcases in rdma-core and perftest[8] are the recommended ones. Beyond
> > those, the ibv_rc_pingpong command can also be used for testing. Note that
> > you may have to build perftest from upstream because old versions do not
> > handle ODP capabilities correctly.
>
> Thanks a lot. I have tested these patches with perftest. Because ODP (On
> Demand Paging) is a feature, can you also add some testcases into rdma
> core? So we can use rdma-core to make tests with this feature of rxe.
I added Read/Write/Atomics tests two years ago.
Cf. https://github.com/linux-rdma/rdma-core/pull/1229
Each ODP testcase causes a page invalidation so that the subsequent RDMA
traffic triggers the ODP page-in flow.
Currently, the 7 testcases below pass on the rxe ODP v8 implementation:
test_odp_rc_atomic_cmp_and_swp
test_odp_rc_atomic_fetch_and_add
test_odp_rc_mixed_mr
test_odp_rc_rdma_read
test_odp_rc_rdma_write
test_odp_rc_traffic
test_odp_ud_traffic
The remaining 11 tests are skipped because the required capabilities are
not supported.
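
To show what a capability-based skip looks like in general, here is a
minimal, self-contained sketch using Python's unittest. It is not the
pyverbs test code: the class and test names are hypothetical, the device is
a constant rather than a real query, and the bit values are assumed to
mirror enum ibv_odp_general_caps in libibverbs. Implicit ODP is used as the
missing capability because it is not part of this series.

```python
import io
import unittest

# General ODP capability bits; assumed to mirror enum ibv_odp_general_caps.
ODP_SUPPORT = 1 << 0
ODP_SUPPORT_IMPLICIT = 1 << 1

# Hypothetical device: explicit ODP only.
DEVICE_GENERAL_CAPS = ODP_SUPPORT


def requires_caps(needed):
    """Skip the decorated test unless the toy device has all needed bits."""
    missing = needed & ~DEVICE_GENERAL_CAPS
    return unittest.skipIf(missing, f"missing ODP caps: {missing:#x}")


class ToyOdpCases(unittest.TestCase):
    @requires_caps(ODP_SUPPORT)
    def test_explicit_odp(self):
        self.assertTrue(DEVICE_GENERAL_CAPS & ODP_SUPPORT)

    @requires_caps(ODP_SUPPORT | ODP_SUPPORT_IMPLICIT)
    def test_implicit_odp(self):
        self.fail("not reached on a device without implicit ODP")


def run_toy_suite():
    """Run the toy cases quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyOdpCases)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_toy_suite()
print("skipped:", [test.id().split(".")[-1] for test, _ in result.skipped])
```

A skipped test is reported separately from a failure, so the run still
counts as successful.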
Please let me know if you have any suggestions for improvement.
Thanks,
Daisuke Matsuda
>
> That is, add some testcases in run_tests.py, so use run_tests.py to
> verify this (ODP) feature on rxe.
>
> Thanks,
> Zhu Yanjun
>
> >
> > The latest ODP tree is available from github:
> > https://github.com/ddmatsu/linux/tree/odp_v8
> >
> > [Future work]
> > My next work is to enable the new Atomic write[6] and RDMA Flush[7]
> > operations with ODP. After that, I am going to implement the prefetch
> > feature. It allows applications to trigger page faults using
> > ibv_advise_mr(3) to optimize performance. Some existing software such as
> > librpma[9] uses this feature. Additionally, I think we can also add the
> > implicit ODP feature in the future.
> >
> > [1] Understanding On Demand Paging (ODP)
> > https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x
> >
> > [2] [bug report] blktests srp/002 hang
> > https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/
> >
> > [3] blktests failures with v6.10-rc1 kernel
> > https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/
> >
> > [4] [00/15] ethernet: Convert from tasklet to BH workqueue
> > https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/
> >
> > [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
> > https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/
> >
> > [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
> > https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/
> >
> > [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
> > https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/
> >
> > [8] linux-rdma/perftest: Infiniband Verbs Performance Tests
> > https://github.com/linux-rdma/perftest
> >
> > [9] librpma: Remote Persistent Memory Access Library
> > https://github.com/pmem/rpma
> >
> > v7->v8:
> > 1) Dropped the first patch because the same change was made by Bob Pearson.
> > cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
> > 2) Rebased to 6.12.1-rc2
> >
> > v6->v7:
> > 1) Rebased to 6.6.0
> > 2) Disabled using hugepages with ODP
> > 3) Addressed comments on v6 from Jason and Zhu
> > cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/
> >
> > v5->v6:
> > Fixed the implementation according to Jason's suggestions
> > cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
> > cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/
> >
> > v4->v5:
> > 1) Rebased to 6.4.0-rc2+
> > 2) Changed to schedule all work on the responder and completer to the workqueue
> >
> > v3->v4:
> > 1) Re-designed functions that access MRs to use the MR xarray.
> > 2) Rebased onto the latest jgg-for-next tree.
> >
> > v2->v3:
> > 1) Removed a patch that changes the common ib_uverbs layer.
> > 2) Re-implemented patches for conversion to workqueue.
> > 3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
> > 4) Fixed some functions that returned incorrect errors.
> > 5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
> >
> > v1->v2:
> > 1) Fixed a crash issue reported by Haris Iqbal.
> > 2) Tried to make lock patterns clearer as pointed out by Romanovsky.
> > 3) Minor clean ups and fixes.
> >
> > Daisuke Matsuda (6):
> > RDMA/rxe: Make MR functions accessible from other rxe source code
> > RDMA/rxe: Move resp_states definition to rxe_verbs.h
> > RDMA/rxe: Add page invalidation support
> > RDMA/rxe: Allow registering MRs for On-Demand Paging
> > RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
> > RDMA/rxe: Add support for the traditional Atomic operations with ODP
> >
> > drivers/infiniband/sw/rxe/Makefile | 2 +
> > drivers/infiniband/sw/rxe/rxe.c | 18 ++
> > drivers/infiniband/sw/rxe/rxe.h | 37 ----
> > drivers/infiniband/sw/rxe/rxe_loc.h | 39 ++++
> > drivers/infiniband/sw/rxe/rxe_mr.c | 34 +++-
> > drivers/infiniband/sw/rxe/rxe_odp.c | 282 ++++++++++++++++++++++++++
> > drivers/infiniband/sw/rxe/rxe_resp.c | 18 +-
> > drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
> > drivers/infiniband/sw/rxe/rxe_verbs.h | 37 ++++
> > 9 files changed, 419 insertions(+), 53 deletions(-)
> > create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
> >