Message-ID: <CAJ3xEMj8BJLJvetNoqGOFhqeJHupEd=kbnFeO_Na=1ZYbvBk9g@mail.gmail.com>
Date: Wed, 3 Sep 2014 23:21:03 +0300
From: Or Gerlitz <gerlitz.or@...il.com>
To: Roland Dreier <roland@...nel.org>
Cc: linux-rdma@...r.kernel.org,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Sagi Grimberg <sagig@...lanox.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 for-next 00/16] On demand paging
On Tue, Sep 2, 2014, Or Gerlitz <ogerlitz@...lanox.com> wrote:
> On 7/3/2014 11:44 AM, Haggai Eran wrote:
>>
>> Hi Roland,
>>
>> I understand that you were reluctant to review these patches as long as
>> there was an ongoing debate on whether or not the i_mmap_mutex should be
>> changed into a spinlock.
>>
>> It seems that the debate concluded with the decision to change it into
>> an rwsem [1], as apparently this provides the optimal performance with
>> the new optimistic spinning patch [2].
>>
>> I believe this means that there will be no problem adding paging support
>> to the RDMA stack that depends on sleepable MMU notifiers.
>
>
> Hi Roland,
>
> The ODP patch set was initially posted a whole six months ago (March 2nd,
> 2014). We posted it prior to LSF so you could discuss it with Sagi while
> he was there. Yet there has been no comment from your side so far. It's
> really (really) hard to do proper kernel development when the sub-system
> maintainer provides almost no concrete feedback for over half a year.
>
> Can you please go ahead and tell us your position regarding these
> features/patches?
Hi Roland,
Bump. Can you comment here? These patches have been worked on for a
long time by a dedicated group, and they implement a strategic feature
for the RDMA industry.
I don't see how the RDMA kernel maintainer can leave the development
team up in the air without any comment on their work for half a year.
Or.
>> Changes from V0: http://marc.info/?l=linux-rdma&m=139375790322547&w=2
>>
>> - Rebased against latest upstream / for-next branch.
>> - Removed dependency on patches that were accepted upstream.
>> - Removed pre-patches that were accepted upstream [3].
>> - Add extended uverb call for querying the device (patch 1) and use
>>   kernel device attributes to report ODP capabilities through the new
>>   uverb entry instead of having a special verb.
>> - Allow upgrading page access permissions during page faults.
>> - Minor fixes to issues that came up during regression testing of the
>> patches.
>>
>> The following set of patches implements on-demand paging (ODP) support
>> in the RDMA stack and in the mlx5_ib Infiniband driver.
>>
>> What is on-demand paging?
>>
>> Applications register memory with an RDMA adapter using system calls,
>> and subsequently post IO operations that refer to the corresponding
>> virtual addresses directly to HW. Until now, this was achieved by
>> pinning the memory during the registration calls. The goal of on demand
>> paging is to avoid pinning the pages of registered memory regions (MRs).
>> This will allow users the same flexibility they get when swapping any
>> other part of their process's address space. Instead of requiring the
>> entire MR to fit in physical memory, we can allow the MR to be larger,
>> and only fit the current working set in physical memory.
>>
>> This can make programming with RDMA much simpler. Today, developers that
>> are working with more data than their RAM can hold need either to
>> deregister and reregister memory regions throughout their process's
>> life, or keep a single memory region and copy the data to it. On demand
>> paging will allow these developers to register a single MR at the
>> beginning of their process's life, and let the operating system manage
>> which pages need to be fetched at a given time. In the future, we might
>> be able to provide a single memory access key for each process that
>> would expose the entire process's address space as one large memory
>> region, and developers wouldn't need to register memory regions at all.
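>>
>> As a rough illustration, registering one large ODP MR from user-space
>> could look like the sketch below (a hypothetical C sketch against
>> libibverbs; the IBV_ACCESS_ON_DEMAND flag name is an assumption here,
>> not settled ABI):
>>
>>     #include <stdio.h>
>>     #include <stdlib.h>
>>     #include <infiniband/verbs.h>
>>
>>     int main(void)
>>     {
>>         struct ibv_device **devs = ibv_get_device_list(NULL);
>>         struct ibv_context *ctx = ibv_open_device(devs[0]);
>>         struct ibv_pd *pd = ibv_alloc_pd(ctx);
>>
>>         /* One big buffer, registered once for the process's lifetime;
>>          * with ODP it need not fit in physical memory. */
>>         size_t len = 1UL << 30;
>>         void *buf = malloc(len);
>>
>>         /* IBV_ACCESS_ON_DEMAND (assumed flag name) requests an ODP MR:
>>          * pages are faulted in as the HCA touches them, never pinned. */
>>         struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
>>                                        IBV_ACCESS_LOCAL_WRITE |
>>                                        IBV_ACCESS_REMOTE_WRITE |
>>                                        IBV_ACCESS_ON_DEMAND);
>>         if (!mr) {
>>             perror("ibv_reg_mr");
>>             return 1;
>>         }
>>
>>         /* ... post work requests referring to buf ... */
>>
>>         ibv_dereg_mr(mr);
>>         return 0;
>>     }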
>>
>> How do page faults generally work?
>>
>> With pinned memory regions, the driver would map the virtual addresses
>> to bus addresses, and pass these addresses to the HCA to associate them
>> with the new MR. With ODP, the driver is now allowed to mark some of the
>> pages in the MR as not-present. When the HCA attempts to perform memory
>> access for a communication operation, it notices the page is not
>> present, and raises a page fault event to the driver. In addition, the
>> HCA performs whatever operation is required by the transport protocol to
>> suspend communication until the page fault is resolved.
>>
>> Upon receiving the page fault interrupt, the driver first needs to know
>> on which virtual address the page fault occurred, and on what memory
>> key. When handling send/receive operations, this information is inside
>> the work queue. The driver reads the needed work queue elements, and
>> parses them to gather the address and memory key. For other RDMA
>> operations, the event generated by the HCA only contains the virtual
>> address and rkey, as there are no work queue elements involved.
>>
>> Having the rkey, the driver can find the relevant memory region in its
>> data structures, and calculate the actual pages needed to complete the
>> operation. It then uses get_user_pages to bring the needed pages back
>> into memory, obtains a DMA mapping, and passes the addresses to the HCA.
>> Finally, the driver notifies the HCA it can continue operation on the
>> queue pair that encountered the page fault. The pages that
>> get_user_pages returned are unpinned immediately by releasing their
>> reference.
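>>
>> Put together, the fault flow might look roughly like the skeleton below
>> (a simplified sketch, not the actual driver code: struct odp_mr and the
>> odp_mr_lookup/odp_update_pas/odp_resume_qp helpers are hypothetical,
>> and get_user_pages is shown with its current 3.x signature):
>>
>>     #include <linux/mm.h>
>>     #include <linux/slab.h>
>>     #include <rdma/ib_verbs.h>
>>
>>     static int odp_handle_page_fault(struct ib_device *dev,
>>                                      struct mm_struct *mm,
>>                                      u32 rkey, u64 va, int npages)
>>     {
>>         struct odp_mr *mr;
>>         struct page **pages;
>>         long got;
>>         int i;
>>
>>         /* 1. rkey -> MR lookup in the driver's data structures. */
>>         mr = odp_mr_lookup(dev, rkey);                /* hypothetical */
>>         if (!mr)
>>             return -EINVAL;
>>
>>         pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
>>         if (!pages)
>>             return -ENOMEM;
>>
>>         /* 2. Bring the faulting pages back into memory. */
>>         down_read(&mm->mmap_sem);
>>         got = get_user_pages(current, mm, va & PAGE_MASK, npages,
>>                              1 /* write */, 0 /* force */, pages, NULL);
>>         up_read(&mm->mmap_sem);
>>
>>         /* 3. DMA-map each page and hand the translation to the HCA. */
>>         for (i = 0; i < got; i++) {
>>             dma_addr_t dma = ib_dma_map_page(dev, pages[i], 0,
>>                                              PAGE_SIZE,
>>                                              DMA_BIDIRECTIONAL);
>>             odp_update_pas(mr, va + i * PAGE_SIZE, dma); /* hypothetical */
>>         }
>>
>>         /* 4. Resume the QP that stalled on the fault. */
>>         odp_resume_qp(mr);                            /* hypothetical */
>>
>>         /* 5. Drop our references right away: ODP never keeps pages
>>          *    pinned. */
>>         for (i = 0; i < got; i++)
>>             put_page(pages[i]);
>>         kfree(pages);
>>         return got < 0 ? got : 0;
>>     }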
>>
>> How are invalidations handled?
>>
>> The patches add infrastructure to subscribe the RDMA stack as an mmu
>> notifier client [4]. Each process that uses ODP registers a notifier
>> client. Incoming page invalidation notifications are passed to the
>> mlx5_ib driver, which updates the HCA with new, not-present mappings.
>> Only after flushing the HCA's page table caches does the notifier
>> return, allowing the kernel to release the pages.
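>>
>> A minimal sketch of that subscription, assuming the 3.x mmu_notifier
>> API (the odp_unmap_range/odp_flush_hca_caches helpers are
>> hypothetical):
>>
>>     #include <linux/mmu_notifier.h>
>>
>>     static void odp_invalidate_range_start(struct mmu_notifier *mn,
>>                                            struct mm_struct *mm,
>>                                            unsigned long start,
>>                                            unsigned long end)
>>     {
>>         /* Mark the affected HCA mappings not-present... */
>>         odp_unmap_range(mn, start, end);      /* hypothetical */
>>
>>         /* ...and return only once the HCA's page table caches are
>>          * flushed, so the kernel may safely release the pages. */
>>         odp_flush_hca_caches(mn);             /* hypothetical */
>>     }
>>
>>     static const struct mmu_notifier_ops odp_mn_ops = {
>>         .invalidate_range_start = odp_invalidate_range_start,
>>     };
>>
>>     /* Each process using ODP registers one notifier on its mm. */
>>     static int odp_register_notifier(struct mmu_notifier *mn,
>>                                      struct mm_struct *mm)
>>     {
>>         mn->ops = &odp_mn_ops;
>>         return mmu_notifier_register(mn, mm);
>>     }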
>>
>> What operations are supported?
>>
>> Currently only send, receive, and RDMA write operations are supported
>> on the RC protocol, and also send operations on the UD protocol. We
>> hope to implement support for other transports and operations in the
>> future.
>>
>> The structure of the patchset
>>
>> Patches 1-6:
>> The first set of patches adds page fault support to the IB core layer,
>> allowing MRs to be registered without their pages being pinned. Patch 1
>> adds an extended verb to query device attributes, and patch 2 adds
>> capability bits, configuration options, and a method for querying the
>> paging capabilities from user-space. The next two patches (3-4) make
>> some necessary changes to the ib_umem type. Patches 5 and 6 add paging
>> support and invalidation support, respectively.
>>
>> Patches 7-12:
>> This set of patches adds small pieces of new functionality to the mlx5
>> driver and builds toward paging support. Patch 7 makes changes to the
>> UMR mechanism (an internal mechanism used by mlx5 to update device page
>> mappings). Patch 8 adds infrastructure support for page fault handling
>> to the mlx5_core module. Patch 9 queries the device for paging
>> capabilities, and patch 11 adds a function to do partial device page
>> table updates. Finally, patch 12 adds a helper function to read
>> information from user-space work queues in the driver's context.
>>
>> Patches 13-16:
>> The final part of this patch set adds paging support to the mlx5
>> driver. Patch 13 adds to mlx5_ib the infrastructure to handle page
>> faults coming from mlx5_core. Patch 14 adds the code to handle UD send
>> page faults and RC send and receive page faults. Patch 15 adds support
>> for page faults caused by RDMA write operations, and patch 16 adds
>> invalidation support to the mlx5 driver, allowing pages to be unmapped
>> dynamically.
>>
>> [1] [PATCH 0/5] mm: i_mmap_mutex to rwsem
>> https://lkml.org/lkml/2013/6/24/683
>>
>> [2] Re: Performance regression from switching lock to rw-sem for anon-vma
>> tree
>> https://lkml.org/lkml/2013/6/17/452
>>
>> [3] pre-patches that were accepted upstream:
>> a74d241 IB/mlx5: Refactor UMR to have its own context struct
>> 48fea83 IB/mlx5: Set QP offsets and parameters for user QPs and not
>> just for kernel QPs
>> b475598 mlx5_core: Store MR attributes in mlx5_mr_core during creation
>> and after UMR
>> 8605933 IB/mlx5: Add MR to radix tree in reg_mr_callback
>>
>> [4] Integrating KVM with the Linux Memory Management (presentation),
>> Andrea Arcangeli
>>
>> http://www.linux-kvm.org/wiki/images/3/33/KvmForum2008%24kdf2008_15.pdf
>>
>>
>> Haggai Eran (11):
>> IB/core: Add an extended user verb to query device attributes
>> IB/core: Replace ib_umem's offset field with a full address
>> IB/core: Add umem function to read data from user-space
>> IB/mlx5: Enhance UMR support to allow partial page table update
>> net/mlx5_core: Add support for page faults events and low level
>> handling
>> IB/mlx5: Implement the ODP capability query verb
>> IB/mlx5: Changes in memory region creation to support on-demand
>> paging
>> IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
>> IB/mlx5: Add function to read WQE from user-space
>> IB/mlx5: Page faults handling infrastructure
>> IB/mlx5: Handle page faults
>>
>> Sagi Grimberg (1):
>> IB/core: Add flags for on demand paging support
>>
>> Shachar Raindel (4):
>> IB/core: Add support for on demand paging regions
>> IB/core: Implement support for MMU notifiers regarding on demand
>> paging regions
>> IB/mlx5: Add support for RDMA write responder page faults
>> IB/mlx5: Implement on demand paging by adding support for MMU
>> notifiers
>>
>> drivers/infiniband/Kconfig | 11 +
>> drivers/infiniband/core/Makefile | 1 +
>> drivers/infiniband/core/umem.c | 63 +-
>> drivers/infiniband/core/umem_odp.c | 620 ++++++++++++++++++++
>> drivers/infiniband/core/umem_rbtree.c | 94 +++
>> drivers/infiniband/core/uverbs.h | 1 +
>> drivers/infiniband/core/uverbs_cmd.c | 170 ++++--
>> drivers/infiniband/core/uverbs_main.c | 5 +-
>> drivers/infiniband/hw/amso1100/c2_provider.c | 2 +-
>> drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 +-
>> drivers/infiniband/hw/ipath/ipath_mr.c | 2 +-
>> drivers/infiniband/hw/mlx5/Makefile | 1 +
>> drivers/infiniband/hw/mlx5/main.c | 39 +-
>> drivers/infiniband/hw/mlx5/mem.c | 67 ++-
>> drivers/infiniband/hw/mlx5/mlx5_ib.h | 114 +++-
>> drivers/infiniband/hw/mlx5/mr.c | 303 ++++++++--
>> drivers/infiniband/hw/mlx5/odp.c | 770 +++++++++++++++++++++++++
>> drivers/infiniband/hw/mlx5/qp.c | 198 +++++--
>> drivers/infiniband/hw/nes/nes_verbs.c | 4 +-
>> drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
>> drivers/infiniband/hw/qib/qib_mr.c | 2 +-
>> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 11 +-
>> drivers/net/ethernet/mellanox/mlx5/core/fw.c | 35 +-
>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 8 +-
>> drivers/net/ethernet/mellanox/mlx5/core/qp.c | 134 ++++-
>> include/linux/mlx5/device.h | 73 ++-
>> include/linux/mlx5/driver.h | 20 +-
>> include/linux/mlx5/qp.h | 63 ++
>> include/rdma/ib_umem.h | 29 +-
>> include/rdma/ib_umem_odp.h | 156 +++++
>> include/rdma/ib_verbs.h | 47 +-
>> include/uapi/rdma/ib_user_verbs.h | 25 +
>> 32 files changed, 2907 insertions(+), 165 deletions(-)
>> create mode 100644 drivers/infiniband/core/umem_odp.c
>> create mode 100644 drivers/infiniband/core/umem_rbtree.c
>> create mode 100644 drivers/infiniband/hw/mlx5/odp.c
>> create mode 100644 include/rdma/ib_umem_odp.h
>>
>