linux-kernel - Re: [PATCH 00/15] HMM (Heterogeneous Memory Management) v24

Message-ID: <960ef002-3cfd-5b91-054e-aa685abc5f1f@nvidia.com>
Date:   Thu, 29 Jun 2017 22:32:49 -0700
From:   John Hubbard <jhubbard@...dia.com>
To:     Jérôme Glisse <jglisse@...hat.com>,
        <akpm@...ux-foundation.org>, <linux-kernel@...r.kernel.org>,
        <linux-mm@...ck.org>
CC:     Dan Williams <dan.j.williams@...el.com>,
        David Nellans <dnellans@...dia.com>
Subject: Re: [PATCH 00/15] HMM (Heterogeneous Memory Management) v24

On 06/28/2017 11:00 AM, Jérôme Glisse wrote:
> 
> Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so I
> test the same kernel as the kbuild system; git branch:
> 
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v24
> 
> Changes since v23 are code comment fixes, simplified kernel configuration
> and improved allocation of new pages on migration to device memory (last
> patch in this patchset).

Hi Jerome,

Tiny note: one more change is that hmm_devmem_fault_range() has been
removed (and thanks for taking care of that, btw).

Anyway, this looks good. A basic smoke test shows the following:

1. We definitely *require* your other patch, 
"[PATCH] x86/mm/hotplug: fix BUG_ON() after hotremove by not freeing pud v3",
otherwise I will reliably hit that bug every time I run my simple page fault
test. So, let me know if I should ping that thread. It looks like your patch
was not rejected, but I can't tell if (!rejected == accepted), there. :)

We'll continue testing, but I expect at this point that anything we find
can be patched up after HMM finally gets merged.

thanks,
John Hubbard
NVIDIA

> 
> Everything else is the same. Below is the long description of what HMM
> is about and why. At the end of this email I briefly describe each patch
> and suggest reviewers for each of them.
> 
> 
> Heterogeneous Memory Management (HMM) (description and justification)
> 
> Today device drivers expose a dedicated memory allocation API through
> their device file, often relying on a combination of IOCTL and mmap calls.
> The device can only access and use memory allocated through this API. This
> effectively splits the program address space into objects allocated for
> the device and usable by the device, and other regular memory (malloc,
> mmap of a file, shared memory, ...) only accessible by the CPU (or in a
> very limited way by a device by pinning memory).
> 
> Allowing different isolated components of a program to use a device thus
> requires duplicating the input data structures using the device memory
> allocator. This is reasonable for simple data structures (array, grid,
> image, ...) but it gets extremely complex with advanced data structures
> (list, tree, graph, ...) that rely on a web of memory pointers. This is
> becoming a serious limitation on the kind of workload that can be
> offloaded to a device like a GPU.
> 
> New industry standards like C++, OpenCL or CUDA are pushing to remove this
> barrier. This requires a shared address space between the GPU device and
> the CPU so that the GPU can access any memory of a process (while still
> obeying memory protection like read only). This kind of feature is also
> appearing in various other operating systems.
> 
> HMM is a set of helpers to facilitate several aspects of address space
> sharing and device memory management. Unlike existing sharing mechanisms
> that rely on pinning pages used by a device, HMM relies on mmu_notifier to
> propagate CPU page table updates to the device page table.
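
The mirroring idea above can be sketched as a toy userspace model. This is
not code from the patchset and it does not use the real mmu_notifier API:
struct toy_mm and the toy_* functions are hypothetical names, and both page
tables are plain arrays. Under those assumptions it only illustrates that
the device copy of an entry is dropped whenever the CPU entry changes, and
is refilled from the CPU page table on the next device access.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES  8
#define INVALID (-1L)

/* Hypothetical toy address space: one device entry mirrors each CPU entry. */
struct toy_mm {
    long cpu_pt[NPAGES]; /* CPU page table: virtual page -> frame, or INVALID */
    long dev_pt[NPAGES]; /* device mirror; INVALID until the device faults */
};

static void toy_mm_init(struct toy_mm *mm)
{
    for (size_t i = 0; i < NPAGES; i++) {
        mm->cpu_pt[i] = INVALID;
        mm->dev_pt[i] = INVALID;
    }
}

/* Stand-in for a notifier callback: drop the stale device entry. */
static void toy_invalidate(struct toy_mm *mm, size_t vpage)
{
    mm->dev_pt[vpage] = INVALID;
}

/* Every CPU page table update invalidates the mirror entry first. */
static void toy_cpu_update(struct toy_mm *mm, size_t vpage, long frame)
{
    assert(vpage < NPAGES);
    toy_invalidate(mm, vpage);
    mm->cpu_pt[vpage] = frame;
}

/* Device access: on a mirror miss, refill from the CPU page table.
 * Returns the frame, or INVALID if the CPU has no mapping either. */
static long toy_dev_access(struct toy_mm *mm, size_t vpage)
{
    assert(vpage < NPAGES);
    if (mm->dev_pt[vpage] == INVALID)
        mm->dev_pt[vpage] = mm->cpu_pt[vpage];
    return mm->dev_pt[vpage];
}
```

In this model no page is ever pinned: a caller maps a page with
toy_cpu_update(), the device picks it up with toy_dev_access(), and a later
toy_cpu_update() on the same page forces the device to look it up again.
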
> 
> Duplicating the CPU page table is only one aspect necessary for
> efficiently using a device like a GPU. GPU local memory has bandwidth in
> the terabytes/second range, but it is connected to main memory through a
> system bus like PCIe that is limited to 32 gigabytes/second (PCIe 4.0
> x16). Thus it is necessary to allow migration of process memory from main
> system memory to device memory. The issue is that on platforms that only
> have PCIe, the device memory is not accessible by the CPU with the same
> properties as main memory (cache coherency, atomic operations, ...).
> 
> To allow migration from main memory to device memory, HMM provides a set
> of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
> which is un-addressable by the CPU but still has struct pages representing
> it. This allows most of the core kernel logic that deals with process
> memory to stay oblivious to the peculiarities of device memory.
> 
> When a page backing an address of a process is migrated to device memory,
> the CPU page table entry is set to a new specific swap entry. CPU access
> to such an address triggers a migration back to system memory, just as if
> the page had been swapped to disk. HMM also blocks anyone from pinning a
> ZONE_DEVICE page so that it can always be migrated back to system memory
> if the CPU accesses it. Conversely, HMM does not migrate to device memory
> any page that is pinned in system memory.
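
The rules in the paragraph above can be written down as a small state
machine. This is a toy userspace sketch, not HMM code: struct toy_page and
the toy_* functions are hypothetical names, and a real migration also copies
data and rewrites page table entries, which this model leaves out. It only
encodes the three rules: a pinned page is not migrated to the device, a
device-resident page cannot be pinned, and a CPU access to a device-resident
page brings it back to system memory first.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_location { TOY_SYSTEM, TOY_DEVICE };

/* Hypothetical per-page state for the model. */
struct toy_page {
    enum toy_location where;
    int pin_count;       /* pins currently held on the page */
    int migrations_back; /* CPU accesses that pulled the page back */
};

/* Migrate to device memory; refused while the page is pinned. */
static bool toy_migrate_to_device(struct toy_page *p)
{
    if (p->pin_count > 0)
        return false;
    p->where = TOY_DEVICE;
    return true;
}

/* Pin the page; refused while it lives in device memory. */
static bool toy_pin(struct toy_page *p)
{
    if (p->where == TOY_DEVICE)
        return false;
    p->pin_count++;
    return true;
}

static void toy_unpin(struct toy_page *p)
{
    assert(p->pin_count > 0);
    p->pin_count--;
}

/* CPU access: a device-resident page is migrated back first,
 * much like a swapped-out page is brought back on a fault. */
static void toy_cpu_access(struct toy_page *p)
{
    if (p->where == TOY_DEVICE) {
        p->where = TOY_SYSTEM;
        p->migrations_back++;
    }
}
```

Because pinning and device residency exclude each other in this model,
toy_cpu_access() can always succeed, which is the property the paragraph
above is after.
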
> 
> To allow efficient migration between device memory and main memory, a new
> migrate_vma() helper is added with this patchset. It allows leveraging
> the device DMA engine to perform the copy operation.
> 
> This feature will be used by upstream drivers like nouveau and mlx5, and
> probably others in the future (amdgpu is the next suspect in line). We are
> actively working on nouveau and mlx5 support. To test this patchset we
> also worked with the NVidia closed source driver team; they have more
> resources than us to test this kind of infrastructure and also a bigger
> and better userspace ecosystem with various real industry workloads that
> can be used to test and profile HMM.
> 
> The expected workload is a program that builds a data set on the CPU (from
> disk, from network, from sensors, ...). The program uses a GPU API
> (OpenCL, CUDA, ...) to give hints on memory placement for the input data
> and also for the output buffer. The program calls the GPU API to schedule
> a GPU job; this happens using a device driver specific ioctl. All this is
> hidden from the programmer's point of view in the case of a C++ compiler
> that transparently offloads some part of a program to the GPU. The program
> can keep doing other stuff on the CPU while the GPU is crunching numbers.
> 
> It is expected that the CPU will not access the same data set as the GPU
> while the GPU is working on it, but this is not mandatory. In fact we
> expect some small memory objects to be actively accessed by both GPU and
> CPU concurrently as a synchronization channel and/or for monitoring
> purposes. Such objects will stay in system memory and should not be
> bottlenecked by system bus bandwidth (rare write and read access from both
> CPU and GPU).
> 
> As we are relying on the device driver API, HMM does not introduce any new
> syscall nor does it modify any existing ones. It does not change any POSIX
> semantics or behaviors. For instance, the child after a fork of a process
> that is using HMM will not be impacted in any way, nor is there any data
> hazard between child COW or parent COW of memory that was migrated to
> device prior to fork.
> 
> HMM assumes a number of hardware features. The device must allow its page
> table to be updated at any time (i.e. device jobs must be preemptable).
> The device page table must provide memory protection such as read only.
> The device must track write access (dirty bit). The device must have a
> minimum granularity that matches PAGE_SIZE (i.e. 4k).
> 
> 
> Reviewers (just a hint):
> Patch 1  HMM documentation
> Patch 2  introduces core infrastructure and definitions of HMM, a pretty
>          small patch and easy to review
> Patch 3  introduces the mirror functionality of HMM; it relies on
>          mmu_notifier and thus someone familiar with that part would be
>          in a better position to review
> Patch 4  is a helper to snapshot the CPU page table while synchronizing
>          with concurrent page table updates. Understanding mmu_notifier
>          makes review easier.
> Patch 5  is mostly a wrapper around handle_mm_fault()
> Patch 6  adds a new add_pages() helper to avoid modifying each arch memory
>          hotplug function
> Patch 7  adds a new memory type for ZONE_DEVICE and also adds all the logic
>          in various core mm to support this new type. Dan Williams and
>          any core mm contributor are the best people to review each half
>          of this patch
> Patch 8  special cases HMM ZONE_DEVICE pages inside put_page(). Kirill and
>          Dan Williams are the best people to review this
> Patch 9  adds a helper to hotplug un-addressable device memory as a new
>          type of ZONE_DEVICE memory (new type introduced in patch 7 of
>          this series). This is boilerplate code around memory hotplug and
>          it also picks a free range of physical addresses for the device
>          memory. Note that the physical addresses do not point to anything
>          (at least as far as the kernel knows).
> Patch 10 introduces a new hmm_device class as a helper for device drivers
>          that want to expose multiple device memories under a common fake
>          device driver. This is useful for multi-GPU configurations.
>          Anyone familiar with device driver infrastructure can review
>          this. Boilerplate code really.
> Patch 11 adds a new migrate mode. Anyone familiar with page migration is
>          welcome to review.
> Patch 12 introduces a new migration helper (migrate_vma()) that allows
>          migrating a range of virtual addresses of a process using the
>          device DMA engine to perform the copy. It is not limited to
>          copying from and to a device but can also copy between any kind
>          of source and destination memory. Again, anyone familiar with
>          migration code should be able to verify the logic.
> Patch 13 optimizes the new migrate_vma() by unmapping pages while we are
>          collecting them. This can be reviewed by any mm folks.
> Patch 14 adds unaddressable memory migration to the helper introduced in
>          patch 12; this can be reviewed by anyone familiar with migration
>          code
> Patch 15 adds a feature that allows a device to allocate non-present pages
>          on the GPU when migrating a range of addresses to device memory.
>          This is a helper for device drivers to avoid having to first
>          allocate system memory before migration to device memory
> 
> 
> Previous patchset postings:
> v1 http://lwn.net/Articles/597289/
> v2 https://lkml.org/lkml/2014/6/12/559
> v3 https://lkml.org/lkml/2014/6/13/633
> v4 https://lkml.org/lkml/2014/8/29/423
> v5 https://lkml.org/lkml/2014/11/3/759
> v6 http://lwn.net/Articles/619737/
> v7 http://lwn.net/Articles/627316/
> v8 https://lwn.net/Articles/645515/
> v9 https://lwn.net/Articles/651553/
> v10 https://lwn.net/Articles/654430/
> v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
> v12 http://www.kernelhub.org/?msg=972982&p=2
> v13 https://lwn.net/Articles/706856/
> v14 https://lkml.org/lkml/2016/12/8/344
> v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
> v16 http://www.spinics.net/lists/linux-mm/msg119814.html
> v17 https://lkml.org/lkml/2017/1/27/847
> v18 https://lkml.org/lkml/2017/3/16/596
> v19 https://lkml.org/lkml/2017/4/5/831
> v20 https://lwn.net/Articles/720715/
> v21 https://lkml.org/lkml/2017/4/24/747
> v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
> 
> 
> Jérôme Glisse (14):
>   hmm: heterogeneous memory management documentation v2
>   mm/hmm: heterogeneous memory management (HMM for short) v4
>   mm/hmm/mirror: mirror process address space on device with HMM helpers
>     v3
>   mm/hmm/mirror: helper to snapshot CPU page table v3
>   mm/hmm/mirror: device page fault handler
>   mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v4
>   mm/ZONE_DEVICE: special case put_page() for device private pages v2
>   mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v6
>   mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
>   mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
>   mm/migrate: new memory migration helper for use with device memory v4
>   mm/migrate: migrate_vma() unmap page from vma while collecting pages
>   mm/migrate: support un-addressable ZONE_DEVICE page in migration v2
>   mm/migrate: allow migrate_vma() to alloc new page on empty entry v3
> 
> Michal Hocko (1):
>   mm/memory_hotplug: introduce add_pages
> 
>  Documentation/vm/hmm.txt       |  344 ++++++++++++
>  MAINTAINERS                    |    7 +
>  arch/x86/Kconfig               |    4 +
>  arch/x86/mm/init_64.c          |   22 +-
>  fs/aio.c                       |    8 +
>  fs/f2fs/data.c                 |    5 +-
>  fs/hugetlbfs/inode.c           |    5 +-
>  fs/proc/task_mmu.c             |    7 +
>  fs/ubifs/file.c                |    5 +-
>  include/linux/hmm.h            |  458 +++++++++++++++
>  include/linux/ioport.h         |    1 +
>  include/linux/memory_hotplug.h |   11 +
>  include/linux/memremap.h       |   86 +++
>  include/linux/migrate.h        |  124 +++++
>  include/linux/migrate_mode.h   |    5 +
>  include/linux/mm.h             |   25 +
>  include/linux/mm_types.h       |    6 +
>  include/linux/swap.h           |   24 +-
>  include/linux/swapops.h        |   68 +++
>  kernel/fork.c                  |    2 +
>  kernel/memremap.c              |   53 +-
>  mm/Kconfig                     |   34 ++
>  mm/Makefile                    |    2 +-
>  mm/balloon_compaction.c        |    8 +
>  mm/hmm.c                       | 1193 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c                    |   61 ++
>  mm/memory_hotplug.c            |   10 +-
>  mm/migrate.c                   |  806 ++++++++++++++++++++++++++-
>  mm/mprotect.c                  |   14 +
>  mm/page_vma_mapped.c           |   10 +
>  mm/rmap.c                      |   25 +
>  mm/zsmalloc.c                  |    8 +
>  32 files changed, 3411 insertions(+), 30 deletions(-)
>  create mode 100644 Documentation/vm/hmm.txt
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
>