Date:   Thu, 6 Apr 2017 00:59:50 -0400
From:   Jerome Glisse <jglisse@...hat.com>
To:     "Figo.zhang" <figo1802@...il.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        John Hubbard <jhubbard@...dia.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        David Nellans <dnellans@...dia.com>
Subject: Re: [HMM 00/16] HMM (Heterogeneous Memory Management) v19

On Thu, Apr 06, 2017 at 11:22:12AM +0800, Figo.zhang wrote:

[...]

> > Heterogeneous Memory Management (HMM) (description and justification)
> >
> > Today, device drivers expose dedicated memory allocation APIs through
> > their device files, often relying on a combination of IOCTL and mmap
> > calls. The device can only access and use memory allocated through this
> > API. This effectively splits the program address space into objects
> > allocated for and usable by the device, and other regular memory
> > (malloc, mmap of a file, shared memory, …) accessible only by the CPU
> > (or, in a very limited way, by a device through pinned memory).
> >
> > Allowing different isolated components of a program to use a device
> > thus requires duplicating the input data structures using the device
> > memory allocator. This is reasonable for simple data structures (array,
> > grid, image, …) but it gets extremely complex with advanced data
> > structures (list, tree, graph, …) that rely on a web of memory pointers.
> > This is becoming a serious limitation on the kind of workloads that can
> > be offloaded to devices like GPUs.
> >
> 
> How is this handled by the current GPU software stack? By maintaining a
> complex middleware framework/HAL?

Yes, you still need a framework like OpenCL or CUDA. There is work under
way to leverage the GPU directly from languages like C++, so I expect that
the HAL will be hidden more and more for a larger group of programmers.
Note that I still expect some programmers will want to program closer to
the hardware to extract every bit of performance they can.

For OpenCL, you need HMM to implement what is described as the fine-grained
system SVM memory model (see the OpenCL 2.0 or later specification).
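
As a minimal host-side sketch (not from the patchset) of what fine-grained
system SVM buys you: an ordinary malloc'd buffer is handed straight to a
kernel, with no clSVMAlloc and no explicit copies. The command queue and
the "scale" kernel are assumed to already exist, and error handling is
omitted.

    /* Fine-grained system SVM sketch (OpenCL 2.0 host API). */
    #include <CL/cl.h>
    #include <stdlib.h>

    void run_scale(cl_command_queue queue, cl_kernel scale, size_t n)
    {
        float *data = malloc(n * sizeof *data);  /* plain system memory */
        for (size_t i = 0; i < n; i++)
            data[i] = (float)i;

        /* With fine-grained *system* SVM a raw pointer is legal here;
         * the device sees the process address space thanks to HMM. */
        clSetKernelArgSVMPointer(scale, 0, data);
        clEnqueueNDRangeKernel(queue, scale, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(queue);

        /* The CPU reads the result in place; if the driver migrated the
         * pages to device memory they come back transparently. */
        float first = data[0];
        (void)first;
        free(data);
    }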

> > New industry standards like C++, OpenCL, and CUDA are pushing to remove
> > this barrier. This requires a shared address space between the GPU
> > device and the CPU so that the GPU can access any memory of a process
> > (while still obeying memory protections such as read-only).
> 
> Can the GPU access the whole process's VMAs, or only those VMAs whose
> backing system memory has been migrated into the GPU page table?

The whole process's VMAs; memory does not need to be migrated to device
memory. Migration is an optional feature that is needed for performance, but
the GPU can access system memory just fine.
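
To make the "access anywhere" part concrete, here is a rough sketch
(launch_sum_on_gpu() is a hypothetical stand-in for whatever runtime
actually dispatches the work, not a real API): a malloc'd linked list is
handed to the device as-is, with no per-node copy into a device allocator,
because every CPU pointer is also valid on the GPU.

    /* Sketch only: launch_sum_on_gpu() is hypothetical; the real dispatch
     * would go through OpenCL, CUDA, etc. The point is that the nodes live
     * in ordinary malloc'd memory and the device walks the same pointers. */
    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;  /* plain CPU pointer, valid on the GPU too */
    };

    extern void launch_sum_on_gpu(struct node *head);  /* hypothetical */

    int main(void)
    {
        struct node *head = NULL;
        for (int i = 0; i < 16; i++) {
            struct node *e = malloc(sizeof *e);
            e->value = i;
            e->next  = head;
            head = e;
        }
        launch_sum_on_gpu(head);  /* no duplication, no translation */
        return 0;
    }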

[...]

> > When a page backing an address of a process is migrated to device
> > memory, the CPU page table entry is set to a new, specific swap entry.
> > CPU access to such an address triggers a migration back to system
> > memory, just as if the page had been swapped to disk. HMM also blocks
> > anyone from pinning a ZONE_DEVICE page so that it can always be migrated
> > back to system memory if the CPU accesses it. Conversely, HMM does not
> > migrate to device memory any page that is pinned in system memory.
> >
> 
> Is the purpose of migrating system pages to the device so that the device
> can read system memory?
> If the CPU/program wants to read the device data, does it need to pin/map
> the device memory into the process address space?
> If multiple applications want to read the same device memory region
> concurrently, how is that done?

The purpose of migrating to device memory is to leverage device memory
bandwidth: PCIe bandwidth is about 32GB/s, while device memory bandwidth is
between 256GB/s and 1TB/s, and device memory also has lower latency.

The CPU cannot access device memory. Strictly speaking it can, in a limited
way over PCIe, but that would violate the memory model programmers get for
regular system memory, hence for all intents and purposes it is better to
say that the CPU cannot access any of the device memory.

Shared VMAs will just work, so if a VMA is shared between two processes then
both processes can access the same memory. All the semantics that are valid
on the CPU are also valid on the GPU. Nothing changes there.
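
From user space, the fault-back behaviour described in the quoted paragraph
above looks something like the sketch below. The migration ioctl is invented
for illustration (drivers expose their own interfaces; it is not part of HMM
itself), but the CPU side is the point: a plain load after migration faults
on the special swap entry and the page returns to system memory
transparently, just like swap-in.

    /* Sketch; MIGRATE_TO_DEVICE_IOCTL and struct migrate_args are
     * hypothetical placeholders for a driver-specific interface. */
    #include <stdlib.h>
    #include <sys/ioctl.h>

    struct migrate_args { void *addr; size_t len; };  /* hypothetical */
    #define MIGRATE_TO_DEVICE_IOCTL 0x4d49            /* hypothetical */

    float compute_and_read_back(int dev_fd, size_t n)
    {
        float *buf = malloc(n * sizeof *buf);
        for (size_t i = 0; i < n; i++)
            buf[i] = 1.0f;

        /* Ask the driver to migrate the range to device memory so the
         * GPU works on it at device-memory bandwidth. */
        struct migrate_args args = { buf, n * sizeof *buf };
        ioctl(dev_fd, MIGRATE_TO_DEVICE_IOCTL, &args);

        /* ... device kernels run on buf ... */

        /* Ordinary CPU load: faults and migrates the page back. */
        float v = buf[0];
        free(buf);
        return v;
    }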


> It would be better to have a diagram showing how the CPU and GPU share the
> address space.

I am not good at making ASCII diagrams, nor would I know how to draw this.
Any valid address on the CPU is valid on the GPU; that's really all there is
to it. Migration to device memory is orthogonal to the shared address space.

Cheers,
Jérôme
