Message-ID: <2a8a6582-f3de-5cda-0c6e-1c93774147e0@amd.com>
Date: Wed, 23 Nov 2016 09:51:14 +0100
From: Christian König <christian.koenig@....com>
To: Dan Williams <dan.j.williams@...el.com>,
Serguei Sagalovitch <serguei.sagalovitch@....com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"linux-nvdimm@...ts.01.org" <linux-nvdimm@...ts.01.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"Kuehling, Felix" <Felix.Kuehling@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
"Sander, Ben" <ben.sander@....com>,
"Suthikulpanit, Suravee" <Suravee.Suthikulpanit@....com>,
"Deucher, Alexander" <Alexander.Deucher@....com>,
"Blinzer, Paul" <Paul.Blinzer@....com>,
"Linux-media@...r.kernel.org" <Linux-media@...r.kernel.org>
Subject: Re: Enabling peer to peer device transactions for PCIe devices
On 23.11.2016 at 08:49, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <daniel@...ll.ch> wrote:
>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>>> <serguei.sagalovitch@....com> wrote:
>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.williams@...el.com>
>>>>> wrote:
>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>>> <serguei.sagalovitch@....com> wrote:
>>>>>>> I personally like the "device-DAX" idea, but my concerns are:
>>>>>>>
>>>>>>> - How well will it co-exist with the DRM infrastructure /
>>>>>>> implementations, in the part dealing with CPU pointers?
>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>>>> not migratable. To be honest I do not know how well that co-exists
>>>>>> with drm infrastructure.
>>>>>>
>>>>>>> - How well will we be able to handle the case when we need to
>>>>>>> "move"/"evict" memory/data to a new location, so that the CPU
>>>>>>> pointer points to the new physical location/address (which may
>>>>>>> not be in PCI device memory at all)?
>>>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>>>> interface is for cases where all that is needed is a direct-mapping to
>>>>>> a statically-allocated physical-address range, be it persistent memory
>>>>>> or some other special reserved memory range.
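(Side note: a minimal userspace sketch of what such a direct mapping looks
like; the /dev/dax0.0 path and the 2 MiB size/alignment are assumptions for
illustration, not something from this thread.)

	/*
	 * Rough sketch only: map a device-DAX instance into the process.
	 * Assumes a /dev/dax0.0 device with 2 MiB alignment; path, size
	 * and alignment are illustrative.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one 2 MiB chunk of the device */
		int fd = open("/dev/dax0.0", O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* device-DAX only allows shared, suitably aligned mappings */
		void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* CPU loads/stores go straight to the underlying physical range */
		memset(addr, 0, len);

		munmap(addr, len);
		close(fd);
		return 0;
	}

The point being that the CPU pointer is a plain mapping of one fixed
physical range, with nothing to migrate behind it.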
>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>>>> overcommit. At least that seems to be the very strong drive in all
>>>>> general-purpose gpu abstractions and implementations, where memory is
>>>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>>>> space through some magic,
>>>> It is possible that it goes the other way around: memory is requested
>>>> to be allocated and should be kept in vram for performance reasons,
>>>> but due to a possible overcommit case we need to at least temporarily
>>>> "move" such an allocation to system memory.
>>> With migration I meant migrating both ways of course. And with stuff
>>> like numactl we can also influence where exactly the malloc'ed memory
>>> is allocated originally, at least if we'd expose the vram range as a
>>> very special numa node that happens to be far away and not hold any
>>> cpu cores.
>> I don't think we should be using numa distance to reverse engineer a
>> certain allocation behavior. The latency data should be truthful, but
>> you're right we'll need a mechanism to keep general purpose
>> allocations out of that range by default. Btw, strict isolation is
>> another design point of device-dax, but I think in this case we're
>> describing something between the two extremes of full isolation and
>> full compatibility with existing numactl apis.
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly; those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.
I agree completely. From a ten-mile-high view our GPUs are just command
processors with local memory as well.
Basically this is also the whole idea behind what AMD has been pushing
with HSA for a while.
It's just that a lot of problems start to pop up when you look at all
the nasty details. For example, only part of the GPU memory is usually
accessible by the CPU.
So even if NUMA nodes provide a good foundation for this, I think there
is still a lot of code to write.
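Just to make that a bit more concrete, the following is a rough userspace
sketch (assumptions: libnuma is available and the vram range shows up as
node 2, which is completely made up) of how an application would explicitly
place an allocation on such a node:

	/*
	 * Rough sketch only: pin an allocation to a hypothetical "vram"
	 * NUMA node. Node 2 is made up for illustration; a real node id
	 * would come from sysfs. Build with -lnuma.
	 */
	#include <numa.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		int vram_node = 2;	/* assumption: GPU memory shows up here */
		size_t len = 1UL << 20;

		if (numa_available() < 0) {
			fprintf(stderr, "no NUMA support\n");
			return 1;
		}

		/* explicitly ask for pages from the vram node */
		void *buf = numa_alloc_onnode(len, vram_node);
		if (!buf) {
			fprintf(stderr, "allocation failed\n");
			return 1;
		}

		/* ... use buf; pages come from the given node (preferred,
		 * unless strict binding is enabled) ... */

		numa_free(buf, len);
		return 0;
	}

Everything not placed explicitly like that would stay on the normal system
memory nodes, which matches the idea above of excluding vram nodes from the
default policy.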
BTW: I should probably start reading up on the NUMA code in the kernel.
Any good pointers for that?
Regards,
Christian.
> Wrt isolation: there's a sliding scale of what different users expect;
> everything from fully automatic management, including migrating pages
> around if needed, to full isolation seems to be on the table. As long as
> we keep vram nodes out of any default allocation numasets, full isolation
> should be possible.
> -Daniel