linux-kernel - RE: Enabling peer to peer device transactions for PCIe devices

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <MWHPR12MB1694CF1A306B61BEEBFC77EFF78C0@MWHPR12MB1694.namprd12.prod.outlook.com>
Date:   Wed, 30 Nov 2016 17:10:53 +0000
From:   "Deucher, Alexander" <Alexander.Deucher@....com>
To:     'Haggai Eran' <haggaie@...lanox.com>,
        Jason Gunthorpe <jgunthorpe@...idianresearch.com>
CC:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
        "linux-nvdimm@...1.01.org" <linux-nvdimm@...1.01.org>,
        "Koenig, Christian" <Christian.Koenig@....com>,
        "Suthikulpanit, Suravee" <Suravee.Suthikulpanit@....com>,
        "Bridgman, John" <John.Bridgman@....com>,
        "Linux-media@...r.kernel.org" <Linux-media@...r.kernel.org>,
        "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
        "logang@...tatee.com" <logang@...tatee.com>,
        "dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
        Max Gurtovoy <maxg@...lanox.com>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "Sagalovitch, Serguei" <Serguei.Sagalovitch@....com>,
        "Blinzer, Paul" <Paul.Blinzer@....com>,
        "Kuehling, Felix" <Felix.Kuehling@....com>,
        "Sander, Ben" <ben.sander@....com>
Subject: RE: Enabling peer to peer device transactions for PCIe devices

> -----Original Message-----
> From: Haggai Eran [mailto:haggaie@...lanox.com]
> Sent: Wednesday, November 30, 2016 5:46 AM
> To: Jason Gunthorpe
> Cc: linux-kernel@...r.kernel.org; linux-rdma@...r.kernel.org; linux-
> nvdimm@...1.01.org; Koenig, Christian; Suthikulpanit, Suravee; Bridgman,
> John; Deucher, Alexander; Linux-media@...r.kernel.org;
> dan.j.williams@...el.com; logang@...tatee.com; dri-
> devel@...ts.freedesktop.org; Max Gurtovoy; linux-pci@...r.kernel.org;
> Sagalovitch, Serguei; Blinzer, Paul; Kuehling, Felix; Sander, Ben
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> >>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> >>>> user-space and the GPU not to migrate it. If they do, the MR gets
> >>>> destroyed immediately.
> >>> That sounds horrible. How can that possibly work? What if the MR is
> >>> being used when the GPU decides to migrate?
> >> Naturally this doesn't support migration. The GPU is expected to pin
> >> these pages as long as the MR lives. The MR invalidation is done only as
> >> a last resort to keep system correctness.
> >
> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.
> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

Christian sent out an RFC patch a while back that extended the OOM to cover memory allocated for the GPU:
https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html

Alex

> 
> >
> >> I think it is similar to how non-ODP MRs rely on user-space today to
> >> keep them correct. If you do something like madvise(MADV_DONTNEED)
> on a
> >> non-ODP MR's pages, you can still get yourself into a data corruption
> >> situation (HCA sees one page and the process sees another for the same
> >> virtual address). The pinning that we use only guarentees the HCA's page
> >> won't be reused.
> >
> > That is not really data corruption - the data still goes where it was
> > originally destined. That is an application violating the
> > requirements of a MR.
> I guess it is a matter of terminology. If you compare it to the ODP case
> or the CPU case then you usually expect a single virtual address to map to
> a single physical page. Violating this cause some of your writes to be dropped
> which is a data corruption in my book, even if the application caused it.
> 
> > An application cannot munmap/mremap a VMA
> > while a non ODP MR points to it and then keep using the MR.
> Right. And it is perfectly fine to have some similar requirements from the
> application
> when doing peer to peer with a non-ODP MR.
> 
> > That is totally different from a GPU driver wanthing to mess with
> > translation to physical pages.
> >
> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?
> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer, and O_DIRECT operations that access GPU memory.
> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel,
> although that is intended to be handled by the GPU driver if I understand
> correctly.
> 
> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
> support
> for peer to peer. I'm not sure we need vmap. We need a way to have a
> scatterlist
> of MMIO pfns, and ZONE_DEVICE allows that.
> 
> Haggai