linux-ext4 - Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e6a9e3cc-9dca-946a-c3fc-f86753fe8fd4@amd.com>
Date:   Wed, 1 Sep 2021 19:07:34 -0400
From:   Felix Kuehling <felix.kuehling@....com>
To:     Dave Chinner <david@...morbit.com>
Cc:     Christoph Hellwig <hch@....de>,
        "Sierra Guiza, Alejandro (Alex)" <alex.sierra@....com>,
        akpm@...ux-foundation.org, linux-mm@...ck.org,
        rcampbell@...dia.com, linux-ext4@...r.kernel.org,
        linux-xfs@...r.kernel.org, amd-gfx@...ts.freedesktop.org,
        dri-devel@...ts.freedesktop.org, jgg@...dia.com, jglisse@...hat.com
Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration

On 2021-09-01 6:03 p.m., Dave Chinner wrote:
> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
>>> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>>>> it's something we need to do in the migration helpers.
>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>> then.  Why do we need any special treatment at all for memory that
>>>>> has normal struct pages and is part of the direct kernel map?
>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>> page tables and for coherent access from the CPU.
>>> That's the user page tables.  What about the kernel direct map?
>>> If there is a normal kernel struct page backing there really should
>>> be no need for the pgmap.
>> I'm not sure. The physical address ranges are in the UEFI system address
>> map as special-purpose memory. Does Linux create the struct pages and
>> kernel direct map for that without a pgmap call? I didn't see that last
>> time I went digging through that code.
>>
>>
>>>>  From an application
>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>> to migrate data back-and-forth between system memory and device memory.
>>> I don't really understand that part.  file backed pages are always
>>> allocated by the file system using the pagecache helpers, that is
>>> using the page allocator.  Anonymouns memory also always comes from
>>> the page allocator.
>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>> memory by the migrate_vma_* helpers for more efficient access by our
>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>> would expect the same thing to work for DEVICE_PUBLIC memory.
>>
>> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
>> currently work, but that's something I'm hoping to fix at some point.
> FWIW, I'd love to see the architecture documents that define how
> filesystems are supposed to interact with this device private
> memory. This whole "hand filesystem controlled memory to other
> devices" is a minefield that is trivial to get wrong iand very
> difficult to fix - just look at the historical mess that RDMA
> to/from file backed and/or DAX pages has been.
>
> So, really, from my perspective as a filesystem engineer, I want to
> see an actual specification for how this new memory type is going to
> interact with filesystem and the page cache so everyone has some
> idea of how this is going to work and can point out how it doesn't
> work before code that simply doesn't work is pushed out into
> production systems and then merged....

OK. To be clear, that's not part of this patch series. And I have no 
authority to push anything in this part of the kernel, so you have 
nothing to fear. ;)

FWIW, we already have the ability to map file-backed system memory pages 
into device page tables with HMM and interval notifiers. But we cannot 
currently migrate them to ZONE_DEVICE pages. Beyond that, my 
understanding of how filesystems and page cache work is rather 
superficial at this point. I'll keep your name in mind for when I am 
ready to discuss this in more detail.

Cheers,
   Felix


>
> Cheers,
>
> Dave.