[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <e64ee7c6-5113-4180-94e8-2fd7e711d5e2@redhat.com>
Date: Fri, 25 Jul 2025 11:51:54 +0200
From: David Hildenbrand <david@...hat.com>
To: Alistair Popple <apopple@...dia.com>
Cc: Jason Gunthorpe <jgg@...pe.ca>, Matthew Wilcox <willy@...radead.org>,
Yonatan Maman <ymaman@...dia.com>, Jérôme Glisse
<jglisse@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>,
Leon Romanovsky <leon@...nel.org>, Lyude Paul <lyude@...hat.com>,
Danilo Krummrich <dakr@...nel.org>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Ben Skeggs <bskeggs@...dia.com>,
Michael Guralnik <michaelgur@...dia.com>, Or Har-Toov <ohartoov@...dia.com>,
Daisuke Matsuda <dskmtsd@...il.com>, Shay Drory <shayd@...dia.com>,
linux-mm@...ck.org, linux-rdma@...r.kernel.org,
dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Gal Shalom <GalShalom@...dia.com>
Subject: Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device
private pages
On 25.07.25 02:31, Alistair Popple wrote:
> On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
>> On 23.07.25 06:10, Alistair Popple wrote:
>>> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>>>> So what is it?
>>>>>
>>>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>>>> something the CPU can't actually address anyway and this causes real
>>>>> problems
>>>>
>>>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>>>> entries that point to some kind of opaque driver managed memory.
>>>>
>>>> We have alot of assumptions all over about pfn/phys to page
>>>> relationships so anything that has a struct page also has to come with
>>>> a fake PFN today..
>>>
>>> Hmm ... maybe. To get that PFN though we have to come from either a special
>>> swap entry which we already have special cases for, or a struct page (which is
>>> a device private page) which we mostly have to handle specially anyway. I'm not
>>> sure there's too many places that can sensibly handle a fake PFN without somehow
>>> already knowing it is device-private PFN.
>>>
>>>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>>>> into some data structure holding pointers to the struct pages. So instead of
>>>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>>>> and page_to_device_private_index().
>>>>
>>>> It could work, but any of the pfn conversions would have to be tracked
>>>> down.. Could be troublesome.
>>>
>>> I looked at this a while back and I'm reasonably optimistic that this is doable
>>> because we already have to treat these specially everywhere anyway.
>> How would that look like?
>>
>> E.g., we have code like
>>
>> if (is_device_private_entry(entry)) {
>> page = pfn_swap_entry_to_page(entry);
>> folio = page_folio(page);
>>
>> ...
>> folio_get(folio);
>> ...
>> }
>>
>> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
>> non-pfn swap entries.
>>
>> Would it then be something like
>>
>> if (is_device_private_entry(entry)) {
>> page = device_private_entry_to_page(entry);
>>
>> ...
>> }
>>
>> Whereby device_private_entry_to_page() obtains the "struct page" not via the
>> PFN but some other magical (index) value?
>
> Exactly. The observation being that when you convert a PTE from a swap entry
> to a page we already know it is a device private entry, so can go look up the
> struct page with special magic (eg. an index into some other array or data
> structure).
>
> And if you have a struct page you already know it's a device private page so if
> you need to create the swap entry you can look up the magic index using some
> alternate function.
>
> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.
I guess memory compaction and friends don't apply to ZONE_DEVICE, and
even memory_failure() handling goes a separate path.
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists