Message-ID: <20250724064732.GQ402218@unreal>
Date: Thu, 24 Jul 2025 09:47:32 +0300
From: Leon Romanovsky <leon@...nel.org>
To: Matthew Brost <matthew.brost@...el.com>,
Mika Penttilä <mpenttil@...hat.com>
Cc: Francois Dugast <francois.dugast@...el.com>, airlied@...il.com,
akpm@...ux-foundation.org, apopple@...dia.com, baohua@...nel.org,
baolin.wang@...ux.alibaba.com, dakr@...nel.org, david@...hat.com,
donettom@...ux.ibm.com, jane.chu@...cle.com, jglisse@...hat.com,
kherbst@...hat.com, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, lyude@...hat.com, peterx@...hat.com,
ryan.roberts@....com, shuah@...nel.org, simona@...ll.ch,
wangkefeng.wang@...wei.com, willy@...radead.org, ziy@...dia.com,
Balbir Singh <balbirs@...dia.com>, jgg@...dia.com
Subject: Re: [PATCH] mm/hmm: Do not fault in device private pages owned by
the caller

On Thu, Jul 24, 2025 at 09:04:36AM +0300, Mika Penttilä wrote:
>
> On 7/24/25 08:57, Matthew Brost wrote:
> > On Thu, Jul 24, 2025 at 08:46:11AM +0300, Mika Penttilä wrote:
> >> On 7/24/25 08:02, Matthew Brost wrote:
> >>> On Thu, Jul 24, 2025 at 10:25:11AM +1000, Balbir Singh wrote:
> >>>> On 7/23/25 05:34, Francois Dugast wrote:
> >>>>> When the PMD swap entry is device private and owned by the caller,
> >>>>> skip the range faulting and instead just set the correct HMM PFNs.
> >>>>> This is similar to the logic for PTEs in hmm_vma_handle_pte().
> >>>>>
> >>>>> For now, each hmm_pfns[i] entry is populated as it is currently done
> >>>>> in hmm_vma_handle_pmd() but this might not be necessary. A follow-up
> >>>>> optimization could be to make use of the order and skip populating
> >>>>> subsequent PFNs.
> >>>> I think we should test and remove these now
> >>>>
> >>> +Jason, Leon – perhaps either of you can provide insight into why
> >>> hmm_vma_handle_pmd fully populates the HMM PFNs when a higher-order page
> >>> is found.
> >>>
> >>> If we can be assured that changing this won’t break other parts of the
> >>> kernel, I agree it should be removed. A snippet of documentation should
> >>> also be added indicating that when higher-order PFNs are found,
> >>> subsequent PFNs within the range will remain unpopulated. I can verify
> >>> that GPU SVM works just fine without these PFNs being populated.
> >> afaics the device can consume the range as smaller pages also, and some
> >> hmm users depend on that.
> >>
> > Sure, but I think that should be fixed in the device code. If a
> > large-order PFN is found, the subsequent PFNs can clearly be inferred.
> > It's a micro-optimization here, but devices or callers capable of
> > handling this properly shouldn't force a hacky, less optimal behavior on
> > core code. If anything relies on the current behavior, we should fix it
> > and ensure correctness.
>
> Yes, sure, the device code can be changed, but I meant to say we can't
> just delete those lines without breaking existing users.
Mika is right. The RDMA subsystem and the HMM users there need to be updated.
We have a special flag (IB_ACCESS_HUGETLB) that prepares the whole RDMA
stack to handle large-order PFNs. If this flag is not provided, we need
to fall back to the basic device page size (4k), and for that we expect
a fully populated PFN list.
Thanks