linux-kernel - Re: [v7 00/16] mm: support device-private THP

Message-ID: <aR6szF6Ga8K2ZRjR@lstrano-desk.jf.intel.com>
Date: Wed, 19 Nov 2025 21:53:16 -0800
From: Matthew Brost <matthew.brost@...el.com>
To: Balbir Singh <balbirs@...dia.com>
CC: Andrew Morton <akpm@...ux-foundation.org>, <linux-kernel@...r.kernel.org>,
	<dri-devel@...ts.freedesktop.org>, <linux-mm@...ck.org>, David Hildenbrand
	<david@...hat.com>, Zi Yan <ziy@...dia.com>, Joshua Hahn
	<joshua.hahnjy@...il.com>, Rakie Kim <rakie.kim@...com>, Byungchul Park
	<byungchul@...com>, Gregory Price <gourry@...rry.net>, Ying Huang
	<ying.huang@...ux.alibaba.com>, Alistair Popple <apopple@...dia.com>, "Oscar
 Salvador" <osalvador@...e.de>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>, "Liam R. Howlett"
	<Liam.Howlett@...cle.com>, Nico Pache <npache@...hat.com>, Ryan Roberts
	<ryan.roberts@....com>, Dev Jain <dev.jain@....com>, Barry Song
	<baohua@...nel.org>, Lyude Paul <lyude@...hat.com>, Danilo Krummrich
	<dakr@...nel.org>, David Airlie <airlied@...il.com>, Simona Vetter
	<simona@...ll.ch>, Ralph Campbell <rcampbell@...dia.com>, Mika
 Penttilä <mpenttil@...hat.com>, Francois Dugast
	<francois.dugast@...el.com>
Subject: Re: [v7 00/16] mm: support device-private THP

On Thu, Nov 20, 2025 at 02:58:58PM +1100, Balbir Singh wrote:
> On 11/20/25 14:15, Matthew Brost wrote:
> > On Thu, Nov 20, 2025 at 01:59:09PM +1100, Balbir Singh wrote:
> >> On 11/20/25 13:50, Balbir Singh wrote:
> >>> On 11/20/25 13:40, Matthew Brost wrote:
> >>>> On Wed, Nov 12, 2025 at 10:52:43AM +1100, Balbir Singh wrote:
> >>>>> On 11/12/25 10:43, Andrew Morton wrote:
> >>>>>> On Thu, 9 Oct 2025 03:33:33 -0700 Matthew Brost <matthew.brost@...el.com> wrote:
> >>>>>>
> >>>>>>>>>> This patch series introduces support for Transparent Huge Page
> >>>>>>>>>> (THP) migration in zone device-private memory. The implementation enables
> >>>>>>>>>> efficient migration of large folios between system memory and
> >>>>>>>>>> device-private memory
> >>>>>>>>>
> >>>>>>>>> Lots of chatter for the v6 series, but none for v7.  I hope that's a
> >>>>>>>>> good sign.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I hope so too, I've tried to address the comments in v6.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Circling back to this series, we will integrate and test this version.
> >>>>>>
> >>>>>> How'd it go?
> >>>>>>
> >>>>
> >>>> My apologies for the delay—I got distracted by other tasks in Xe (my
> >>>> driver) and was out for a bit. Unfortunately, this series breaks
> >>>> something in the existing core MM code for the Xe SVM implementation. I
> >>>> have an extensive test case that hammers on SVM, which fully passes
> >>>> prior to applying this series, but fails randomly with the series
> >>>> applied (to drm-tip-rc6) due to the below kernel lockup.
> >>>>
> >>>> I've tried to trace where the migration PTE gets installed but not
> >>>> removed or isolate a test case which causes this failure but no luck so
> >>>> far. I'll keep digging as I have time.
> >>>>
> >>>> Beyond that, if I enable Xe SVM + THP, it seems to mostly work (though
> >>>> the same issue as above eventually occurs), but I do need two additional
> >>>> core MM patches—one is new code required for Xe, and the other could be
> >>>> considered a bug fix. Those patches can be included when Xe merges SVM THP
> >>>> support, but we need to at least not break Xe SVM before this series merges.
> >>>>
> >>>> Stack trace:
> >>>>
> >>>> INFO: task kworker/u65:2:1642 blocked for more than 30 seconds.
> >>>> [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
> >>>> [  212.630561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> [  212.638285] task:kworker/u65:2   state:D stack:0     pid:1642 tgid:1642  ppid:2      task_flags:0x4208060 flags:0x00080000
> >>>> [  212.638288] Workqueue: xe_page_fault_work_queue xe_pagefault_queue_work [xe]
> >>>> [  212.638323] Call Trace:
> >>>> [  212.638324]  <TASK>
> >>>> [  212.638325]  __schedule+0x4b0/0x990
> >>>> [  212.638330]  schedule+0x22/0xd0
> >>>> [  212.638331]  io_schedule+0x41/0x60
> >>>> [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
> >>>> [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
> >>>> [  212.638339]  migration_entry_wait+0xd2/0xe0
> >>>> [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
> >>>> [  212.638343]  walk_pgd_range+0x51d/0xa40
> >>>> [  212.638345]  __walk_page_range+0x75/0x1e0
> >>>> [  212.638347]  walk_page_range_mm+0x138/0x1f0
> >>>> [  212.638349]  hmm_range_fault+0x59/0xa0
> >>>> [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> >>>> [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> >>>> [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
> >>>> [  212.638375]  ? update_load_avg+0x7f/0x6c0
> >>>> [  212.638377]  ? update_curr+0x13d/0x170
> >>>> [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
> >>>> [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> >>>> [  212.638420]  process_one_work+0x16e/0x2e0
> >>>> [  212.638422]  worker_thread+0x284/0x410
> >>>> [  212.638423]  ? __pfx_worker_thread+0x10/0x10
> >>>> [  212.638425]  kthread+0xec/0x210
> >>>> [  212.638427]  ? __pfx_kthread+0x10/0x10
> >>>> [  212.638428]  ? __pfx_kthread+0x10/0x10
> >>>> [  212.638430]  ret_from_fork+0xbd/0x100
> >>>> [  212.638433]  ? __pfx_kthread+0x10/0x10
> >>>> [  212.638434]  ret_from_fork_asm+0x1a/0x30
> >>>> [  212.638436]  </TASK>
> >>>>
> >>>
> >>> Hi, Matt
> >>>
> >>> Thanks for the report, two questions
> >>>
> >>> 1. Are you using mm/mm-unstable, we've got some fixes in there (including fixes to remove_migration_pmd())
> > 
> > remove_migration_pmd - This is a PTE migration entry.
> > 
> 
> I don't have your symbols; I thought we were hitting the following condition in the walk:
> 
> 	if (thp_migration_supported() && pmd_is_migration_entry(pmd)) {
> 
> But sounds like you are not, PMD/THP has not been enabled in this case
> 

No, migration_entry_wait rather than pmd_migration_entry_wait.

> 
> >>>    - Generally a left behind migration entry is a symptom of a failed migration that did not clean up
> >>>      after itself.
> > 
> > I'm on drm-tip as I generally need the latest version of my driver
> > because of the speed we move at.
> > 
> > Yes, I agree it looks like somehow a migration PTE is not getting
> > properly removed.
> > 
> > I'm happy to cherry pick any patches that you think might be helpful
> > into my tree.
> > 
> 
> Could you try the mm/mm-new tree with the current xe driver?
>

Unfortunately, this is a tough one. We land a lot of patches in Xe/DRM,
so bringing the driver up to date with an MM branch is difficult, and
I’m not an expert at merging branches. It would be nice if, in the DRM
flow, we could merge patches from outside our subsystem into a
bleeding-edge kernel for the things we typically care about—but we’d
need a maintainer to sign up for that.

> In general, w.r.t failure, I would check for the following
> 
> 1. Are the dst_pfns in migrate_vma_pages() setup correctly by the device driver?
> 2. Any failures in folio_migrate_mapping()?
> 3. In migrate_vma_finalize() check to see if remove_migration_ptes() failed
> 
> If (3) fails that will explain the left over migration entries
> 

Good tips, but I think I got it via bisect.

Offending patch is:

'mm/migrate_device: handle partially mapped folios during collection'

The failing test case involves some remap-related issue. It’s a
parameterized test, so I honestly couldn’t tell you exactly what it’s
doing beyond the fact that it seems nonsensical but stresses remap. I
thought commit '66d81853fa3d selftests/mm/hmm-tests: partial unmap,
mremap and anon_write tests' would catch this, but it looks like I need
to make the remap HMM test cases a bit more robust—similar to my
driver-side tests. I can take an action item to follow up on this.

Good news, I can tell you how to fix this...

In 'mm/migrate_device: handle partially mapped folios during collection': 

+#if 0
+                       folio = page ? page_folio(page) : NULL;
+                       if (folio && folio_test_large(folio)) {
+                               int ret;
+
+                               pte_unmap_unlock(ptep, ptl);
+                               ret = migrate_vma_split_folio(folio,
+                                                         migrate->fault_page);
+
+                               if (ret) {
+                                       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+                                       goto next;
+                               }
+
+                               addr = start;
+                               goto again;
+                       }
+#endif

You can probably just delete this and use my patch below, but if you
want to try fixing it with a quick look: if migrate_vma_split_folio
fails, you probably need to collect a hole. On success, you likely want
to continue executing the remainder of the loop. I can try playing with
this tomorrow, but it’s late here.

I had privately sent you a version of this patch as a fix for Xe, and
this one seems to work:

[PATCH] mm/migrate: Split THP found in middle of PMD during page collection

The migrate layer is not coded to handle a THP found in the middle of a
PMD. This can occur if a user manipulates mappings with mremap(). If a
THP is found mid-PMD during page collection, split it.

Cc: Balbir Singh <balbirs@...dia.com>
Signed-off-by: Matthew Brost <matthew.brost@...el.com>
---
 mm/migrate_device.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index abd9f6850db6..9ffc025bad50 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -65,6 +65,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
        struct vm_area_struct *vma = walk->vma;
        struct mm_struct *mm = vma->vm_mm;
        unsigned long addr = start, unmapped = 0;
+       struct folio *split_folio = NULL;
        spinlock_t *ptl;
        pte_t *ptep;

@@ -107,10 +108,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
                }
        }

-       ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+       ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
        if (!ptep)
                goto again;
        arch_enter_lazy_mmu_mode();
+       ptep += (addr - start) / PAGE_SIZE;

        for (; addr < end; addr += PAGE_SIZE, ptep++) {
                struct dev_pagemap *pgmap;
@@ -209,6 +211,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
                        bool anon_exclusive;
                        pte_t swp_pte;

+                       if (folio_order(folio)) {
+                               split_folio = folio;
+                               goto split;
+                       }
+
                        flush_cache_page(vma, addr, pte_pfn(pte));
                        anon_exclusive = folio_test_anon(folio) &&
                                          PageAnonExclusive(page);
@@ -287,8 +294,34 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
        if (unmapped)
                flush_tlb_range(walk->vma, start, end);

+split:
        arch_leave_lazy_mmu_mode();
-       pte_unmap_unlock(ptep - 1, ptl);
+       pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
+
+       /*
+        * XXX: No clean way to support higher-order folios that don't match PMD
+        * boundaries for now — split them instead. Once mTHP support lands, add
+        * proper support for this case.
+        *
+        * The test, which exposed this as problematic, remapped (mremap) a
+        * large folio to an unaligned address, resulting in the folio being
+        * found in the middle of the PTEs. The requested number of pages was
+        * less than the folio size. Likely to be handled gracefully by upper
+        * layers eventually, but not yet.
+        */
+       if (split_folio) {
+               int ret;
+
+               ret = split_folio(split_folio);
+               if (fault_folio != split_folio)
+                       folio_unlock(split_folio);
+               folio_put(split_folio);
+               if (ret)
+                       return migrate_vma_collect_skip(addr, end, walk);
+
+               split_folio = NULL;
+               goto again;
+       }

        return 0;
 }

If I apply the #if 0 change along with my patch above (plus one core
MM patch needed for Xe that adds a support function), Xe SVM fully
passes our test cases with both THP enabled and disabled.
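
For anyone trying to reproduce the mid-PMD case outside of Xe, here is a
minimal user-space sketch of the mapping pattern the patch comment
describes. It is not the actual failing test. It maps a PMD-aligned
anonymous region, requests THP, and then mremap()s it to an address that
is page-aligned but not PMD-aligned. The helper names (pmd_offset_bytes,
pmd_align_up, remap_thp_unaligned) and the 2 MiB PMD size are assumptions
for illustration, and whether a THP is actually allocated depends on the
system's THP settings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 2 MiB PMD size (x86-64 with 4 KiB base pages). */
#define ASSUMED_PMD_SIZE (2UL << 20)

/* Byte offset of addr within its PMD-sized window; 0 means PMD-aligned. */
static unsigned long pmd_offset_bytes(uintptr_t addr)
{
	return addr & (ASSUMED_PMD_SIZE - 1);
}

/* Round addr up to the next PMD boundary. */
static uintptr_t pmd_align_up(uintptr_t addr)
{
	return (addr + ASSUMED_PMD_SIZE - 1) & ~(ASSUMED_PMD_SIZE - 1);
}

/*
 * Hypothetical helper: map one PMD-sized, PMD-aligned anonymous region,
 * ask for THP, touch it, then move it pages_off base pages past a PMD
 * boundary. Returns the new address, or MAP_FAILED on error.
 */
static void *remap_thp_unaligned(size_t pages_off)
{
	size_t len = ASSUMED_PMD_SIZE;
	size_t off = pages_off * (size_t)sysconf(_SC_PAGESIZE);
	char *src_area, *dst_area, *src, *dst;

	assert(off > 0 && off < len);

	/* Over-allocate so a PMD-aligned source window can be carved out. */
	src_area = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Reserve address space for the (unaligned) destination. */
	dst_area = mmap(NULL, 3 * len, PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src_area == MAP_FAILED || dst_area == MAP_FAILED)
		return MAP_FAILED;

	src = (char *)pmd_align_up((uintptr_t)src_area);
	madvise(src, len, MADV_HUGEPAGE);	/* best effort, result ignored */
	src[0] = 1;				/* fault in, ideally as one THP */

	dst = (char *)pmd_align_up((uintptr_t)dst_area) + off;
	return mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
}
```

As far as I understand the mremap() path, because the destination is not
PMD-aligned the huge PMD mapping gets split during the move while the
folio itself stays large, so it ends up PTE-mapped starting in the middle
of a page table. Calling remap_thp_unaligned(1) and then migrating only
part of the returned range should approximate the failing layout, though I
have not confirmed that this alone triggers the hang.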

> >>> 2. The stack trace is from hmm_range_fault(), not something that this code touches.
> >>>
> > 
> > Agree this is a symptom of the above issue.
> > 
> >>> The stack trace shows your code is seeing a migration entry and waiting on it.
> >>> Can you please provide a reproducer for the issue? In the form of a test in hmm-tests.c
> >>>
> > 
> > That will be my plan. Right now I'm opening my test up which runs 1000s
> > of variations of SVM tests and the test that hangs is not consistent.
> > Some of these are threaded or multi-process so it might possibly be a
> > timing issue which could be hard to reproduce in hmm-tests.c. I'll do my
> > best here.
> > 
> >>> Have you been able to bisect the issue?
> >>
> > 
> > That is my next step along with isolating a test case.
> > 
> >> Also could you please try with 10b9feee2d0d ("mm/hmm: populate PFNs from PMD swap entry")
> >> reverted?
> >>
> > 
> > I can try but I highly doubt this is related. The hanging HMM code is in
> > the PTE walk step after this; also, I am not even enabling THP device pages
> > in my SVM code to reproduce this.
> > 
> 
> Thanks, do regular hmm-tests pass for you in that setup/environment?
> 

Yes. As noted above, I need to make the remap HMM case a bit more
robust. I’ll try to get to this before the Thanksgiving break in the US
(next Thursday-Friday).

Matt

> Balbir
> 
> > Matt
> > 
> >>>
> >>> Balbir
> >>>
> >>>
> >>>> Matt 
> >>>>
> >>>>>> Balbir, what's the status here?  It's been a month and this series
> >>>>>> still has a "needs a new version" feeling to it.  If so, very soon
> >>>>>> please.
> >>>>>>
> >>>>>
> >>>>> I don't think this needs a new revision, I've been testing frequently
> >>>>> at my end to see if I can catch any regressions. I have a patch update for
> >>>>> mm-migrate_device-add-thp-splitting-during-migration.patch, it can be applied
> >>>>> on top or I can send a new version of the patch. I was waiting
> >>>>> on any feedback before I sent the patch out, but I'll do it now.
> >>>>>
> >>>>>> TODOs which I have noted are
> >>>>>>
> >>>>>> https://lkml.kernel.org/r/aOePfeoDuRW+prFq@lstrano-desk.jf.intel.com
> >>>>>
> >>>>> This was a clarification on the HMM patch mentioned in the changelog
> >>>>>
> >>>>>> https://lkml.kernel.org/r/CABzRoyZZ8QLF5PSeDCVxgcnQmF9kFQ3RZdNq0Deik3o9OrK+BQ@mail.gmail.com
> >>>>>
> >>>>> That's a minor comment on not using a temporary declaration, I don't think we need it, let me know if you feel strongly
> >>>>>
> >>>>>> https://lkml.kernel.org/r/D2A4B724-E5EF-46D3-9D3F-EBAD9B22371E@nvidia.com
> >>>>>
> >>>>> I have a patch for this, which I posted, I can do an update and resend it if required (the one mentioned above)
> >>>>>
> >>>>>> https://lkml.kernel.org/r/62073ca1-5bb6-49e8-b8d4-447c5e0e582e@
> >>>>>>
> >>>>>
> >>>>> I can't seem to open this
> >>>>>
> >>>>>> plus a general re-read of the
> >>>>>> mm-migrate_device-add-thp-splitting-during-migration.patch review
> >>>>>> discussion.
> >>>>>>
> >>>>> That's the patch I have
> >>>>>
> >>>>> Thanks for following up
> >>>>> Balbir
> >>>
> >>
>