Message-ID: <aba6cc29-3ce1-4897-8397-05e8c7b7f8f9@nvidia.com>
Date: Fri, 15 Aug 2025 22:09:16 +1000
From: Balbir Singh <balbirs@...dia.com>
To: Matthew Brost <matthew.brost@...el.com>
Cc: Mika Penttilä <mpenttil@...hat.com>,
 dri-devel@...ts.freedesktop.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, Zi Yan <ziy@...dia.com>,
 Joshua Hahn <joshua.hahnjy@...il.com>, Rakie Kim <rakie.kim@...com>,
 Byungchul Park <byungchul@...com>, Gregory Price <gourry@...rry.net>,
 Ying Huang <ying.huang@...ux.alibaba.com>,
 Alistair Popple <apopple@...dia.com>, Oscar Salvador <osalvador@...e.de>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>, Nico Pache <npache@...hat.com>,
 Ryan Roberts <ryan.roberts@....com>, Dev Jain <dev.jain@....com>,
 Barry Song <baohua@...nel.org>, Lyude Paul <lyude@...hat.com>,
 Danilo Krummrich <dakr@...nel.org>, David Airlie <airlied@...il.com>,
 Simona Vetter <simona@...ll.ch>, Ralph Campbell <rcampbell@...dia.com>,
 Francois Dugast <francois.dugast@...el.com>
Subject: Re: [v3 03/11] mm/migrate_device: THP migration of zone device pages

On 8/15/25 10:04, Matthew Brost wrote:
> On Fri, Aug 15, 2025 at 08:51:21AM +1000, Balbir Singh wrote:
>> On 8/13/25 10:07, Mika Penttilä wrote:
>>>
>>> On 8/13/25 02:36, Balbir Singh wrote:
>>>
>>>> On 8/12/25 15:35, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/12/25 05:40, Balbir Singh wrote:
>>>>>
>>>>>> MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
>>>>>> migrate_vma_setup(), and MIGRATE_PFN_COMPOUND will cause device pages
>>>>>> to be migrated as compound pages during device pfn migration.
>>>>>>
>>>>>> migrate_device code paths go through the collect, setup
>>>>>> and finalize phases of migration.
>>>>>>
>>>>>> The entries in the src and dst arrays passed to these functions still
>>>>>> remain at PAGE_SIZE granularity. When a compound page is passed,
>>>>>> the first entry has the PFN along with MIGRATE_PFN_COMPOUND
>>>>>> and other flags set (MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID); the
>>>>>> remaining (HPAGE_PMD_NR - 1) entries are filled with zeroes. This
>>>>>> representation allows the compound page to be split into smaller
>>>>>> page sizes.
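
A rough sketch of that layout (illustrative only; migrate_pfn() and the flag
names are taken from this patch, the loop itself is not part of it; assume
"folio" is the THP being collected and "src" is the migrate src array):

	/* Encode one THP as a single compound entry in src[]. */
	unsigned long pfn = page_to_pfn(folio_page(folio, 0));
	unsigned int i;

	/* migrate_pfn() shifts the PFN into place and sets MIGRATE_PFN_VALID */
	src[0] = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND;
	for (i = 1; i < HPAGE_PMD_NR; i++)
		src[i] = 0;	/* zeroed so the entry can later be split to PAGE_SIZE */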
>>>>>>
>>>>>> migrate_vma_collect_hole() and migrate_vma_collect_pmd() are now THP
>>>>>> aware. Two new helper functions, migrate_vma_collect_huge_pmd()
>>>>>> and migrate_vma_insert_huge_pmd_page(), have been added.
>>>>>>
>>>>>> migrate_vma_collect_huge_pmd() can collect THP pages, but if for
>>>>>> some reason this fails, there is fallback support to split the folio
>>>>>> and migrate it.
>>>>>>
>>>>>> migrate_vma_insert_huge_pmd_page() closely follows the logic of
>>>>>> migrate_vma_insert_page().
>>>>>>
>>>>>> Support for splitting pages as needed for migration will follow in
>>>>>> later patches in this series.
>>>>>>
>>>>>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>>>>>> Cc: David Hildenbrand <david@...hat.com>
>>>>>> Cc: Zi Yan <ziy@...dia.com>
>>>>>> Cc: Joshua Hahn <joshua.hahnjy@...il.com>
>>>>>> Cc: Rakie Kim <rakie.kim@...com>
>>>>>> Cc: Byungchul Park <byungchul@...com>
>>>>>> Cc: Gregory Price <gourry@...rry.net>
>>>>>> Cc: Ying Huang <ying.huang@...ux.alibaba.com>
>>>>>> Cc: Alistair Popple <apopple@...dia.com>
>>>>>> Cc: Oscar Salvador <osalvador@...e.de>
>>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>>>>>> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
>>>>>> Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>
>>>>>> Cc: Nico Pache <npache@...hat.com>
>>>>>> Cc: Ryan Roberts <ryan.roberts@....com>
>>>>>> Cc: Dev Jain <dev.jain@....com>
>>>>>> Cc: Barry Song <baohua@...nel.org>
>>>>>> Cc: Lyude Paul <lyude@...hat.com>
>>>>>> Cc: Danilo Krummrich <dakr@...nel.org>
>>>>>> Cc: David Airlie <airlied@...il.com>
>>>>>> Cc: Simona Vetter <simona@...ll.ch>
>>>>>> Cc: Ralph Campbell <rcampbell@...dia.com>
>>>>>> Cc: Mika Penttilä <mpenttil@...hat.com>
>>>>>> Cc: Matthew Brost <matthew.brost@...el.com>
>>>>>> Cc: Francois Dugast <francois.dugast@...el.com>
>>>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@...dia.com>
>>>>>> ---
>>>>>>  include/linux/migrate.h |   2 +
>>>>>>  mm/migrate_device.c     | 457 ++++++++++++++++++++++++++++++++++------
>>>>>>  2 files changed, 396 insertions(+), 63 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>>> index acadd41e0b5c..d9cef0819f91 100644
>>>>>> --- a/include/linux/migrate.h
>>>>>> +++ b/include/linux/migrate.h
>>>>>> @@ -129,6 +129,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>>>>>>  #define MIGRATE_PFN_VALID	(1UL << 0)
>>>>>>  #define MIGRATE_PFN_MIGRATE	(1UL << 1)
>>>>>>  #define MIGRATE_PFN_WRITE	(1UL << 3)
>>>>>> +#define MIGRATE_PFN_COMPOUND	(1UL << 4)
>>>>>>  #define MIGRATE_PFN_SHIFT	6
>>>>>>  
>>>>>>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
>>>>>> @@ -147,6 +148,7 @@ enum migrate_vma_direction {
>>>>>>  	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
>>>>>>  	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
>>>>>>  	MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
>>>>>> +	MIGRATE_VMA_SELECT_COMPOUND = 1 << 3,
>>>>>>  };
>>>>>>  
>>>>>>  struct migrate_vma {
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 0ed337f94fcd..6621bba62710 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>>  #include <linux/pagewalk.h>
>>>>>>  #include <linux/rmap.h>
>>>>>>  #include <linux/swapops.h>
>>>>>> +#include <asm/pgalloc.h>
>>>>>>  #include <asm/tlbflush.h>
>>>>>>  #include "internal.h"
>>>>>>  
>>>>>> @@ -44,6 +45,23 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>>>  	if (!vma_is_anonymous(walk->vma))
>>>>>>  		return migrate_vma_collect_skip(start, end, walk);
>>>>>>  
>>>>>> +	if (thp_migration_supported() &&
>>>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>>>> +		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
>>>>>> +						MIGRATE_PFN_COMPOUND;
>>>>>> +		migrate->dst[migrate->npages] = 0;
>>>>>> +		migrate->npages++;
>>>>>> +		migrate->cpages++;
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * Collect the remaining entries as holes, in case we
>>>>>> +		 * need to split later
>>>>>> +		 */
>>>>>> +		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>>>> +	}
>>>>>> +
>>>>>>  	for (addr = start; addr < end; addr += PAGE_SIZE) {
>>>>>>  		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
>>>>>>  		migrate->dst[migrate->npages] = 0;
>>>>>> @@ -54,57 +72,151 @@ static int migrate_vma_collect_hole(unsigned long start,
>>>>>>  	return 0;
>>>>>>  }
>>>>>>  
>>>>>> -static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> -				   unsigned long start,
>>>>>> -				   unsigned long end,
>>>>>> -				   struct mm_walk *walk)
>>>>>> +/**
>>>>>> + * migrate_vma_collect_huge_pmd - collect THP pages without splitting the
>>>>>> + * folio for device private pages.
>>>>>> + * @pmdp: pointer to pmd entry
>>>>>> + * @start: start address of the range for migration
>>>>>> + * @end: end address of the range for migration
>>>>>> + * @walk: mm_walk callback structure
>>>>>> + *
>>>>>> + * Collect the huge pmd entry at @pmdp for migration and set the
>>>>>> + * MIGRATE_PFN_COMPOUND flag in the migrate src entry to indicate that
>>>>>> + * migration will occur at HPAGE_PMD granularity
>>>>>> + */
>>>>>> +static int migrate_vma_collect_huge_pmd(pmd_t *pmdp, unsigned long start,
>>>>>> +					unsigned long end, struct mm_walk *walk,
>>>>>> +					struct folio *fault_folio)
>>>>>>  {
>>>>>> +	struct mm_struct *mm = walk->mm;
>>>>>> +	struct folio *folio;
>>>>>>  	struct migrate_vma *migrate = walk->private;
>>>>>> -	struct folio *fault_folio = migrate->fault_page ?
>>>>>> -		page_folio(migrate->fault_page) : NULL;
>>>>>> -	struct vm_area_struct *vma = walk->vma;
>>>>>> -	struct mm_struct *mm = vma->vm_mm;
>>>>>> -	unsigned long addr = start, unmapped = 0;
>>>>>>  	spinlock_t *ptl;
>>>>>> -	pte_t *ptep;
>>>>>> +	swp_entry_t entry;
>>>>>> +	int ret;
>>>>>> +	unsigned long write = 0;
>>>>>>  
>>>>>> -again:
>>>>>> -	if (pmd_none(*pmdp))
>>>>>> +	ptl = pmd_lock(mm, pmdp);
>>>>>> +	if (pmd_none(*pmdp)) {
>>>>>> +		spin_unlock(ptl);
>>>>>>  		return migrate_vma_collect_hole(start, end, -1, walk);
>>>>>> +	}
>>>>>>  
>>>>>>  	if (pmd_trans_huge(*pmdp)) {
>>>>>> -		struct folio *folio;
>>>>>> -
>>>>>> -		ptl = pmd_lock(mm, pmdp);
>>>>>> -		if (unlikely(!pmd_trans_huge(*pmdp))) {
>>>>>> +		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
>>>>>>  			spin_unlock(ptl);
>>>>>> -			goto again;
>>>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>>>  		}
>>>>>>  
>>>>>>  		folio = pmd_folio(*pmdp);
>>>>>>  		if (is_huge_zero_folio(folio)) {
>>>>>>  			spin_unlock(ptl);
>>>>>> -			split_huge_pmd(vma, pmdp, addr);
>>>>>> -		} else {
>>>>>> -			int ret;
>>>>>> +			return migrate_vma_collect_hole(start, end, -1, walk);
>>>>>> +		}
>>>>>> +		if (pmd_write(*pmdp))
>>>>>> +			write = MIGRATE_PFN_WRITE;
>>>>>> +	} else if (!pmd_present(*pmdp)) {
>>>>>> +		entry = pmd_to_swp_entry(*pmdp);
>>>>>> +		folio = pfn_swap_entry_folio(entry);
>>>>>> +
>>>>>> +		if (!is_device_private_entry(entry) ||
>>>>>> +			!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>>>>>> +			(folio->pgmap->owner != migrate->pgmap_owner)) {
>>>>>> +			spin_unlock(ptl);
>>>>>> +			return migrate_vma_collect_skip(start, end, walk);
>>>>>> +		}
>>>>>>  
>>>>>> -			folio_get(folio);
>>>>>> +		if (is_migration_entry(entry)) {
>>>>>> +			migration_entry_wait_on_locked(entry, ptl);
>>>>>>  			spin_unlock(ptl);
>>>>>> -			/* FIXME: we don't expect THP for fault_folio */
>>>>>> -			if (WARN_ON_ONCE(fault_folio == folio))
>>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>>> -								walk);
>>>>>> -			if (unlikely(!folio_trylock(folio)))
>>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>>> -								walk);
>>>>>> -			ret = split_folio(folio);
>>>>>> -			if (fault_folio != folio)
>>>>>> -				folio_unlock(folio);
>>>>>> -			folio_put(folio);
>>>>>> -			if (ret)
>>>>>> -				return migrate_vma_collect_skip(start, end,
>>>>>> -								walk);
>>>>>> +			return -EAGAIN;
>>>>>>  		}
>>>>>> +
>>>>>> +		if (is_writable_device_private_entry(entry))
>>>>>> +			write = MIGRATE_PFN_WRITE;
>>>>>> +	} else {
>>>>>> +		spin_unlock(ptl);
>>>>>> +		return -EAGAIN;
>>>>>> +	}
>>>>>> +
>>>>>> +	folio_get(folio);
>>>>>> +	if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
>>>>>> +		spin_unlock(ptl);
>>>>>> +		folio_put(folio);
>>>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>>>> +	}
>>>>>> +
>>>>>> +	if (thp_migration_supported() &&
>>>>>> +		(migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
>>>>>> +		(IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
>>>>>> +		 IS_ALIGNED(end, HPAGE_PMD_SIZE))) {
>>>>>> +
>>>>>> +		struct page_vma_mapped_walk pvmw = {
>>>>>> +			.ptl = ptl,
>>>>>> +			.address = start,
>>>>>> +			.pmd = pmdp,
>>>>>> +			.vma = walk->vma,
>>>>>> +		};
>>>>>> +
>>>>>> +		unsigned long pfn = page_to_pfn(folio_page(folio, 0));
>>>>>> +
>>>>>> +		migrate->src[migrate->npages] = migrate_pfn(pfn) | write
>>>>>> +						| MIGRATE_PFN_MIGRATE
>>>>>> +						| MIGRATE_PFN_COMPOUND;
>>>>>> +		migrate->dst[migrate->npages++] = 0;
>>>>>> +		migrate->cpages++;
>>>>>> +		ret = set_pmd_migration_entry(&pvmw, folio_page(folio, 0));
>>>>>> +		if (ret) {
>>>>>> +			migrate->npages--;
>>>>>> +			migrate->cpages--;
>>>>>> +			migrate->src[migrate->npages] = 0;
>>>>>> +			migrate->dst[migrate->npages] = 0;
>>>>>> +			goto fallback;
>>>>>> +		}
>>>>>> +		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
>>>>>> +		spin_unlock(ptl);
>>>>>> +		return 0;
>>>>>> +	}
>>>>>> +
>>>>>> +fallback:
>>>>>> +	spin_unlock(ptl);
>>>>>> +	if (!folio_test_large(folio))
>>>>>> +		goto done;
>>>>>> +	ret = split_folio(folio);
>>>>>> +	if (fault_folio != folio)
>>>>>> +		folio_unlock(folio);
>>>>>> +	folio_put(folio);
>>>>>> +	if (ret)
>>>>>> +		return migrate_vma_collect_skip(start, end, walk);
>>>>>> +	if (pmd_none(pmdp_get_lockless(pmdp)))
>>>>>> +		return migrate_vma_collect_hole(start, end, -1, walk);
>>>>>> +
>>>>>> +done:
>>>>>> +	return -ENOENT;
>>>>>> +}
>>>>>> +
>>>>>> +static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>> +				   unsigned long start,
>>>>>> +				   unsigned long end,
>>>>>> +				   struct mm_walk *walk)
>>>>>> +{
>>>>>> +	struct migrate_vma *migrate = walk->private;
>>>>>> +	struct vm_area_struct *vma = walk->vma;
>>>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>>>> +	unsigned long addr = start, unmapped = 0;
>>>>>> +	spinlock_t *ptl;
>>>>>> +	struct folio *fault_folio = migrate->fault_page ?
>>>>>> +		page_folio(migrate->fault_page) : NULL;
>>>>>> +	pte_t *ptep;
>>>>>> +
>>>>>> +again:
>>>>>> +	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
>>>>>> +		int ret = migrate_vma_collect_huge_pmd(pmdp, start, end, walk, fault_folio);
>>>>>> +
>>>>>> +		if (ret == -EAGAIN)
>>>>>> +			goto again;
>>>>>> +		if (ret == 0)
>>>>>> +			return 0;
>>>>>>  	}
>>>>>>  
>>>>>>  	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> @@ -222,8 +334,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
>>>>>>  		}
>>>>>>  
>>>>>> -		/* FIXME support THP */
>>>>>> -		if (!page || !page->mapping || PageTransCompound(page)) {
>>>>>> +		if (!page || !page->mapping) {
>>>>>>  			mpfn = 0;
>>>>>>  			goto next;
>>>>>>  		}
>>>>>> @@ -394,14 +505,6 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
>>>>>>  	 */
>>>>>>  	int extra = 1 + (page == fault_page);
>>>>>>  
>>>>>> -	/*
>>>>>> -	 * FIXME support THP (transparent huge page), it is bit more complex to
>>>>>> -	 * check them than regular pages, because they can be mapped with a pmd
>>>>>> -	 * or with a pte (split pte mapping).
>>>>>> -	 */
>>>>>> -	if (folio_test_large(folio))
>>>>>> -		return false;
>>>>>> -
>>>>> You cannot remove this check unless you support migrating normal mTHP folios to the device,
>>>>> which I think this series doesn't do, but maybe it should?
>>>>>
>>>> mTHP needs to be a follow-up series; there are comments on this in the cover letter. I had an
>>>> assert earlier to prevent other folio sizes, but I've removed it, and the interface
>>>> to drivers does not allow for mTHP device private folios.
>>>>
>>>>
>>> pte-mapped device private THPs of other sizes can also be created as a result of pmd and folio splits.
>>> You should handle them consistently in one place, not in each driver.
>>> As pointed out by Matthew, there's a problem with the fault_page if it is not split and just ignored.
>>>
>>
>> I've not run into this in my testing; let me try with more mTHP sizes enabled. I'll wait for Matthew
>> to post his test case or any results and issues seen.
>>
> 
> I’ve hit this. In the code I shared privately, I split THPs in the
> page-collection path. You omitted that in v2 and v3; I believe you’ll
> need those changes. The code I'm referring to had the below comment.
> 
>  416         /*
>  417          * XXX: No clean way to support higher-order folios that don't match PMD
>  418          * boundaries for now — split them instead. Once mTHP support lands, add
>  419          * proper support for this case.
>  420          *
>  421          * The test, which exposed this as problematic, remapped (memremap) a
>  422          * large folio to an unaligned address, resulting in the folio being
>  423          * found in the middle of the PTEs. The requested number of pages was
>  424          * less than the folio size. Likely to be handled gracefully by upper
>  425          * layers eventually, but not yet.
>  426          */
> 
> I triggered it by doing some odd mremap operations, which caused the CPU
> page-fault handler to spin indefinitely iirc. In that case, a large device
> folio had been moved into the middle of a PMD.
> 
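
Splitting in the collection path along the lines described above might look
roughly like the following (a sketch only, modeled on the fallback path in
this patch; migrate_vma_split_unaligned_folio() is a hypothetical name, and
the caller must not hold the page table lock):

	static int migrate_vma_split_unaligned_folio(struct folio *folio,
						     struct folio *fault_folio)
	{
		int ret;

		/* Mirror the patch's fallback: pin, trylock, split, release. */
		folio_get(folio);
		if (folio != fault_folio && unlikely(!folio_trylock(folio))) {
			folio_put(folio);
			return -EBUSY;
		}
		ret = split_folio(folio);
		if (folio != fault_folio)
			folio_unlock(folio);
		folio_put(folio);
		return ret;
	}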

That is interesting. In general, migrate_vma_pages() is protected by address alignment
checks and PMD checks; for the case you describe, even after the folio ends up
in the middle of the PMD, it should get split in migrate_vma_setup() if the PMD
is split.
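
For reference, the checks gating compound collection in this patch boil down
to roughly the following (condensed from migrate_vma_collect_hole() and
migrate_vma_collect_huge_pmd() above, not new code):

	if (thp_migration_supported() &&
	    (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
	    IS_ALIGNED(start, HPAGE_PMD_SIZE) &&
	    IS_ALIGNED(end, HPAGE_PMD_SIZE)) {
		/* collect as a single MIGRATE_PFN_COMPOUND entry */
	} else {
		/* fall back to PAGE_SIZE entries, splitting the folio if needed */
	}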

> Upstream could see the same problem if the device fault handler enforces
> a must-migrate-to-device policy and mremap moves a large CPU folio into
> the middle of a PMD.
> 
> I’m in the middle of other work; when I circle back, I’ll try to create
> a selftest to reproduce this. My current test is a fairly convoluted IGT
> with a bunch of threads doing remap nonsense, but I’ll try to distill it
> into a concise selftest.
> 

Looking forward to your tests and feedback based on the integration of this
series with your work.

Thanks,
Balbir
