[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6963662f-eade-405f-83dc-6fdd05e202f3@nvidia.com>
Date: Tue, 8 Jul 2025 14:59:11 +1000
From: Balbir Singh <balbirs@...dia.com>
To: Alistair Popple <apopple@...dia.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, Karol Herbst <kherbst@...hat.com>,
Lyude Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Jérôme Glisse <jglisse@...hat.com>,
Shuah Khan <shuah@...nel.org>, David Hildenbrand <david@...hat.com>,
Barry Song <baohua@...nel.org>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Ryan Roberts <ryan.roberts@....com>, Matthew Wilcox <willy@...radead.org>,
Peter Xu <peterx@...hat.com>, Zi Yan <ziy@...dia.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>, Jane Chu <jane.chu@...cle.com>,
Donet Tom <donettom@...ux.ibm.com>
Subject: Re: [v1 resend 03/12] mm/thp: zone_device awareness in THP handling
code
On 7/7/25 16:07, Alistair Popple wrote:
> On Fri, Jul 04, 2025 at 09:35:02AM +1000, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages
>> aware of zone device pages. Although the code is
>> designed to be generic when it comes to handling splitting
>> of pages, the code is designed to work for THP page sizes
>> corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone
>> device huge entry is present, enabling try_to_migrate()
>> and other code migration paths to appropriately process the
>> entry
>>
>> pmd_pfn() does not work well with zone device entries, use
>> pfn_pmd_entry_to_swap() for checking and comparison as for
>> zone device entries.
>>
>> try_to_map_to_unused_zeropage() does not apply to zone device
>> entries, zone device entries are ignored in the call.
>>
>> Cc: Karol Herbst <kherbst@...hat.com>
>> Cc: Lyude Paul <lyude@...hat.com>
>> Cc: Danilo Krummrich <dakr@...nel.org>
>> Cc: David Airlie <airlied@...il.com>
>> Cc: Simona Vetter <simona@...ll.ch>
>> Cc: "Jérôme Glisse" <jglisse@...hat.com>
>> Cc: Shuah Khan <shuah@...nel.org>
>> Cc: David Hildenbrand <david@...hat.com>
>> Cc: Barry Song <baohua@...nel.org>
>> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@....com>
>> Cc: Matthew Wilcox <willy@...radead.org>
>> Cc: Peter Xu <peterx@...hat.com>
>> Cc: Zi Yan <ziy@...dia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@...wei.com>
>> Cc: Jane Chu <jane.chu@...cle.com>
>> Cc: Alistair Popple <apopple@...dia.com>
>> Cc: Donet Tom <donettom@...ux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@...dia.com>
>> ---
>> mm/huge_memory.c | 153 +++++++++++++++++++++++++++++++------------
>> mm/migrate.c | 2 +
>> mm/page_vma_mapped.c | 10 +++
>> mm/pgtable-generic.c | 6 ++
>> mm/rmap.c | 19 +++++-
>> 5 files changed, 146 insertions(+), 44 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ce130225a8e5..e6e390d0308f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1711,7 +1711,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> if (unlikely(is_swap_pmd(pmd))) {
>> swp_entry_t entry = pmd_to_swp_entry(pmd);
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(pmd));
>> + VM_BUG_ON(!is_pmd_migration_entry(pmd) &&
>> + !is_device_private_entry(entry));
>> if (!is_readable_migration_entry(entry)) {
>> entry = make_readable_migration_entry(
>> swp_offset(entry));
>> @@ -2222,10 +2223,17 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> } else if (thp_migration_supported()) {
>> swp_entry_t entry;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>> entry = pmd_to_swp_entry(orig_pmd);
>> folio = pfn_swap_entry_folio(entry);
>> flush_needed = 0;
>> +
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> + !folio_is_device_private(folio));
>> +
>> + if (folio_is_device_private(folio)) {
>> + folio_remove_rmap_pmd(folio, folio_page(folio, 0), vma);
>> + WARN_ON_ONCE(folio_mapcount(folio) < 0);
>> + }
>> } else
>> WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
>>
>> @@ -2247,6 +2255,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> folio_mark_accessed(folio);
>> }
>>
>> + /*
>> + * Do a folio put on zone device private pages after
>> + * changes to mm_counter, because the folio_put() will
>> + * clean folio->mapping and the folio_test_anon() check
>> + * will not be usable.
>> + */
>> + if (folio_is_device_private(folio))
>> + folio_put(folio);
>> +
>> spin_unlock(ptl);
>> if (flush_needed)
>> tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
>> @@ -2375,7 +2392,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> struct folio *folio = pfn_swap_entry_folio(entry);
>> pmd_t newpmd;
>>
>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd));
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) &&
>> + !folio_is_device_private(folio));
>> if (is_writable_migration_entry(entry)) {
>> /*
>> * A protection check is difficult so
>> @@ -2388,9 +2406,11 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> newpmd = swp_entry_to_pmd(entry);
>> if (pmd_swp_soft_dirty(*pmd))
>> newpmd = pmd_swp_mksoft_dirty(newpmd);
>> - } else {
>> + } else if (is_writable_device_private_entry(entry)) {
>> + newpmd = swp_entry_to_pmd(entry);
>> + entry = make_device_exclusive_entry(swp_offset(entry));
>
> Argh. The naming here is terribly confusing (of which I'm at least partly to
> blame) because it ended up clashing with David's PG_anon_exclusive which is a
> completely different concept - see 6c287605fd56 ("mm: remember exclusively
> mapped anonymous pages with PG_anon_exclusive").
>
> The exclusive entries you are creating here are for emulating atomic access -
> see the documentation for make_device_exclusive() for more details - and are
> almost certainly not what you want.
>
> As far as I understand things we don't need to create anon exclusive entries for
> device private pages because they can never be pinned, so likely you just want
> make_readable_device_private_entry() here. If we really want to track anon
> exclusive status you probably need pte_swp_exclusive(), but then we should do it
> for non-THP device private pages as well and that sounds like a whole different
> problem/patch series.
>
Thanks for catching this, I agree we don't need to track anon exclusive for device
private pages.
>> + } else
>> newpmd = *pmd;
>> - }
>>
>> if (uffd_wp)
>> newpmd = pmd_swp_mkuffd_wp(newpmd);
>> @@ -2842,16 +2862,20 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> struct page *page;
>> pgtable_t pgtable;
>> pmd_t old_pmd, _pmd;
>> - bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>> - bool anon_exclusive = false, dirty = false;
>> + bool young, write, soft_dirty, uffd_wp = false;
>> + bool anon_exclusive = false, dirty = false, present = false;
>> unsigned long addr;
>> pte_t *pte;
>> int i;
>> + swp_entry_t swp_entry;
>>
>> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
>> VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
>> VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
>> - VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
>> +
>> + VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
>> + && !(is_swap_pmd(*pmd) &&
>> + is_device_private_entry(pmd_to_swp_entry(*pmd))));
>>
>> count_vm_event(THP_SPLIT_PMD);
>>
>> @@ -2899,20 +2923,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> return __split_huge_zero_page_pmd(vma, haddr, pmd);
>> }
>>
>> - pmd_migration = is_pmd_migration_entry(*pmd);
>> - if (unlikely(pmd_migration)) {
>> - swp_entry_t entry;
>>
>> + present = pmd_present(*pmd);
>> + if (unlikely(!present)) {
>> + swp_entry = pmd_to_swp_entry(*pmd);
>> old_pmd = *pmd;
>> - entry = pmd_to_swp_entry(old_pmd);
>> - page = pfn_swap_entry_to_page(entry);
>> - write = is_writable_migration_entry(entry);
>> +
>> + folio = pfn_swap_entry_folio(swp_entry);
>> + VM_BUG_ON(!is_migration_entry(swp_entry) &&
>> + !is_device_private_entry(swp_entry));
>> + page = pfn_swap_entry_to_page(swp_entry);
>> + write = is_writable_migration_entry(swp_entry);
>> +
>> if (PageAnon(page))
>> - anon_exclusive = is_readable_exclusive_migration_entry(entry);
>> - young = is_migration_entry_young(entry);
>> - dirty = is_migration_entry_dirty(entry);
>> + anon_exclusive =
>> + is_readable_exclusive_migration_entry(swp_entry);
>> soft_dirty = pmd_swp_soft_dirty(old_pmd);
>> uffd_wp = pmd_swp_uffd_wp(old_pmd);
>> + young = is_migration_entry_young(swp_entry);
>> + dirty = is_migration_entry_dirty(swp_entry);
>
> This could be a device-private swp_entry right? In which case calling
> is_migration_entry_*() on them isn't correct. I suspect you want to have
> separate code paths for migration vs. device_private entries here.
>
Yep, I will split them up based on the entry type
>> } else {
>> /*
>> * Up to this point the pmd is present and huge and userland has
>> @@ -2996,30 +3025,45 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> * Note that NUMA hinting access restrictions are not transferred to
>> * avoid any possibility of altering permissions across VMAs.
>> */
>> - if (freeze || pmd_migration) {
>> + if (freeze || !present) {
>> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
>> pte_t entry;
>> - swp_entry_t swp_entry;
>> -
>> - if (write)
>> - swp_entry = make_writable_migration_entry(
>> - page_to_pfn(page + i));
>> - else if (anon_exclusive)
>> - swp_entry = make_readable_exclusive_migration_entry(
>> - page_to_pfn(page + i));
>> - else
>> - swp_entry = make_readable_migration_entry(
>> - page_to_pfn(page + i));
>> - if (young)
>> - swp_entry = make_migration_entry_young(swp_entry);
>> - if (dirty)
>> - swp_entry = make_migration_entry_dirty(swp_entry);
>> - entry = swp_entry_to_pte(swp_entry);
>> - if (soft_dirty)
>> - entry = pte_swp_mksoft_dirty(entry);
>> - if (uffd_wp)
>> - entry = pte_swp_mkuffd_wp(entry);
>> -
>> + if (freeze || is_migration_entry(swp_entry)) {
>> + if (write)
>> + swp_entry = make_writable_migration_entry(
>> + page_to_pfn(page + i));
>> + else if (anon_exclusive)
>> + swp_entry = make_readable_exclusive_migration_entry(
>> + page_to_pfn(page + i));
>> + else
>> + swp_entry = make_readable_migration_entry(
>> + page_to_pfn(page + i));
>> + if (young)
>> + swp_entry = make_migration_entry_young(swp_entry);
>> + if (dirty)
>> + swp_entry = make_migration_entry_dirty(swp_entry);
>> + entry = swp_entry_to_pte(swp_entry);
>> + if (soft_dirty)
>> + entry = pte_swp_mksoft_dirty(entry);
>> + if (uffd_wp)
>> + entry = pte_swp_mkuffd_wp(entry);
>> + } else {
>> + VM_BUG_ON(!is_device_private_entry(swp_entry));
>> + if (write)
>> + swp_entry = make_writable_device_private_entry(
>> + page_to_pfn(page + i));
>> + else if (anon_exclusive)
>> + swp_entry = make_device_exclusive_entry(
>
> As above, you're welcome for the naming :-)
>
:) We don't need to track these, I'll fix up the patch
>> + page_to_pfn(page + i));
>> + else
>> + swp_entry = make_readable_device_private_entry(
>> + page_to_pfn(page + i));
>> + entry = swp_entry_to_pte(swp_entry);
>> + if (soft_dirty)
>> + entry = pte_swp_mksoft_dirty(entry);
>> + if (uffd_wp)
>> + entry = pte_swp_mkuffd_wp(entry);
>> + }
>> VM_WARN_ON(!pte_none(ptep_get(pte + i)));
>> set_pte_at(mm, addr, pte + i, entry);
>> }
>> @@ -3046,7 +3090,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> }
>> pte_unmap(pte);
>>
>> - if (!pmd_migration)
>> + if (present)
>> folio_remove_rmap_pmd(folio, page, vma);
>> if (freeze)
>> put_page(page);
>> @@ -3058,8 +3102,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>> void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
>> pmd_t *pmd, bool freeze)
>> {
>> +
>> VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
>> - if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
>> + if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
>> + (is_swap_pmd(*pmd) &&
>
> Should we create is_pmd_device_entry() to match is_pmd_migration_entry()?
>
Yes, I think that's reasonable, I'll look into it
>> + is_device_private_entry(pmd_to_swp_entry(*pmd))))
>> __split_huge_pmd_locked(vma, pmd, address, freeze);
>> }
>>
>> @@ -3238,6 +3285,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>> VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
>> lockdep_assert_held(&lruvec->lru_lock);
>>
>> + if (folio_is_device_private(folio))
>> + return;
>> +
>> if (list) {
>> /* page reclaim is reclaiming a huge page */
>> VM_WARN_ON(folio_test_lru(folio));
>> @@ -3252,6 +3302,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>> list_add_tail(&new_folio->lru, &folio->lru);
>> folio_set_lru(new_folio);
>> }
>> +
>> }
>>
>> /* Racy check whether the huge page can be split */
>> @@ -3543,6 +3594,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>> ((mapping || swap_cache) ?
>> folio_nr_pages(release) : 0));
>>
>> + if (folio_is_device_private(release))
>> + percpu_ref_get_many(&release->pgmap->ref,
>> + (1 << new_order) - 1);
>
> Is there a description somewhere for how we think pgmap->ref works for compound/
> higher-order device private pages? Not that it matters too much, I'd like to
> remove it. Maybe I can create a patch to do that which you can base on top of.
>
This bit is not needed, I'll be removing the above lines of code.
>> lru_add_split_folio(origin_folio, release, lruvec,
>> list);
>>
>> @@ -4596,7 +4651,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>> return 0;
>>
>> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>> - pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> + if (!folio_is_device_private(folio))
>> + pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>> + else
>> + pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
>
> Do we need to flush? A device private entry is already non-present so is the
> flush necessary?
>
We're clearing the entry as well, why do you think a flush is not required?
>>
>> /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
>> anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
>> @@ -4646,6 +4704,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>> entry = pmd_to_swp_entry(*pvmw->pmd);
>> folio_get(folio);
>> pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
>> +
>> + if (unlikely(folio_is_device_private(folio))) {
>> + if (pmd_write(pmde))
>> + entry = make_writable_device_private_entry(
>> + page_to_pfn(new));
>> + else
>> + entry = make_readable_device_private_entry(
>> + page_to_pfn(new));
>> + pmde = swp_entry_to_pmd(entry);
>> + }
>> +
>> if (pmd_swp_soft_dirty(*pvmw->pmd))
>> pmde = pmd_mksoft_dirty(pmde);
>> if (is_writable_migration_entry(entry))
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 767f503f0875..0b6ecf559b22 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -200,6 +200,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>>
>> if (PageCompound(page))
>> return false;
>> + if (folio_is_device_private(folio))
>> + return false;
>> VM_BUG_ON_PAGE(!PageAnon(page), page);
>> VM_BUG_ON_PAGE(!PageLocked(page), page);
>> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index e981a1a292d2..ff8254e52de5 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -277,6 +277,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>> * cannot return prematurely, while zap_huge_pmd() has
>> * cleared *pmd but not decremented compound_mapcount().
>> */
>> + swp_entry_t entry;
>> +
>> + if (!thp_migration_supported())
>> + return not_found(pvmw);
>> + entry = pmd_to_swp_entry(pmde);
>> + if (is_device_private_entry(entry)) {
>> + pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>> + return true;
>
> Do other callers of page_vma_mapped_walk() need to be updated now that large
> device private pages may be returned?
>
I think we probably need a new flag in the page walk code to return true for device
private entries at points where the callers/walkers care about tracking them.
>> + }
>> +
>> if ((pvmw->flags & PVMW_SYNC) &&
>> thp_vma_suitable_order(vma, pvmw->address,
>> PMD_ORDER) &&
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 567e2d084071..604e8206a2ec 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>> *pmdvalp = pmdval;
>> if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>> goto nomap;
>> + if (is_swap_pmd(pmdval)) {
>> + swp_entry_t entry = pmd_to_swp_entry(pmdval);
>> +
>> + if (is_device_private_entry(entry))
>> + goto nomap;
>> + }
>> if (unlikely(pmd_trans_huge(pmdval)))
>> goto nomap;
>> if (unlikely(pmd_bad(pmdval))) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index bd83724d14b6..da1e5b03e1fe 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2336,8 +2336,23 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>> break;
>> }
>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>> - subpage = folio_page(folio,
>> - pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
>> + /*
>> + * Zone device private folios do not work well with
>> + * pmd_pfn() on some architectures due to pte
>> + * inversion.
>> + */
>> + if (folio_is_device_private(folio)) {
>> + swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
>> + unsigned long pfn = swp_offset_pfn(entry);
>> +
>> + subpage = folio_page(folio, pfn
>> + - folio_pfn(folio));
>> + } else {
>> + subpage = folio_page(folio,
>> + pmd_pfn(*pvmw.pmd)
>> + - folio_pfn(folio));
>> + }
>> +
>> VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
>> !folio_test_pmd_mappable(folio), folio);
>>
>> --
>> 2.49.0
>>
Thanks for the detailed feedback
Balbir
Powered by blists - more mailing lists