Message-ID: <2b0131cf-d066-44ba-96d9-a611448cbaf9@redhat.com>
Date: Wed, 31 Jul 2024 18:33:52 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
James Houghton <jthoughton@...gle.com>, stable@...r.kernel.org,
Oscar Salvador <osalvador@...e.de>, Muchun Song <muchun.song@...ux.dev>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Michael Ellerman <mpe@...erman.id.au>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Nicholas Piggin <npiggin@...il.com>
Subject: Re: [PATCH v3] mm/hugetlb: fix hugetlb vs. core-mm PT locking
On 31.07.24 16:54, Peter Xu wrote:
> On Wed, Jul 31, 2024 at 02:21:03PM +0200, David Hildenbrand wrote:
>> We recently made GUP's common page table walking code also walk hugetlb
>> VMAs without most hugetlb special-casing, preparing for a future with
>> less hugetlb-specific page table walking code in the codebase.
>> Turns out that we missed one page table locking detail: page table locking
>> for hugetlb folios that are not mapped using a single PMD/PUD.
>>
>> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the
>> page tables, will perform a pte_offset_map_lock() to grab the PTE table
>> lock.
>>
>> However, hugetlb that concurrently modifies these page tables would
>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>> locks would differ. Something similar can happen right now with hugetlb
>> folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>
>> This issue can be reproduced [1], for example triggering:
>>
>> [ 3105.936100] ------------[ cut here ]------------
>> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
>> [ 3105.944634] Modules linked in: [...]
>> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
>> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
>> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
>> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
>> [ 3105.996986] sp : ffff80008eafb8f0
>> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
>> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
>> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
>> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
>> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
>> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
>> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
>> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
>> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
>> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
>> [ 3106.047957] Call trace:
>> [ 3106.049522] try_grab_folio+0x11c/0x188
>> [ 3106.051996] follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
>> [ 3106.055527] follow_page_mask+0x1a0/0x2b8
>> [ 3106.058118] __get_user_pages+0xf0/0x348
>> [ 3106.060647] faultin_page_range+0xb0/0x360
>> [ 3106.063651] do_madvise+0x340/0x598
>>
>> Let's make huge_pte_lockptr() effectively use the same PT locks as any
>> core-mm page table walker would. Add ptep_lockptr() to obtain the PTE
>> page table lock using a pte pointer -- unfortunately we cannot convert
>> pte_lockptr() because virt_to_page() doesn't work with kmap'ed page
>> tables we can have with CONFIG_HIGHPTE.
>>
>> Take care of PTE tables possibly spanning multiple pages, and take care of
>> CONFIG_PGTABLE_LEVELS complexity when e.g., PMD_SIZE == PUD_SIZE. For
>> example, with CONFIG_PGTABLE_LEVELS == 2 and hugepagesize == PMD_SIZE,
>> core-mm would detect pmd_leaf() and use pmd_lockptr(), which
>> would end up just mapping to the per-MM PT lock.
>>
>> There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
>> folio being mapped using two PTE page tables. While hugetlb wants to take
>> the PMD table lock, core-mm would grab the PTE table lock of one of the two
>> PTE page tables. In such corner cases, we have to make sure that both
>> locks match, which is (fortunately!) currently guaranteed for 8xx as it
>> does not support SMP and consequently doesn't use split PT locks.
>>
>> [1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/
>>
>> Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
>> Reviewed-by: James Houghton <jthoughton@...gle.com>
>> Cc: <stable@...r.kernel.org>
>> Cc: Peter Xu <peterx@...hat.com>
>> Cc: Oscar Salvador <osalvador@...e.de>
>> Cc: Muchun Song <muchun.song@...ux.dev>
>> Cc: Baolin Wang <baolin.wang@...ux.alibaba.com>
>> Signed-off-by: David Hildenbrand <david@...hat.com>
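
The locking mismatch described in the quoted patch text can be modelled in user space. This is only a minimal sketch: `struct lock`, `struct mm_model`, `struct pte_table_model` and both helper functions are hypothetical stand-ins, not kernel types or APIs, and the `split` flag stands in for USE_SPLIT_PTE_PTLOCKS.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are real kernel types. */
struct lock { int id; };
struct mm_model { struct lock page_table_lock; };
struct pte_table_model { struct lock ptl; };

/*
 * Core-mm style: with split PTE locks every PTE table carries its own
 * lock, otherwise the per-MM lock is used.
 */
static struct lock *core_pte_lock(struct mm_model *mm,
				  struct pte_table_model *pt, bool split)
{
	return split ? &pt->ptl : &mm->page_table_lock;
}

/*
 * Pre-fix hugetlb style for folios smaller than PMD_SIZE, as described
 * in the patch text: always the per-MM lock.
 */
static struct lock *old_hugetlb_lock(struct mm_model *mm)
{
	return &mm->page_table_lock;
}
```

In this model the two helpers return the same pointer when `split` is false and different pointers when it is true, which is the case where the two walkers no longer exclude each other.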
>
> Nitpick: I wonder whether some of the lines can be simplified if we write
> it downwards from PUD, like,
>
> huge_pte_lockptr()
> {
> if (size >= PUD_SIZE)
> return pud_lockptr(...);
> if (size >= PMD_SIZE)
> return pmd_lockptr(...);
> /* Sub-PMD only applies to !CONFIG_HIGHPTE, see pte_alloc_huge() */
> WARN_ON(IS_ENABLED(CONFIG_HIGHPTE));
> return ptep_lockptr(...);
> }

Let me think about it. For PUD_SIZE == PMD_SIZE, instead of calling
pmd_lockptr() like core-mm does, we'd call pud_lockptr().

Likely it would work because we default in most cases to the per-MM lock:

arch/x86/Kconfig:       select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)

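
To make the PUD_SIZE == PMD_SIZE difference concrete, here is a rough user-space model of the two selection orders. The enum and both function names are made up for illustration, and the sizes are passed in as plain parameters rather than taken from kernel macros.

```c
/* Hypothetical labels for which lock helper would be picked. */
enum pt_lock_kind { LOCK_PTE, LOCK_PMD, LOCK_PUD };

/* Top-down order as in the suggestion: largest size is checked first. */
static enum pt_lock_kind pick_top_down(unsigned long size,
				       unsigned long pmd_size,
				       unsigned long pud_size)
{
	if (size >= pud_size)
		return LOCK_PUD;
	if (size >= pmd_size)
		return LOCK_PMD;
	return LOCK_PTE;
}

/*
 * Model of what core-mm would pick: with folded levels
 * (pmd_size == pud_size) a leaf is seen as a PMD leaf.
 */
static enum pt_lock_kind pick_like_core_mm(unsigned long size,
					   unsigned long pmd_size,
					   unsigned long pud_size)
{
	if (size < pmd_size)
		return LOCK_PTE;
	if (pmd_size == pud_size || size < pud_size)
		return LOCK_PMD;
	return LOCK_PUD;
}
```

With distinct sizes (say 2 MiB and 1 GiB) both orders agree. With folded levels they name different helpers for the same size, which only stays harmless as long as both helpers resolve to the same lock.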
>
> The ">=" checks should avoid checking over pgtable level, iiuc.
>
> The other nitpick is, I didn't yet find any arch that uses non-zero order
> pages for pte pgtables. I would give it a shot with dropping the mask thing
> then see what explodes (I don't expect anything to, per my read..), but yeah
> I understand we saw some already due to other things, so I think it's fine
> if in this hugetlb path (that we're removing) we do a bit more math, if you
> think that's easier for you.

I threw

	BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);

into pte_lockptr() and did a bunch of cross-compiles.

And for some reason it blows up for powernv (powernv_defconfig) and
pseries (pseries_defconfig).
In function 'pte_lockptr',
inlined from 'pte_offset_map_nolock' at mm/pgtable-generic.c:316:11:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_291' declared with attribute error: BUILD_BUG_ON failed: PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE
510 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
././include/linux/compiler_types.h:491:25: note: in definition of macro '__compiletime_assert'
491 | prefix ## suffix(); \
| ^~~~~~
././include/linux/compiler_types.h:510:9: note: in expansion of macro '_compiletime_assert'
510 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
| ^~~~~~~~~~~~~~~~
./include/linux/mm.h:2926:9: note: in expansion of macro 'BUILD_BUG_ON'
2926 | BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
| ^~~~~~~~~~~~
In function 'pte_lockptr',
inlined from '__pte_offset_map_lock' at mm/pgtable-generic.c:374:8:
././include/linux/compiler_types.h:510:45: error: call to '__compiletime_assert_291' declared with attribute error: BUILD_BUG_ON failed: PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE
510 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
././include/linux/compiler_types.h:491:25: note: in definition of macro '__compiletime_assert'
491 | prefix ## suffix(); \
| ^~~~~~
././include/linux/compiler_types.h:510:9: note: in expansion of macro '_compiletime_assert'
510 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
| ^~~~~~~~~~~~~~~~
./include/linux/mm.h:2926:9: note: in expansion of macro 'BUILD_BUG_ON'
2926 | BUILD_BUG_ON(PTRS_PER_PTE * sizeof(pte_t) > PAGE_SIZE);
| ^~~~~~~~~~~~
pte_alloc_one() ends up calling pte_fragment_alloc(mm, 0). But there we always
end up calling pagetable_alloc(, 0).
And fragments are supposed to be <= a single page.
Now I'm confused about what's wrong here ... am I missing something obvious?
CCing some powerpc folks. Is this some pte_t oddity?
But in mm_inc_nr_ptes/mm_dec_nr_ptes we use the exact same calculation :/
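
One general property worth keeping in mind when comparing the two uses: the same expression can be fine as a run-time computation yet unusable in a compile-time assertion if any operand is not a compile-time constant. The sketch below shows only that general C behaviour. All names and values are hypothetical (4 KiB pages, 8-byte entries), and whether this has anything to do with the powernv/pseries failure is not established here.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_model_t;		/* hypothetical stand-in for pte_t */
#define MODEL_PAGE_SIZE 4096UL		/* assumed 4 KiB base pages */

/* Case 1: entries per table is a compile-time constant, so this builds. */
#define MODEL_PTRS_PER_PTE 512UL
_Static_assert(MODEL_PTRS_PER_PTE * sizeof(pte_model_t) <= MODEL_PAGE_SIZE,
	       "PTE table must fit into a single page");

/* Case 2: entries per table only known at run time (hypothetical). */
static unsigned long model_pte_index_size = 9;

/* Fine at run time; the same expression could not go into an assert above. */
static size_t model_pte_table_bytes(void)
{
	return ((size_t)1 << model_pte_index_size) * sizeof(pte_model_t);
}
```

A run-time user such as an accounting helper only needs the value when it executes, whereas a build-time check needs the compiler to be able to evaluate it.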
--
Cheers,
David / dhildenb