linux-kernel - Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7b75c4db-9dbe-4ff1-b649-06a9218ae0aa@csgroup.eu>
Date: Wed, 10 Apr 2024 16:30:41 +0000
From: Christophe Leroy <christophe.leroy@...roup.eu>
To: Peter Xu <peterx@...hat.com>, Jason Gunthorpe <jgg@...dia.com>
CC: "linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linuxppc-dev@...ts.ozlabs.org"
	<linuxppc-dev@...ts.ozlabs.org>, Michael Ellerman <mpe@...erman.id.au>,
	Matthew Wilcox <willy@...radead.org>, Rik van Riel <riel@...riel.com>,
	Lorenzo Stoakes <lstoakes@...il.com>, Axel Rasmussen
	<axelrasmussen@...gle.com>, Yang Shi <shy828301@...il.com>, John Hubbard
	<jhubbard@...dia.com>, "linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "Kirill A . Shutemov"
	<kirill@...temov.name>, Andrew Jones <andrew.jones@...ux.dev>, Vlastimil
 Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, Andrew Morton
	<akpm@...ux-foundation.org>, Muchun Song <muchun.song@...ux.dev>, Christoph
 Hellwig <hch@...radead.org>, "linux-riscv@...ts.infradead.org"
	<linux-riscv@...ts.infradead.org>, James Houghton <jthoughton@...gle.com>,
	David Hildenbrand <david@...hat.com>, Andrea Arcangeli <aarcange@...hat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@...nel.org>, Mike Kravetz
	<mike.kravetz@...cle.com>
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2



Le 10/04/2024 à 17:28, Peter Xu a écrit :
> On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
>> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
>>> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
>>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
>>> not all.  I need to look at all the walkers and there can be some tricky
>>> ones, but I believe that applies in general.  It's actually similar to what
>>> I did with slow gup here.
>>
>> I think that is the big question, I also haven't done the research to
>> know the answer.
>>
>> At this point focusing on moving what is reasonable to the pXX_* API
>> makes sense to me. Then reviewing what remains and making some
>> decision.
>>
>>> Like this series, for cont-pXd we'll need multiple walks comparing to
>>> before (when with hugetlb_entry()), but for that part I'll provide some
>>> performance tests too, and we also have a fallback plan, which is to detect
>>> cont-pXd existance, which will also work for large folios.
>>
>> I think we can optimize this pretty easy.
>>   
>>>> I think if you do the easy places for pXX conversion you will have a
>>>> good idea about what is needed for the hard places.
>>>
>>> Here IMHO we don't need to understand "what is the size of this hugetlb
>>> vma"
>>
>> Yeh, I never really understood why hugetlb was linked to the VMA.. The
>> page table is self describing, obviously.
> 
> Attaching to vma still makes sense to me, where we should definitely avoid
> a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> allocated, managed, ...  totally differently.
> 
> And since hugetlb is designed as file-based (which also makes sense to me,
> at least for now), it's also natural that it's vma-attached.
> 
>>
>>> or "which level of pgtable does this hugetlb vma pages locate",
>>
>> Ditto
>>
>>> because we may not need that, e.g., when we only want to collect some smaps
>>> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
>>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
>>> hugetlb_entry removed), we may need extra check later to put things into
>>> the right bucket, but for the walker itself it doesn't necessarily need
>>> hugetlb_entry().
>>
>> Right, places may still need to know it is part of a huge VMA because we
>> have special stuff linked to that.
>>
>>>> But then again we come back to power and its big list of page sizes
>>>> and variety :( Looks like some there have huge sizes at the pgd level
>>>> at least.
>>>
>>> Yeah this is something I want to be super clear, because I may miss
>>> something: we don't have real pgd pages, right?  Powerpc doesn't even
>>> define p4d_leaf(), AFAICT.
>>
>> AFAICT it is because it hides it all in hugepd.
> 
> IMHO one thing we can benefit from such hugepd rework is, if we can squash
> all the hugepds like what Christophe does, then we push it one more layer
> down, and we have a good chance all things should just work.
> 
> So again my Power brain is close to zero, but now I'm referring to what
> Christophe shared in the other thread:
> 
> https://github.com/linuxppc/wiki/wiki/Huge-pages
> 
> Together with:
> 
> https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.leroy@csgroup.eu
> 
> Where it has:
> 
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
>          select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>          select ARCH_ENABLE_SPLIT_PMD_PTLOCK
>          select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>          select ARCH_SUPPORTS_HUGETLBFS
>          select ARCH_SUPPORTS_NUMA_BALANCING
>          select HAVE_MOVE_PMD
> @@ -290,6 +291,7 @@ config PPC_BOOK3S
>   config PPC_E500
>          select FSL_EMB_PERFMON
>          bool
> +       select ARCH_HAS_HUGEPD if HUGETLB_PAGE
>          select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
>          select PPC_SMP_MUXED_IPI
>          select PPC_DOORBELL
> 
> So I think it means we have three PowerPC systems that supports hugepd
> right now (besides the 8xx which Christophe is trying to drop support
> there), besides 8xx we still have book3s_64 and E500.
> 
> Let's check one by one:
> 
>    - book3s_64
> 
>      - hash
> 
>        - 64K: p4d is not used, largest pgsize pgd 16G @pud level.  It
>          means after squashing it'll be a bunch of cont-pmd, all good.
> 
>        - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
>          it'll be cont-pud. all good.
> 
>      - radix
> 
>        - 64K: largest 1G @pud, then cont-pmd after squashed. all good.
> 
>        - 4K: largest 1G @pud, then cont-pmd, all good.
> 
>    - e500 & 8xx
> 
>      - both of them use 2-level pgtables (pgd + pte), after squashed hugepd
>        @pgd level they become cont-pte. all good.

e500 has two modes: 32 bits and 64 bits.

For 32 bits:

8xx is the only one handling it through HW-assisted pagetable walk hence 
requiring a 2-level whatever the pagesize is.

On e500 it is all software so pages 2M and larger should be cont-PGD (by 
the way I'm a bit puzzled that on arches that have only 2 levels, ie PGD 
and PTE, the PGD entries are populated by a function called PMD_populate()).

Current situation for 8xx is illustrated here: 
https://github.com/linuxppc/wiki/wiki/Huge-pages#8xx

I also tried to better illustrate e500/32 here: 
https://github.com/linuxppc/wiki/wiki/Huge-pages#e500

For 64 bits:
We have PTE/PMD/PUD/PGD, no P4D

See arch/powerpc/include/asm/nohash/64/pgtable-4k.h


> 
> I think the trick here is there'll be no pgd leaves after hugepd squashing
> to lower levels, then since PowerPC seems to never have p4d, then all
> things fall into pud or lower.  We seem to be all good there?
> 
>>
>> If the goal is to purge hugepd then some of the options might turn out
>> to convert hugepd into huge p4d/pgd, as I understand it. It would be
>> nice to have certainty on this at least.
> 
> Right.  I hope the pmd/pud plan I proposed above can already work too with
> such ambicious goal too.  But review very welcomed from either you or
> Christophe.
> 
> PS: I think I'll also have a closer look at Christophe's series this week
> or next.
> 
>>
>> We have effectively three APIs to parse a single page table and
>> currently none of the APIs can return 100% of the data for power.
> 
> Thanks,
>