linux-kernel - Re: [LSF/MM TOPIC] Non standard size THP

Message-ID: <20190213134859.54tnrkzauj2mftn4@kshutemo-mobl1>
Date:   Wed, 13 Feb 2019 16:48:59 +0300
From:   "Kirill A. Shutemov" <kirill@...temov.name>
To:     Anshuman Khandual <anshuman.khandual@....com>
Cc:     lsf-pc@...ts.linux-foundation.org,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [LSF/MM TOPIC] Non standard size THP

On Wed, Feb 13, 2019 at 06:20:03PM +0530, Anshuman Khandual wrote:
> 
> 
> On 02/12/2019 02:03 PM, Kirill A. Shutemov wrote:
> > On Fri, Feb 08, 2019 at 07:43:57AM +0530, Anshuman Khandual wrote:
> >> Hello,
> >>
> >> THP is currently supported for
> >>
> >> - PMD level pages (anon and file)
> >> - PUD level pages (file - DAX file system)
> >>
> >> THP is a single entry mapping at standard page table levels (either PMD or PUD)
> >>
> >> But architectures like ARM64 support non-standard page table level huge pages
> >> with contiguous bits.
> >>
> >> - These are created as multiple entries at either PTE or PMD level
> >> - These multiple entries carry pages which are physically contiguous
> >> - A special PTE bit (PTE_CONT) is set in each entry to mark it as part of a contiguous range
> >>
> >> These multiple contiguous entries create a huge page size which is different
> >> from the standard PMD/PUD level, but they provide the benefits of huge pages,
> >> such as fewer faults, larger TLB coverage, fewer TLB misses etc.
> >>
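
For concreteness, the sizes these contiguous ranges produce can be worked out
with a little arithmetic. This is a minimal sketch, assuming arm64 with a 4K
granule, where a contiguous run is 16 entries at both the PTE and the PMD
level. The constants are written out locally for illustration and are not
taken from kernel headers; other granules use different run lengths.

```c
/* Assumed configuration: arm64, 4K base pages (not stated in the thread). */
#define PAGE_SHIFT 12UL  /* 4K base page */
#define PMD_SHIFT  21UL  /* one PMD entry maps 2M with 4K pages */
#define CONT_PTES  16UL  /* assumed entries per contiguous PTE run */
#define CONT_PMDS  16UL  /* assumed entries per contiguous PMD run */

/* Size covered by one run of contiguous PTE entries. */
unsigned long cont_pte_size(void)
{
	return CONT_PTES << PAGE_SHIFT;
}

/* Size covered by one run of contiguous PMD entries. */
unsigned long cont_pmd_size(void)
{
	return CONT_PMDS << PMD_SHIFT;
}
```

Under these assumptions the two helpers give 64K and 32M, which sit on either
side of the standard 2M PMD size.
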
> >> Currently they are used as HugeTLB pages because
> >>
> >> 	- HugeTLB page size is carried in the VMA
> >> 	- Page table walker can operate on multiple PTE or PMD entries given the size in the VMA
> >> 	- Irrespective of HugeTLB page size, it is operated on with set_huge_pte_at() at any level
> >> 	- set_huge_pte_at() is arch specific and knows how to encode multiple consecutive entries
> >> 	
> >> But not as THP huge pages because
> >>
> >> 	- THP size is not encoded anywhere, such as in the VMA
> >> 	- Page table walker expects it to be either at PUD (HPAGE_PUD_SIZE) or at PMD (HPAGE_PMD_SIZE)
> >> 	- Page table code operates directly with set_pmd_at() or set_pud_at()
> >> 	- Direct faulted or promoted huge pages are verified with [pmd|pud]_trans_huge()
> >>
> >> How non-standard huge pages can be supported for THP
> >>
> >> 	- THP starts recognizing non-standard huge page sizes (exported by arch) like HPAGE_CONT_(PMD|PTE)_SIZE
> >> 	- THP starts operating on either HPAGE_PMD_SIZE, HPAGE_CONT_PMD_SIZE or HPAGE_CONT_PTE_SIZE
> >> 	- set_pmd_at() only recognizes HPAGE_PMD_SIZE, hence replace set_pmd_at() with set_huge_pmd_at()
> >> 	- set_huge_pmd_at() could differentiate between HPAGE_PMD_SIZE and HPAGE_CONT_PMD_SIZE
> >> 	- In the case of HPAGE_CONT_PTE_SIZE, extend the page table walker down to the PTE level
> >> 	- Use set_huge_pte_at(), which can operate on multiple contiguous PTE entries
> > 
> > You only listed trivial things. All the tricky stuff is what makes THP
> > transparent.
> 
> Agreed. I was trying to draw an analogy from HugeTLB with respect to page
> table creation and its walking. Huge page collapse and split on such
> non-standard huge pages will involve taking care of many details.
> 
> > 
> > To consider it seriously we need to understand what it means for
> > split_huge_p?d()/split_huge_page(). How will khugepaged deal with this?
> 
> Absolutely. Can these operate on non-standard, probably multi-entry based,
> huge pages? How do we handle atomicity, etc.?

We need to handle split for them to provide transparency.

> > In particular, I'm worried about exposing (to user or CPU) page table state
> > in the middle of conversion (huge->small or small->huge). Handling this on
> > page table level provides a level of atomicity that you will not have.
> 
> I understand it might require a software based lock instead of standard HW
> atomicity constructs, which will make it slow, but is that even possible?

I'm not sure yet whether it is possible. I haven't wrapped my head around
the idea yet.

> > Honestly, I'm very skeptical about the idea. It took a lot of time to
> > stabilize THP for a single page size, equal to PMD page table, but this looks
> > like a new can of worms. :P
> 
> I understand your concern here, but HW providing some more TLB sizes beyond
> the standard page table level (PMD/PUD/PGD) based huge pages can help achieve
> a performance improvement when the buddy allocator is already too fragmented
> to provide higher order pages. PUD THP file mapping is already supported for
> DAX, and PUD THP anon mapping might be supported in the near future (it is not
> very challenging, other than that allocating an HPAGE_PUD_SIZE huge page at
> runtime will be much more difficult).

That's a bold claim. I would like to look at code. :)

Supporting more than one THP page size at the same time raises a lot more
questions besides the allocation path (although I'm sure compaction will be
happy about this).

For instance, what page size will you allocate for a given fault
address?
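
To make the question concrete, here is a minimal userspace sketch of one
possible policy: pick the largest candidate size whose naturally aligned
region around the fault address fits entirely inside the VMA. Everything in
it is hypothetical: struct vma_range, pick_fault_size() and the size list
(arm64 4K granule values) are made up for illustration and are not kernel
interfaces. It also ignores the hard parts, such as whether a page of that
order can actually be allocated and what is already mapped in the range.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers [start, end). */
struct vma_range {
	unsigned long start;
	unsigned long end;
};

/* Assumed candidate sizes, largest first: 32M, 2M, 64K, 4K. */
static const unsigned long candidate_sizes[] = {
	32UL << 20, 2UL << 20, 64UL << 10, 4UL << 10,
};

/*
 * Return the largest candidate size whose aligned region containing
 * addr lies fully inside the VMA, or 0 if no candidate fits.
 */
unsigned long pick_fault_size(const struct vma_range *vma, unsigned long addr)
{
	size_t n = sizeof(candidate_sizes) / sizeof(candidate_sizes[0]);

	for (size_t i = 0; i < n; i++) {
		unsigned long size = candidate_sizes[i];
		unsigned long base = addr & ~(size - 1); /* align down */

		if (base >= vma->start && base < vma->end &&
		    vma->end - base >= size)
			return size;
	}
	return 0;
}
```

Even this toy version shows that the answer depends on VMA layout and
alignment, before fragmentation or memory pressure enter the picture.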

How do you deal with pre-allocated page tables? Depositing 513 page tables
for a given PUD THP page might be fun. :P
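
One way to arrive at the 513 figure, assuming 4K pages and 512 entries per
page table (as on x86-64): splitting a PUD mapping all the way down to PTEs
needs one PMD table plus one PTE table for each of its 512 PMD entries. The
helper names below are hypothetical and only spell out that arithmetic.

```c
/* Assumed geometry: 4K pages, 512 entries per page table. */
#define PAGE_SHIFT   12UL
#define PTRS_PER_PMD 512UL

/* A PMD-sized THP needs one PTE table in reserve for a later split. */
unsigned long tables_for_pmd_thp(void)
{
	return 1;
}

/* A PUD-sized THP needs one PMD table plus a PTE table per PMD entry. */
unsigned long tables_for_pud_thp(void)
{
	return 1 + PTRS_PER_PMD * tables_for_pmd_thp();
}

/* Memory held by the deposited tables, one page each. */
unsigned long deposit_bytes(unsigned long tables)
{
	return tables << PAGE_SHIFT;
}
```

Under these assumptions, each PUD THP would keep a little over 2M of page
tables in reserve.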

> Sizes around PMD, like HPAGE_CONT_PMD_SIZE or
> HPAGE_CONT_PTE_SIZE, really have a better chance as a future non-PMD level anon
> mapping than PUD size anon mapping support in THP.
> 
> > 
> > It *might* be possible to support it for DAX, but beyond that...
> >
> 
> I did not get that. Why would you think that this is possible or appropriate
> only for DAX file mapping but not for anon mapping?

DAX THP is inherently simpler: no struct pages -- less state to track and
no need for split_huge_page(). split_huge_p?d() can be handled by dropping
the entities in question and re-faulting them as smaller entries. No problem
with compaction...

-- 
 Kirill A. Shutemov