lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9d5980e3-72e6-4848-b1ac-83ffab8522c4@redhat.com>
Date: Wed, 10 Jul 2024 05:52:43 +0200
From: David Hildenbrand <david@...hat.com>
To: Oscar Salvador <osalvador@...e.de>
Cc: Peter Xu <peterx@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Muchun Song <muchun.song@...ux.dev>, SeongJae Park <sj@...nel.org>,
 Miaohe Lin <linmiaohe@...wei.com>, Michal Hocko <mhocko@...e.com>,
 Matthew Wilcox <willy@...radead.org>,
 Christophe Leroy <christophe.leroy@...roup.eu>,
 Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH 00/45] hugetlb pagewalk unification

On 08.07.24 10:18, Oscar Salvador wrote:
> On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
>> My thinking was if "remove hugetlb_entry" cannot wait for "remove
>> page_walk", because we found a reasonable way to do it better and convert
>> the individual users. Maybe it can't.
>>
>> I've not given up hope that we can end up with something better and clearer
>> than the current page_walk API :)
> 
> Hi David,
> 

Hi!

> I agree that the current page_walk might be a bit convoluted, and that the
> indirect functions approach is a bit of a hassle.
> Having said that, let me clarify something.
> 
> Although this patchset touches the page_walk API wrt. getting rid of
> hugetlb special casing all over the place, my goal was not as focused on
> the page_walk as it was on the hugetlb code to gain hability to be
> interpreted on PUD/PMD level.

I understand that. And it would all be easier+more straight forward if 
we wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that 
works similar, but different to "ordinary" cont-pte for thp.

I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. 
If we, at one point *don't* use these hugetlb functions right now to 
modify hugetlb entries, we might be in trouble.

That's why I think we should maybe invest our time and effort in having 
a new pagewalker that will just batch such things naturally, and users 
that can operate on that naturally. For example: a hugetlb 
cont-pte-mapped folio will just naturally be reported as a "fully mapped 
folio", just like a THP would be if mapped in a compatible way.

Yes, this requires more work, but as raised in some patches here, 
working on individual PTEs/PMDs for hugetlb is problematic.

You have to batch every operation, to essentially teach ordinary code to 
do what the hugetlb_* special code would have done on cont-pte/cont-pmd 
things.


(as a side note, cont-pte/cont-pmd should primarily be a hint from arch 
code on how many entries we can batch, like we do in folio_pte_batch(); 
point is that we want to batch also on architectures where we don't have 
such bits, and prepare for architectures that implement various sizes of 
batching; IMHO, having cont-pte/cont-pmd checks in common code is likely 
the wrong approach. Again, folio_pte_batch() is where we tackled the 
problem differently from the THP perspective)

> 
> One of the things, among other things, that helped in creating this
> mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb
> __always__ operates on ptes, which means that we cannot rely on the mm
> core to do the right thing, and we need a bunch of hugetlb-pte functions
> that knows about their thing, so we lean on that.
> 
> IMHO, that was a mistake to start with, but I was not around when it was
> introduced and maybe there were good reasons to deal with that the way
> it is done.
> But, the thing is that my ultimate goal, is for hugetlb code to be able
> to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
> mm core does for THP (PUD is not supported by THP, but you get me), and
> that is not that difficult to do, as this patchset tries to prove.
> 
> Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
> means we need to add a fairly amount of code. e.g: for operating
> hugepages on PUD level, code for markers on PUD/PMD level for
> uffd/poison stuff (only dealt
> on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
> Basically, almost all we did for PMD-* stuff we need it for PUD as well,
> and that will be around when THP gains support for PUD if it ever does
> (I guess that in a few years if memory capacity keeps increasing).
> 
> E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
>       is_hwpoison_entry
> 
> Yes, it is a hassle to have more code around, but IMO, this new code
> will help us in 1) move away from __always__ operate on ptes 2) ease
> integrate hugetlb code into mm core.
> 
> I will keep working on this patchset not because of pagewalk savings,
> but because I think it will help us in have hugetlb more mm-core ready,
> since the current pagewalk has to test that a hugetlb page can be
> properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
> hwpoison entries for swp on PUD/PMD, pud invalidating, etc.
> 
> If that gets accomplished, I think that a fair amount of code that lives
> in hugetlb.c can be deleted/converted as less special casing will be needed.
> 
> I might be wrong and maybe I will hit a brick wall, but hopefully not.

I have an idea for a better page table walker API that would try 
batching most entries (under one PTL), and walkers can just register for 
the types they want. Hoping I will find some time to at least scetch the 
user interface soon.

That doesn't mean that this should block your work, but the 
cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I 
don't particularly like where this is going.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ