Message-ID: <84d4e799-90da-487e-adba-6174096283b5@redhat.com>
Date: Thu, 4 Jul 2024 17:23:30 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>
Cc: Oscar Salvador <osalvador@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Muchun Song <muchun.song@...ux.dev>,
SeongJae Park <sj@...nel.org>, Miaohe Lin <linmiaohe@...wei.com>,
Michal Hocko <mhocko@...e.com>, Matthew Wilcox <willy@...radead.org>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH 00/45] hugetlb pagewalk unification
On 04.07.24 16:30, Peter Xu wrote:
> Hey, David,
>
Hi!
> On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
>> There are roughly two categories of page table walkers we have:
>>
>> 1) We actually only want to walk present folios (to be precise, page
>> ranges of folios). We should look into moving away from the
>> page walker API where possible, and have something better that
>> directly gives us the folio (page ranges). Any PTE batching would be
>> done internally.
>>
>> 2) We want to deal with non-present folios as well (swp entries and all
>> kinds of other stuff). We should maybe implement our custom page
>> table walker and move away from walk_page_range(). We are not walking
>> "pages" after all but everything else included :)
>>
>> Then, there is a subset of 1) where we only want to walk to a single address
>> (a single folio). I'm working on that right now to get rid of follow_page()
>> and some (IIRC 3: KSM and DAMON) walk_page_range() users. Hugetlb will still
>> remain a bit special, but I'm afraid we cannot hide that completely.
>
> Maybe you are talking about the generic concept of "page table walker", not
> walk_page_range() explicitly?
>
> I'd agree if it's about the generic concept. For example, follow_page()
> definitely is tailored for getting the page/folio. But just to mention
> Oscar's series is only working on the page_walk API itself. What I see so
> far is most of the walk_page API users aren't described above - most of
> them do not fall into category 1) at all, if any. And they either need to
> fetch something from the pgtable where having the folio isn't enough, or
> modify the pgtable for different reasons.
Right, but having 1) does not imply that we won't have access to
the page table entry in an abstracted form; the folio is simply the
primary source of information that these users care about. 2) is an
extension of 1), but it also walks and exposes all (or most) other page
table entries in some form, which is certainly harder to get right.
Taking a look at some examples:
* madvise_cold_or_pageout_pte_range() only cares about present folios.
* madvise_free_pte_range() only cares about present folios.
* break_ksm_ops() only cares about present folios.
* mlock_walk_ops() only cares about present folios.
* damon_mkold_ops() only cares about present folios.
* damon_young_ops() only cares about present folios.
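To illustrate what a category 1) interface might look like for users like
these, here is a toy userspace sketch. It is not kernel code: every type and
function name below (toy_pte, toy_walk_present_folios(), ...) is hypothetical,
and "same folio" is reduced to comparing an integer ID, whereas real PTE
batching would also have to check things like PFN contiguity. The point is
only that the callback sees folio ranges and the batching stays internal.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these types or functions exist in the kernel. */
enum toy_pte_kind { TOY_NONE, TOY_PRESENT, TOY_SWAP };

struct toy_pte {
	enum toy_pte_kind kind;
	int folio_id;	/* meaningful only when kind == TOY_PRESENT */
};

/* Category 1) callback: one call per contiguous range of a single folio. */
typedef void (*toy_folio_fn)(int folio_id, size_t first_idx,
			     size_t nr_ptes, void *priv);

/*
 * Walk only present entries. Consecutive entries that map the same folio
 * are batched internally, so the caller never handles individual PTEs.
 * Returns the number of folio ranges reported to the callback.
 */
static size_t toy_walk_present_folios(const struct toy_pte *ptes, size_t n,
				      toy_folio_fn fn, void *priv)
{
	size_t i = 0, ranges = 0;

	assert(fn != NULL);
	while (i < n) {
		size_t start = i;

		if (ptes[i].kind != TOY_PRESENT) {
			i++;	/* skip none/swap entries entirely */
			continue;
		}
		/* Extend the batch while the same folio stays mapped. */
		while (i < n && ptes[i].kind == TOY_PRESENT &&
		       ptes[i].folio_id == ptes[start].folio_id)
			i++;
		fn(ptes[start].folio_id, start, i - start, priv);
		ranges++;
	}
	return ranges;
}

/* Example callback: accumulate how many PTEs map present folios. */
static void toy_count_ptes(int folio_id, size_t first_idx,
			   size_t nr_ptes, void *priv)
{
	(void)folio_id;
	(void)first_idx;
	*(size_t *)priv += nr_ptes;
}
```

A caller would pass an array of toy_pte entries plus toy_count_ptes() and a
size_t accumulator. A category 2) variant of the same sketch would also have
to invoke a callback for the TOY_NONE and TOY_SWAP entries instead of skipping
them.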
There are certainly other page_walk API users that are more involved and
need to do way more magic, which fall into category 2). In particular
things like swapin_walk_ops(), hmm_walk_ops() and most of
fs/proc/task_mmu.c. Likely there are plenty of them.
Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there
even is left in using walk_page_range() :)
>
> A generic pgtable walker looks still wanted at some point, but it can be
> too involved to be introduced together with this "remove hugetlb_entry"
> effort.
My thinking was whether "remove hugetlb_entry" could wait for "remove
page_walk", in case we find a reasonable way to do it better and
convert the individual users. Maybe it can't.
I've not given up hope that we can end up with something better and
clearer than the current page_walk API :)
>
> To me, that future work is not yet about "get the folio, ignore the
> pgtable", but about how to abstract different layers of pgtables, so the
> caller may get a generic concept of "one pgtable entry" with the level/size
> information attached, and process it at a single place / hook, and perhaps
> hopefully even work with a device pgtable, as long as it's a radix tree.
To me 2) is an extension of 1). My thinking is that we can start with 1)
without having to care about all details of 2). If we have to make it so
generic that we can walk any page table layout out there in this world,
I'm not so sure.
--
Cheers,
David / dhildenb