lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5a1281bc-f423-4558-8052-009458f1390d@lucifer.local>
Date: Wed, 12 Nov 2025 14:06:52 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Zi Yan <ziy@...dia.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        Christian Borntraeger <borntraeger@...ux.ibm.com>,
        Janosch Frank <frankja@...ux.ibm.com>,
        Claudio Imbrenda <imbrenda@...ux.ibm.com>,
        David Hildenbrand <david@...hat.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
        Sven Schnelle <svens@...ux.ibm.com>, Peter Xu <peterx@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
        Arnd Bergmann <arnd@...db.de>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
        Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
        Lance Yang <lance.yang@...ux.dev>, Muchun Song <muchun.song@...ux.dev>,
        Oscar Salvador <osalvador@...e.de>, Vlastimil Babka <vbabka@...e.cz>,
        Mike Rapoport <rppt@...nel.org>,
        Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
        Matthew Brost <matthew.brost@...el.com>,
        Joshua Hahn <joshua.hahnjy@...il.com>, Rakie Kim <rakie.kim@...com>,
        Byungchul Park <byungchul@...com>, Gregory Price <gourry@...rry.net>,
        Ying Huang <ying.huang@...ux.alibaba.com>,
        Alistair Popple <apopple@...dia.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
        Kemeng Shi <shikemeng@...weicloud.com>,
        Kairui Song <kasong@...cent.com>, Nhat Pham <nphamcs@...il.com>,
        Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>,
        SeongJae Park <sj@...nel.org>, Matthew Wilcox <willy@...radead.org>,
        Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>,
        Xu Xin <xu.xin16@....com.cn>,
        Chengming Zhou <chengming.zhou@...ux.dev>,
        Jann Horn <jannh@...gle.com>, Miaohe Lin <linmiaohe@...wei.com>,
        Naoya Horiguchi <nao.horiguchi@...il.com>,
        Pedro Falcato <pfalcato@...e.de>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Rik van Riel <riel@...riel.com>, Harry Yoo <harry.yoo@...cle.com>,
        Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org, linux-s390@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        linux-arch@...r.kernel.org, damon@...ts.linux.dev
Subject: Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to
 simplify leaf entry logic

On Tue, Nov 11, 2025 at 11:40:32AM -0500, Zi Yan wrote:
> On 11 Nov 2025, at 2:31, Lorenzo Stoakes wrote:
>
> > On Mon, Nov 10, 2025 at 10:56:33PM -0500, Zi Yan wrote:
> >> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
> >>
> >>> The kernel maintains leaf page table entries which contain either:
> >>>
> >>> - Nothing ('none' entries)
> >>> - Present entries (that is stuff the hardware can navigate without fault)
> >>> - Everything else that will cause a fault which the kernel handles
> >>>
> >>> In the 'everything else' group we include swap entries, but we also include
> >>> a number of other things such as migration entries, device private entries
> >>> and marker entries.
> >>>
> >>> Unfortunately this 'everything else' group expresses everything through
> >>> a swp_entry_t type, and these entries are referred to swap entries even
> >>> though they may well not contain a... swap entry.
> >>>
> >>> This is compounded by the rather mind-boggling concept of a non-swap swap
> >>> entry (checked via non_swap_entry()) and the means by which we twist and
> >>> turn to satisfy this.
> >>>
> >>> This patch lays the foundation for reducing this confusion.
> >>>
> >>> We refer to 'everything else' as a 'software-define leaf entry' or
> >>> 'softleaf'. for short And in fact we scoop up the 'none' entries into this
> >>> concept also so we are left with:
> >>>
> >>> - Present entries.
> >>> - Softleaf entries (which may be empty).
> >>>
> >>> This allows for radical simplification across the board - one can simply
> >>> convert any leaf page table entry to a leaf entry via softleaf_from_pte().
> >>>
> >>> If the entry is present, we return an empty leaf entry, so it is assumed
> >>> the caller is aware that they must differentiate between the two categories
> >>> of page table entries, checking for the former via pte_present().
> >>>
> >>> As a result, we can eliminate a number of places where we would otherwise
> >>> need to use predicates to see if we can proceed with leaf page table entry
> >>> conversion and instead just go ahead and do it unconditionally.
> >>>
> >>> We do so where we can, adjusting surrounding logic as necessary to
> >>> integrate the new softleaf_t logic as far as seems reasonable at this
> >>> stage.
> >>>
> >>> We typedef swp_entry_t to softleaf_t for the time being until the
> >>> conversion can be complete, meaning everything remains compatible
> >>> regardless of which type is used. We will eventually remove swp_entry_t
> >>> when the conversion is complete.
> >>>
> >>> We introduce a new header file to keep things clear - leafops.h - this
> >>> imports swapops.h so can direct replace swapops imports without issue, and
> >>> we do so in all the files that require it.
> >>>
> >>> Additionally, add new leafops.h file to core mm maintainers entry.
> >>>
> >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> >>> ---
> >>>  MAINTAINERS                   |   1 +
> >>>  fs/proc/task_mmu.c            |  26 +--
> >>>  fs/userfaultfd.c              |   6 +-
> >>>  include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
> >>>  include/linux/mm_inline.h     |   6 +-
> >>>  include/linux/mm_types.h      |  25 +++
> >>>  include/linux/swapops.h       |  28 ---
> >>>  include/linux/userfaultfd_k.h |  51 +----
> >>>  mm/hmm.c                      |   2 +-
> >>>  mm/hugetlb.c                  |  37 ++--
> >>>  mm/madvise.c                  |  16 +-
> >>>  mm/memory.c                   |  41 ++--
> >>>  mm/mincore.c                  |   6 +-
> >>>  mm/mprotect.c                 |   6 +-
> >>>  mm/mremap.c                   |   4 +-
> >>>  mm/page_vma_mapped.c          |  11 +-
> >>>  mm/shmem.c                    |   7 +-
> >>>  mm/userfaultfd.c              |   6 +-
> >>>  18 files changed, 502 insertions(+), 164 deletions(-)
> >>>  create mode 100644 include/linux/leafops.h
> >>>
> >>> diff --git a/MAINTAINERS b/MAINTAINERS
> >>> index 2628431dcdfe..314910a70bbf 100644
> >>> --- a/MAINTAINERS
> >>> +++ b/MAINTAINERS
> >>> @@ -16257,6 +16257,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >>>  F:	include/linux/gfp.h
> >>>  F:	include/linux/gfp_types.h
> >>>  F:	include/linux/highmem.h
> >>> +F:	include/linux/leafops.h
> >>>  F:	include/linux/memory.h
> >>>  F:	include/linux/mm.h
> >>>  F:	include/linux/mm_*.h
> >>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>> index fc35a0543f01..24d26b49d870 100644
> >>> --- a/fs/proc/task_mmu.c
> >>> +++ b/fs/proc/task_mmu.c
> >>> @@ -14,7 +14,7 @@
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/swap.h>
> >>>  #include <linux/sched/mm.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>  #include <linux/mmu_notifier.h>
> >>>  #include <linux/page_idle.h>
> >>>  #include <linux/shmem_fs.h>
> >>> @@ -1230,11 +1230,11 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> >>>  	if (pte_present(ptent)) {
> >>>  		folio = page_folio(pte_page(ptent));
> >>>  		present = true;
> >>> -	} else if (is_swap_pte(ptent)) {
> >>> -		swp_entry_t swpent = pte_to_swp_entry(ptent);
> >>> +	} else {
> >>> +		const softleaf_t entry = softleaf_from_pte(ptent);
> >>>
> >>> -		if (is_pfn_swap_entry(swpent))
> >>> -			folio = pfn_swap_entry_folio(swpent);
> >>> +		if (softleaf_has_pfn(entry))
> >>> +			folio = softleaf_to_folio(entry);
> >>>  	}
> >>>
> >>>  	if (folio) {
> >>
> >> <snip>
> >>
> >>>
> >>> @@ -2330,18 +2330,18 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
> >>>  		if (pte_soft_dirty(pte))
> >>>  			categories |= PAGE_IS_SOFT_DIRTY;
> >>>  	} else if (is_swap_pte(pte)) {
> >>
> >> This should be just “else” like smaps_hugetlb_range()’s change, right?
> >
> > This is code this patch doesn't touch? :) It's not my fault...
> >
> > Actually in a follow-up patch I do exactly this, taking advantage of the fact
> > that we handle none entries automatically in softleaf_from_pte().
> >
> > But it's onne step at a time here to make it easier to review/life easier on
> > bisect in case there's any mistakes.
>
> OK.

Yeah you're ahead of the game :)

>
> >
> >>
> >>> -		swp_entry_t swp;
> >>> +		softleaf_t entry;
> >>>
> >>>  		categories |= PAGE_IS_SWAPPED;
> >>>  		if (!pte_swp_uffd_wp_any(pte))
> >>>  			categories |= PAGE_IS_WRITTEN;
> >>>
> >>> -		swp = pte_to_swp_entry(pte);
> >>> -		if (is_guard_swp_entry(swp))
> >>> +		entry = softleaf_from_pte(pte);
> >>> +		if (softleaf_is_guard_marker(entry))
> >>>  			categories |= PAGE_IS_GUARD;
> >>>  		else if ((p->masks_of_interest & PAGE_IS_FILE) &&
> >>> -			 is_pfn_swap_entry(swp) &&
> >>> -			 !folio_test_anon(pfn_swap_entry_folio(swp)))
> >>> +			 softleaf_has_pfn(entry) &&
> >>> +			 !folio_test_anon(softleaf_to_folio(entry)))
> >>>  			categories |= PAGE_IS_FILE;
> >>>
> >>>  		if (pte_swp_soft_dirty(pte))
> >>
> >> <snip>
> >>
> >>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> >>> index 137ce27ff68c..be20468fb5a9 100644
> >>> --- a/mm/page_vma_mapped.c
> >>> +++ b/mm/page_vma_mapped.c
> >>> @@ -3,7 +3,7 @@
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/hugetlb.h>
> >>>  #include <linux/swap.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>
> >>>  #include "internal.h"
> >>>
> >>> @@ -107,15 +107,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
> >>>  	pte_t ptent = ptep_get(pvmw->pte);
> >>>
> >>>  	if (pvmw->flags & PVMW_MIGRATION) {
> >>> -		swp_entry_t entry;
> >>> -		if (!is_swap_pte(ptent))
> >>> -			return false;
> >>> -		entry = pte_to_swp_entry(ptent);
> >>> +		const softleaf_t entry = softleaf_from_pte(ptent);
> >>
> >> We do not need is_swap_pte() check here because softleaf_from_pte()
> >> does the check. Just trying to reason the code with myself here.
> >
> > Right, see the next patch :) I'm laying the groundwork for us to be able to do
> > that.
> >
> >>
> >>>
> >>> -		if (!is_migration_entry(entry))
> >>> +		if (!softleaf_is_migration(entry))
> >>>  			return false;
> >>>
> >>> -		pfn = swp_offset_pfn(entry);
> >>> +		pfn = softleaf_to_pfn(entry);
> >>>  	} else if (is_swap_pte(ptent)) {
> >>>  		swp_entry_t entry;
> >>>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index 6580f3cd24bb..395ca58ac4a5 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -66,7 +66,7 @@ static struct vfsmount *shm_mnt __ro_after_init;
> >>>  #include <linux/falloc.h>
> >>>  #include <linux/splice.h>
> >>>  #include <linux/security.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>  #include <linux/mempolicy.h>
> >>>  #include <linux/namei.h>
> >>>  #include <linux/ctype.h>
> >>> @@ -2286,7 +2286,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>  	struct address_space *mapping = inode->i_mapping;
> >>>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> >>>  	struct shmem_inode_info *info = SHMEM_I(inode);
> >>> -	swp_entry_t swap, index_entry;
> >>> +	swp_entry_t swap;
> >>> +	softleaf_t index_entry;
> >>>  	struct swap_info_struct *si;
> >>>  	struct folio *folio = NULL;
> >>>  	bool skip_swapcache = false;
> >>> @@ -2298,7 +2299,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>  	swap = index_entry;
> >>>  	*foliop = NULL;
> >>>
> >>> -	if (is_poisoned_swp_entry(index_entry))
> >>> +	if (softleaf_is_poison_marker(index_entry))
> >>>  		return -EIO;
> >>>
> >>>  	si = get_swap_device(index_entry);
> >>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >>> index cc4ce205bbec..055ec1050776 100644
> >>> --- a/mm/userfaultfd.c
> >>> +++ b/mm/userfaultfd.c
> >>> @@ -10,7 +10,7 @@
> >>>  #include <linux/pagemap.h>
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/swap.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>  #include <linux/userfaultfd_k.h>
> >>>  #include <linux/mmu_notifier.h>
> >>>  #include <linux/hugetlb.h>
> >>> @@ -208,7 +208,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> >>>  	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
> >>>  	 * page cache page backing it, then access the page.
> >>>  	 */
> >>> -	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
> >>> +	if (!pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
> >>>  		goto out_unlock;
> >>>
> >>>  	if (page_in_cache) {
> >>> @@ -590,7 +590,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> >>>  		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> >>>  			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
> >>>
> >>> -			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
> >>> +			if (!huge_pte_none(ptep) && !pte_is_uffd_marker(ptep)) {
> >>>  				err = -EEXIST;
> >>>  				hugetlb_vma_unlock_read(dst_vma);
> >>>  				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> >>
> >> The rest of the code looks good to me. I will check it again once
> >> you fix the commit log and comments. Thank you for working on this.
> >
> > As I said before I'm not respinning this entire series to change every single
> > reference to present/none to include one or several paragraphs about how we
> > hacked in protnone and other such things.
>
> No, I do not want you to do that.

Good! :)

>
> >
> > If I have to respin the series, I'll add a reference in the commit log.
> >
> > I beleive the only pertinent comment is:
> >
> > + * If referencing another page table or a data page then the page table entry is
> > + * pertinent to hardware - that is it tells the hardware how to decode the page
> > + * table entry.
>
> I would just remove “(that is stuff the hardware can navigate without fault)”.
> People can look at the definition of present entries to get the categorization.
> Basically, you just need to only talk about present entries without mentioning
> whether it is HW accessible or not, since that is another can of worms.

OK I'll ask Andrew to update.

>
> >
> > From the softleaf_t kdoc.
> >
> > I think this is fine as-is - protnone entries or _PAGE_PSE-only PMD entries
> > _are_ pertinent to the hardware fault handler, literally every bit except for
> > the present bit are set ready for the hardware to decode, telling it how to
> > decode the leaf entry.
>
> After reading it again, I agree the kdoc looks good.

Thanks!

>
> >
> > Rather than adding additional confusion by citing this stuff and probably
> > whatever awful architecture-specific stuff lurks in the arch/ directory I think
> > we are fine as-is.
> >
> > Again we decided as a community to hack this stuff in so we as a community have
> > to live with it like a guy who puts a chimney on his car :)
> >
> > (mm has many such chimneys on a car that only Homer Simpson would be proud of)
>
> Yeah, it is not pretty, but that is how people get their work done. ;)

Well, it's also how people mess things up :)

I wish we'd done it differently, ideally by separating hardware and sofware page
table state.

pte_sw_present() vs. pte_hw_present() would have cleared this up immediately.

Maybe one I should do at some point...

>
> Anyway, feel free to add Acked-by: Zi Yan <ziy@...dia.com>

Thanks!

>
> Best Regards,
> Yan, Zi

Your review is much appreciated :)

Cheers, Lorenzo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ