Message-ID: <200da552-4fc7-44d8-bbea-1669b4b45cf5@lucifer.local>
Date: Fri, 18 Jul 2025 11:15:47 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
xen-devel@...ts.xenproject.org, linux-fsdevel@...r.kernel.org,
nvdimm@...ts.linux.dev, Andrew Morton <akpm@...ux-foundation.org>,
Juergen Gross <jgross@...e.com>,
Stefano Stabellini <sstabellini@...nel.org>,
Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>,
Dan Williams <dan.j.williams@...el.com>,
Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Jann Horn <jannh@...gle.com>, Pedro Falcato <pfalcato@...e.de>,
Hugh Dickins <hughd@...gle.com>, Oscar Salvador <osalvador@...e.de>,
Lance Yang <lance.yang@...ux.dev>
Subject: Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to
print_bad_page_map()
On Thu, Jul 17, 2025 at 10:03:31PM +0200, David Hildenbrand wrote:
> > > The report will now look something like (dumping pgd to pmd values):
> > >
> > > [ 77.943408] BUG: Bad page map in process XXX entry:80000001233f5867
> > > [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > > [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> > >
> > > Not using pgdp_get(), because that does not work properly on some arm
> > > configs where pgd_t is an array. Note that we are dumping all levels
> > > even when levels are folded for simplicity.
> >
> > Oh god. I reviewed this below. BUT OH GOD. What. Why???
>
> Exactly.
>
> I wish this patch wouldn't exist, but Hugh convinced me that apparently it
> can be useful in the real world.
Yeah... I mean, out of scope here, but that sounds dubious. On the other
hand, we use typedefs for page table values etc., so we do make this
possible.
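(Hedged aside, paraphrasing from memory of the 32-bit arm headers -- the
exact names may be off: with !STRICT_MM_TYPECHECKS, pgd_t is literally an
array type, so a generic pgdp_get() that returns *pgdp cannot even compile
there:)

	/* Sketch, assumed from arch/arm pgtable-2level-types.h: */
	typedef unsigned long pmdval_t;
	typedef pmdval_t pgd_t[2];	/* pgd_t is an array type */
	#define pgd_val(x)	((x)[0])

	/*
	 * A generic
	 *	static inline pgd_t pgdp_get(pgd_t *pgdp) { return *pgdp; }
	 * would have to return an array type, which C does not allow, so
	 * the patch dereferences and uses pgd_val(*pgdp) directly.
	 */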
>
> >
> > >
> > > Signed-off-by: David Hildenbrand <david@...hat.com>
> > > ---
> > > mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> > > 1 file changed, 94 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 173eb6267e0ac..08d16ed7b4cc7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > > add_mm_counter(mm, i, rss[i]);
> > > }
> > >
> > > -/*
> > > - * This function is called to print an error when a bad pte
> > > - * is found. For example, we might have a PFN-mapped pte in
> > > - * a region that doesn't allow it.
> > > - *
> > > - * The calling function must still handle the error.
> > > - */
> > > -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > - pte_t pte, struct page *page)
> > > +static bool is_bad_page_map_ratelimited(void)
> > > {
> > > - pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > - p4d_t *p4d = p4d_offset(pgd, addr);
> > > - pud_t *pud = pud_offset(p4d, addr);
> > > - pmd_t *pmd = pmd_offset(pud, addr);
> > > - struct address_space *mapping;
> > > - pgoff_t index;
> > > static unsigned long resume;
> > > static unsigned long nr_shown;
> > > static unsigned long nr_unshown;
> > > @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > if (nr_shown == 60) {
> > > if (time_before(jiffies, resume)) {
> > > nr_unshown++;
> > > - return;
> > > + return true;
> > > }
> > > if (nr_unshown) {
> > > pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> > > @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > }
> > > if (nr_shown++ == 0)
> > > resume = jiffies + 60 * HZ;
> > > + return false;
> > > +}
> > > +
> > > +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + unsigned long long pgdv, p4dv, pudv, pmdv;
> >
> > > + p4d_t p4d, *p4dp;
> > > + pud_t pud, *pudp;
> > > + pmd_t pmd, *pmdp;
> > > + pgd_t *pgdp;
> > > +
> > > + /*
> > > + * This looks like a fully lockless walk, however, the caller is
> > > + * expected to hold the leaf page table lock in addition to other
> > > + * rmap/mm/vma locks. So this is just a re-walk to dump page table
> > > + * content while any concurrent modifications should be completely
> > > + * prevented.
> > > + */
> >
> > Hmmm :)
> >
> > Why aren't we trying to lock at leaf level?
>
> Ehm, did you read the:
>
> "the caller is expected to hold the leaf page table lock"
>
> :)
Yeah, sorry, I was in 'what locks do we need' mode and hadn't shifted back
here, but I guess the intent is that the caller _must_ hold this lock.
I know it's nitty and annoying (sorry!), but as asserting seems not to be
possible here, could we spell these out as a series of points like:
/*
* The caller MUST hold the following locks:
*
* - Leaf page table lock
* - Appropriate VMA lock to keep VMA stable
*/
I don't _actually_ think you need the rmap lock then; with these locks held,
none of the page tables you access would be impacted by any rmap action,
afaict.
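To make the contract concrete, a minimal usage sketch (hypothetical caller,
not from the patch; pte_looks_corrupt() is made up, and pmd/addr come from
the enclosing walk):

	spinlock_t *ptl;
	pte_t *ptep, pte;

	ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	if (!ptep)
		return;
	pte = ptep_get(ptep);
	/* Leaf PTL is held, so the re-walk in
	 * __dump_bad_page_map_pgtable() sees a stable tree. */
	if (pte_looks_corrupt(pte))
		print_bad_page_map(vma, addr, pte_val(pte), NULL);
	pte_unmap_unlock(ptep, ptl);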
>
>
> >
> > We need to:
> >
> > - Keep VMA stable which prevents unmap page table teardown and khugepaged
> > collapse.
> > - (not relevant as we don't traverse PTE table but) RCU lock for PTE
> > entries to avoid MADV_DONTNEED page table withdrawal.
> >
> > Buuut if we're not locking at leaf level, we leave ourselves open to racing
> > faults, zaps, etc. etc.
>
> Yes. I can clarify in the comment of print_bad_page_map() that it is
> expected to be called with the PTL (unwritten rule, not changing that).
Right, see above, just needs more clarity then.
>
> >
> > So perhaps this why you require such strict conditions...
> >
> > But can you truly be sure of these existing? And we should then assert them
> > here no? For rmap though we'd need the folio/vma.
>
> I hope you realize that this nasty piece of code is only called when our
> system is already running into something extremely unexpected and will
> probably be dead soon.
>
> So I am not too interested in adding anything more here. If you run into
> this code, you're in big trouble already.
Yes, am aware :) my concern is a NULL ptr deref or UAF, but with the locks
held as stated those won't occur.
But if it's not sensible to do it then we don't have to :) I am a reasonable
man, or like to think I am ;)
But I think we need clarity as per the above.
>
> >
> > > + pgdp = pgd_offset(mm, addr);
> > > + pgdv = pgd_val(*pgdp);
> >
> > Before I went and looked again at the commit msg I said:
> >
> > "Shoudln't we strictly speaking use pgdp_get()? I see you use this
> > helper for other levels."
> >
> > But obviously yeah. You explained the insane reason why not.
>
> Had to find out the hard way ... :)
Pain.
>
> [...]
>
> > > +/*
> > > + * This function is called to print an error when a bad page table entry (e.g.,
> > > + * corrupted page table entry) is found. For example, we might have a
> > > + * PFN-mapped pte in a region that doesn't allow it.
> > > + *
> > > + * The calling function must still handle the error.
> > > + */
> >
> > We have extremely strict locking conditions for the page table traversal... but
> > no mention of them here?
>
> Yeah, I can add that.
Thanks!
>
> >
> > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > + unsigned long addr, unsigned long long entry, struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > + pgoff_t index;
> > > +
> > > + if (is_bad_page_map_ratelimited())
> > > + return;
> > >
> > > mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > > index = linear_page_index(vma, addr);
> > >
> > > - pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
> > > - current->comm,
> > > - (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > + pr_alert("BUG: Bad page map in process %s entry:%08llx", current->comm, entry);
> >
> > Sort of wonder if this is even useful if you don't know what the 'entry'
> > is? But I guess the dump below will tell you.
>
> You probably missed this in the patch description:
>
> "Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future, we
> could pass the entry level type similar to "enum rmap_level". For now, let's
> keep it simple."
Yeah, sorry, I glossed over the commit msg and now I pay for it ;) OK, this
is fine then.
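(For the record, a hedged sketch of what that 'pass the entry level' variant
might look like -- all names here hypothetical, not from the patch:)

	enum pgtable_level {
		PGTABLE_LEVEL_PTE,
		PGTABLE_LEVEL_PMD,
		PGTABLE_LEVEL_PUD,
	};

	static void print_bad_page_map_level(struct vm_area_struct *vma,
			unsigned long addr, unsigned long long entry,
			enum pgtable_level level, struct page *page)
	{
		static const char * const lvl[] = { "pte", "pmd", "pud" };

		pr_alert("BUG: Bad page map in process %s %s:%08llx",
			 current->comm, lvl[level], entry);
		/* ... rest as in print_bad_page_map() ... */
	}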
>
> >
> > Though maybe it's actually useful to see flags etc. in case some horrid
> > corruption happened and maybe the dump isn't valid? But then the dump
> > assumes strict conditions to work, so... can that happen?
>
> Not sure what you mean, can you elaborate?
>
> If your system is falling apart completely, I'm afraid this report here
> will not save you.
>
> You'll probably get a flood of 60 ... if your system doesn't collapse before
> that.
I was musing wistfully about the value of outputting the entries. But I
guess _any_ information before a crash is useful. And your dump output will
be a lot more useful :)
>
> >
> > > + __dump_bad_page_map_pgtable(vma->vm_mm, addr);
> > > if (page)
> > > - dump_page(page, "bad pte");
> > > + dump_page(page, "bad page map");
> > > pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> > > (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > > pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> > > @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > if (is_zero_pfn(pfn))
> > > return NULL;
> > >
> > > - print_bad_pte(vma, addr, pte, NULL);
> > > + print_bad_page_map(vma, addr, pte_val(pte), NULL);
> > > return NULL;
> > > }
> > >
> > > @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >
> > > check_pfn:
> > > if (unlikely(pfn > highest_memmap_pfn)) {
> > > - print_bad_pte(vma, addr, pte, NULL);
> > > + print_bad_page_map(vma, addr, pte_val(pte), NULL);
> >
> > This is unrelated to your series, but I guess this is for cases where
> > you're e.g. iomapping or such? So it's not something in the memmap but it's
> > a PFN that might reference io memory or such?
>
> No, it's just an easy check for catching corruptions. In a later patch I add
> a comment.
Ohhh right. I was overthinking this haha
>
> >
> > > return NULL;
> > > }
> > >
> > > @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > {
> > > unsigned long pfn = pmd_pfn(pmd);
> > >
> > > - if (unlikely(pmd_special(pmd)))
> > > + if (unlikely(pmd_special(pmd))) {
> > > + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > + return NULL;
> >
> > I guess we'll bring this altogether in a later patch with vm_normal_page()
> > as getting a little duplicative :P
>
> Not that part, but the other part :)
Yes :)
>
> >
> > Makes me think that VM_SPECIAL is kind of badly named (other than the fact
> > that 'special' is nebulous and overloaded in general) in that it contains
> > stuff that is VMA-special, but only VM_PFNMAP | VM_MIXEDMAP really
> > indicates specialness wrt the underlying folio.
>
> It is.
Yeah I think we in mm periodically moan about this whole thing...
>
> >
> > Then we have VM_IO, which strictly must not have an associated page right?
>
> VM_IO just means read/write side-effects; I think you could easily have
> ones with a memmap ... e.g., a memory section (128MiB) spanning both memory
> and MMIO regions.
Hmm, but why not have two separate VMAs? I guess I need to look more into
what this flag actually affects.
But in terms of VMA manipulation it really is no different from e.g.
VM_PFNMAP in what it affects. Though, well, I say that... except of course
for this whole vm_normal_page() thing!
But I've not seen VM_IO set alone anywhere. On the other hand, I haven't
really dug deep into this...
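(Hedged recollection: remap_pfn_range() itself sets the two flags together,
which would explain why VM_IO rarely shows up alone:)

	/* Paraphrased from mm/memory.c remap_pfn_range_notrack() -- the
	 * flags travel as a set on PFN-remapped VMAs: */
	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);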
>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>
Cheers, Lorenzo