[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e92b75d4-999b-4b0f-b62f-9edca09834c4@redhat.com>
Date: Wed, 16 Jul 2025 10:40:04 +0200
From: David Hildenbrand <david@...hat.com>
To: linux-kernel@...r.kernel.org
Cc: linux-mm@...ck.org, xen-devel@...ts.xenproject.org,
linux-fsdevel@...r.kernel.org, nvdimm@...ts.linux.dev,
Andrew Morton <akpm@...ux-foundation.org>, Juergen Gross <jgross@...e.com>,
Stefano Stabellini <sstabellini@...nel.org>,
Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>,
Dan Williams <dan.j.williams@...el.com>, Matthew Wilcox
<willy@...radead.org>, Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka
<vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Jann Horn <jannh@...gle.com>, Pedro Falcato <pfalcato@...e.de>,
Hugh Dickins <hughd@...gle.com>, Oscar Salvador <osalvador@...e.de>,
Lance Yang <lance.yang@...ux.dev>, Russell King <linux@...linux.org.uk>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v1 6/9] mm/memory: convert print_bad_pte() to
print_bad_page_map()
On 15.07.25 15:23, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed a light).
>
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
>
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function -- we'll rename it to what
> we actually print: bad (page) mappings. Maybe it should be called
> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
> because the assumption is that we are dealing with some (previously)
> present page table entry that got corrupted in weird ways.
>
> Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future,
> we could pass the entry level type similar to "enum rmap_level". For now,
> let's keep it simple.
>
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the dumping of page
> table content into __dump_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
>
> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
> for vm_normal_page(), now that we have a function that can handle it.
>
> The report will now look something like (dumping pgd to pmd values):
>
> [ 77.943408] BUG: Bad page map in process XXX entry:80000001233f5867
> [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Signed-off-by: David Hildenbrand <david@...hat.com>
> ---
> mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 94 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a4f62923b961c..00ee0df020503 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -479,22 +479,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> add_mm_counter(mm, i, rss[i]);
> }
>
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
> {
> - pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> - p4d_t *p4d = p4d_offset(pgd, addr);
> - pud_t *pud = pud_offset(p4d, addr);
> - pmd_t *pmd = pmd_offset(pud, addr);
> - struct address_space *mapping;
> - pgoff_t index;
> static unsigned long resume;
> static unsigned long nr_shown;
> static unsigned long nr_unshown;
> @@ -506,7 +492,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> if (nr_shown == 60) {
> if (time_before(jiffies, resume)) {
> nr_unshown++;
> - return;
> + return true;
> }
> if (nr_unshown) {
> pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -517,15 +503,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> }
> if (nr_shown++ == 0)
> resume = jiffies + 60 * HZ;
> + return false;
> +}
> +
> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> + unsigned long long pgdv, p4dv, pudv, pmdv;
> + pgd_t pgd, *pgdp;
> + p4d_t p4d, *p4dp;
> + pud_t pud, *pudp;
> + pmd_t *pmdp;
> +
> + /*
> + * This looks like a fully lockless walk, however, the caller is
> + * expected to hold the leaf page table lock in addition to other
> + * rmap/mm/vma locks. So this is just a re-walk to dump page table
> + * content while any concurrent modifications should be completely
> + * prevented.
> + */
> + pgdp = pgd_offset(mm, addr);
> + pgd = pgdp_get(pgdp);
> + pgdv = pgd_val(pgd);
Apparently there is something weird here on arm-bcm2835_defconfig:
All errors (new ones prefixed by >>):
>> mm/memory.c:525:6: error: array type 'pgd_t' (aka 'unsigned int[2]')
is not assignable
525 | pgd = pgdp_get(pgdp);
| ~~~ ^
1 error generated.
... apparently because we have this questionable ...
arch/arm/include/asm/pgtable-2level-types.h:typedef pmdval_t pgd_t[2];
I mean, the whole concept of pgdp_get() doesn't make too much sense if
it wants to return an array.
I don't quite understand the "#undef STRICT_MM_TYPECHECKS #ifdef
STRICT_MM_TYPECHECKS" stuff.
Why do we want to make it easier on the compiler while doing something
fairly weird?
CCing arm maintainers: what's going on here? :)
An easy fix would be to not dump the pgd value, but having a
non-functional pgdp_get() really is weird.
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists