linux-kernel - Re: [PATCH] mm: implement "memory.oops_if_bad

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <525a4060-2c8b-40c5-b4bd-b9c47de94f0f@lucifer.local>
Date: Thu, 10 Jul 2025 17:16:52 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Alexey Dobriyan <adobriyan@...il.com>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, David Hildenbrand <david@...hat.com>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
        Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option

Sorry but no - this seems to me to just be a hack. And it also appears to
violate the rules on BUG_ON() (see [0]) so this is just a no.

[0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/

On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote:
> Implement
>
> 	memory.oops_if_bad_pte=1

This is a totally new paradigm afaict - introducing an oops based on user
input, I really don't think that's sensible.

Unless kernel.panic_on_oops is set this won't necessarily cause anything to
halt. Really you want a panic_on_bad_pte here, but that would be way way
too specific.

So it seems like a hack just so you can get a vmcore?

You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
doing this you're inferring that there's unrecoverable system instability,
which isf clearly not the case.

All of the bad PTE handling seems to be intended to be recoverable and
handled by calling code.

Additionally we have uses like zap_present_folio_ptes() which use it to
output PTE state in the instance of an invalid mapcount value - I don't
think oopsing there would really be what you wanted right?

>
> boot option which oopses the machine instead of dreadful
>
> 	BUG: Bad page map in process
>
> message.

I'm not sure what's so dreadful about it? And why an oops is better?

>
> This is intended
> for people who want to panic at the slightest provocation and
> for people who ruled out hardware problems which in turn means that
> delaying vmcore collection is counter-productive.

Seems to be a specific edge case.

>
> Linux doesn't (never?) panicked on PTE corruption and even implemented
> ratelimited version of the message meaning it can go for minutes and
> even hours without anyone noticing which is exactly the opposite of what
> should be done to facilitate debugging.

But are there real situations you can cite where this has been problematic?

>
> Not enabled by default.

Yeah, obviously.

>
> Not advertised.

Umm why? Seems like you just want to add this for your own very specific
purpose?

>
> Signed-off-by: Alexey Dobriyan <adobriyan@...il.com>
> ---
>
>  mm/memory.c                                 | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..90b92b312802 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/moduleparam.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  			add_mm_counter(mm, i, rss[i]);
>  }
>
> +/*
> + * Oops instead of printing "Bad page map in process" message and
> + * trying to continue.
> + */
> +static bool oops_if_bad_pte __ro_after_init = false;
> +module_param(oops_if_bad_pte, bool, 0444);
> +
>  /*
>   * This function is called to print an error when a bad pte
>   * is found. For example, we might have a PFN-mapped pte in
> @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  			  pte_t pte, struct page *page)
>  {
> +	/*
> +	 * This line is a formality to collect vmcore ASAP. Real bug
> +	 * (hardware or software) happened earlier, current registers and
> +	 * backtrace aren't interesting.
> +	 */
> +	BUG_ON(oops_if_bad_pte);

Except that it won't without panic_on_oops?

I mean we can't just go around putting arbitrary BUG_ON()'s like this for
cases we want data on.

And far worse here - this is a print_xxx() function, and you're making it
oops? That's really bad.

> +
>  	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	pud_t *pud = pud_offset(p4d, addr);
> --
> 2.49.0
>

Note that other page table levels can be 'bad' as well, see pgd_bad() et
al. - none of these will be caught.

Overall I suspect there's one single case you're worried about, that really
you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
get what you want.

If you can make an argument in favour of this that's convincing then that
would be a potentially upstreamable patch, but this one isn't, in my view.