linux-kernel - Re: [PATCH] mm: implement "memory.oops_if_bad

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <72956765-39e0-445a-b381-6bbc54046544@p183>
Date: Thu, 10 Jul 2025 19:57:00 +0300
From: Alexey Dobriyan <adobriyan@...il.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, David Hildenbrand <david@...hat.com>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option

On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote:
> Sorry but no - this seems to me to just be a hack. And it also appears to
> violate the rules on BUG_ON() (see [0]) so this is just a no.
> 
> [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
> 
> On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote:
> > Implement
> >
> > 	memory.oops_if_bad_pte=1
> 
> This is a totally new paradigm afaict - introducing an oops based on user
> input, I really don't think that's sensible.
> 
> Unless kernel.panic_on_oops is set this won't necessarily cause anything to
> halt. Really you want a panic_on_bad_pte here, but that would be way way
> too specific.
> 
> So it seems like a hack just so you can get a vmcore?
> 
> You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
> doing this you're inferring that there's unrecoverable system instability,
> which isf clearly not the case.
> 
> All of the bad PTE handling seems to be intended to be recoverable and
> handled by calling code.
> 
> Additionally we have uses like zap_present_folio_ptes() which use it to
> output PTE state in the instance of an invalid mapcount value - I don't
> think oopsing there would really be what you wanted right?
> 
> >
> > boot option which oopses the machine instead of dreadful
> >
> > 	BUG: Bad page map in process
> >
> > message.
> 
> I'm not sure what's so dreadful about it?

Because the root cause is unknown, happened at unknown time, dmesg
rotated away and nobody bothered to coredump the machine because it
didn't oops!

> And why an oops is better?

I apologize for stating the obvious but the less time between the bug
and coredump collection the better.

> > This is intended
> > for people who want to panic at the slightest provocation and
> > for people who ruled out hardware problems which in turn means that
> > delaying vmcore collection is counter-productive.
> 
> Seems to be a specific edge case.

Yes, but the option is not enabled by default and costs 2 instructions
on the coldest code path, so...

> > Linux doesn't (never?) panicked on PTE corruption and even implemented
> > ratelimited version of the message meaning it can go for minutes and
> > even hours without anyone noticing which is exactly the opposite of what
> > should be done to facilitate debugging.
> 
> But are there real situations you can cite where this has been problematic?
> 
> >
> > Not enabled by default.
> 
> Yeah, obviously.
> 
> >
> > Not advertised.
> 
> Umm why? Seems like you just want to add this for your own very specific
> purpose?

Sort of, I don't want to patch and unpatch things every time.

> > +/*
> > + * Oops instead of printing "Bad page map in process" message and
> > + * trying to continue.
> > + */
> > +static bool oops_if_bad_pte __ro_after_init = false;
> > +module_param(oops_if_bad_pte, bool, 0444);
> > +
> >  /*
> >   * This function is called to print an error when a bad pte
> >   * is found. For example, we might have a PFN-mapped pte in
> > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> >  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> >  			  pte_t pte, struct page *page)
> >  {
> > +	/*
> > +	 * This line is a formality to collect vmcore ASAP. Real bug
> > +	 * (hardware or software) happened earlier, current registers and
> > +	 * backtrace aren't interesting.
> > +	 */
> > +	BUG_ON(oops_if_bad_pte);
> 
> Except that it won't without panic_on_oops?

Yes, I'll update the comment. it is supposed to be used with
panic_on_oops=1 for maximum effect.

> I mean we can't just go around putting arbitrary BUG_ON()'s like this for
> cases we want data on.

Yes, we can!

> And far worse here - this is a print_xxx() function, and you're making it
> oops? That's really bad.

It's fine because, it is conditional BUG_ON.

> Note that other page table levels can be 'bad' as well, see pgd_bad() et
> al. - none of these will be caught.

Sure, I didn't think much about spreading this option to other places.
It can be spread independently.

> Overall I suspect there's one single case you're worried about, that really
> you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
> get what you want.

Ehh, no. WARN is for home users who can maybe photo the oops and fish it
out of dmesg and make bug report -- so that system survives until their
data are flushed to disk.

I suspect users are very bifurcated: some want to panic always, some
want to panic during QA but not in the field, and then there are users
whose only hope is cellphone camera.

> If you can make an argument in favour of this that's convincing then that
> would be a potentially upstreamable patch, but this one isn't, in my view.