linux-kernel - Re: [PATCH v2 2/2] x86/kgdb: fix hang on failed breakpoint removal

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <383e12bebba34afc3779aad14deb3a43e9cbb5d9.camel@florommel.de>
Date: Tue, 13 Aug 2024 17:06:45 +0200
From: Florian Rommel <mail@...rommel.de>
To: Daniel Thompson <daniel.thompson@...aro.org>, Thomas Gleixner
	 <tglx@...utronix.de>
Cc: Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave
 Hansen <dave.hansen@...ux.intel.com>, "H . Peter Anvin" <hpa@...or.com>,
 Jason Wessel <jason.wessel@...driver.com>, Douglas Anderson
 <dianders@...omium.org>,  Lorena Kretzschmar <qy15sije@....cs.fau.de>,
 Stefan Saecherl <stefan.saecherl@....de>, Peter Zijlstra
 <peterz@...radead.org>, Christophe JAILLET <christophe.jaillet@...adoo.fr>,
  Randy Dunlap <rdunlap@...radead.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>,
 Christophe Leroy <christophe.leroy@...roup.eu>,  Geert Uytterhoeven
 <geert+renesas@...der.be>, kgdb-bugreport@...ts.sourceforge.net,
 x86@...nel.org,  linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 2/2] x86/kgdb: fix hang on failed breakpoint removal

On Tue, 2024-08-13 at 12:31 +0100, Daniel Thompson wrote:
> On Mon, Aug 12, 2024 at 11:04:13PM +0200, Thomas Gleixner wrote:
> 
> > 
> > Btw, kgdb_skipexception() is a gross nisnomer because it rewinds the
> > instruction pointer to the exception address and does not skip anything,
> > but that's an orthogonal issue though it could be cleaned up along the
> > way...
> 
> kgdb_skipexception() is not a directive from debug core to architecture.
> It is a question to the archictecture: should the debug core skip normal
> debug exception handling and resume immediately?.
> 
> It allows an architecture to direct the debug core to ignore a spurious
> trap that, according to the comments, can occur on some
> (micro)architectures when we have already restored the original
> not-a-breakpoint instruction.
> 
> Florian's patch changes things so that we will also skip debug exception
> handling if we can successfully poke to the text section. I don't think
> it is sufficient on its own since the text_mutex could be owned by the
> core that is stuck trapping on the int3 that we can't remove.

Yes, if the text_mutex is owned by the trapping core, that's a inherent
problem. This won't be solved by my patch. But this will probably be
difficult to be solved at all..

The patch only adds an removal attempt, without changing the result of
kgdb_skipexception(). It was thought as a best-effort solution: Better
try removing the breakpoint than getting stuck in the int3 loop for
sure. Of course, there is no guarantee for this being successful.

> 
> 
> > >  {
> > >  	if (exception == 3 && kgdb_isremovedbreak(regs->ip - 1)) {
> > > +		struct kgdb_bkpt *bpt;
> > > +		int i, error;
> > > +
> > >  		regs->ip -= 1;
> > > +
> > > +		/*
> > > +		 * Try to remove the spurious int3 instruction.
> > > +		 * These int3s can result from failed breakpoint removals
> > > +		 * in kgdb_arch_remove_breakpoint.
> > > +		 */
> > > +		for (bpt = NULL, i = 0; i < KGDB_MAX_BREAKPOINTS; i++) {
> > > +			if (kgdb_break[i].bpt_addr == regs->ip &&
> > > +			    kgdb_break[i].state == BP_REMOVED &&
> > > +			    (kgdb_break[i].type == BP_BREAKPOINT ||
> > > +			     kgdb_break[i].type == BP_POKE_BREAKPOINT)) {
> > > +				bpt = &kgdb_break[i];
> > > +				break;
> > > +			}
> > > +		}
> > 
> > Seriously? The KGBD core already walked that array in
> > kgdb_isremovedbreak() just so you can walk it again here.
> > 
> > struct kkgdb_bkpt *kgdb_get_removed_breakpoint(unsigned long addr)
> > {
> >         struct kgdb_bkpt *bp = kgdb_break;
> > 
> > 	for (int i = 0; i < KGDB_MAX_BREAKPOINTS; i++, bp++) {
> > 		if (bp>.state == BP_REMOVED && bp->kgdb_bpt_addr == addr)
> > 			return bp;
> > 	}
> > 	return NULL;
> > }
> > 
> > bool kgdb_isremovedbreak(unsigned long addr)
> > {
> >         return !!kgdb_get_removed_breakpoint(addr);
> > }
> > 
> > bool kgdb_rewind_exception(int exception, struct pt_regs *regs)
> > {
> >         struct kgdb_bkpt *bp;
> > 
> > 	if (exception != 3)
> >         	return false;
> > 
> >         bp = kgdb_get_removed_breakpoint(--regs->ip);
> >         if (!bp || !bp->type == BP_BREAKPOINT)
> >         	return false;
> > 
> > Hmm?
> > 
> > > +		error = kgdb_arch_remove_breakpoint(bpt);
> > > +		if (error)
> > > +			pr_err("skipexception: breakpoint remove failed: %lx\n",
> > > +			       bpt->bpt_addr);
> > 
> > Lacks curly brackets. See Documentation.
> > 
> > 	return !error;
> > 
> > Aside of that the same problem exists on PowerPC.  So you can move the
> > attempt to remove the breakpoint into the generic code, no?
> 
> Getting the debug core to track breakpoints that are stuck on would be a
> good improvement.

Do you mean tracking failed removals with an extra kgdb_bptype variant?

> 
> We would be able to use that logic to retry the removal of stuck
> breakpoint once other SMP cores are running again. That would be great
> for architectures like arm64 that use spin-with-irqsave locking in their
> noinstr text_poke() machinery.
> 
> However it won't solve the problem for x86 since it uses mutex locking.
> 
> A partial solution might be to get kgdb_arch_remove_breakpoint() to
> disregard the text_mutex completely if kdb_continue_catastrophic is set
> (and adding stuck breakpoints to the reasons to inhibit a continue).
> This is a partial solution in the sense that it is not safe: we will
> simply tell the kernel dev driving the debugger that they are
> responsible for the safety of the continue and then disable the safety
> rails.
> 
> I haven't yet come up with a full fix that doesn't require
> text_poke_kgdb() to not require text_mutex to be free. I did note that
> comment in __text_poke() that says calling get_locked_pte() "is not
> really needed, but this allows to avoid open-coding" but I got a bit lost
> trying to figure out other locks and nesting concerns.
> 

This is a bit off-topic, but isn't setting software breakpoints in
combination with other text modifications unsafe anyway? If we remove a
breakpoint in a code location that has been modified in the meantime,
we would restore an outdated saved_instr. What am I missing here?

Best regards,
Flo