linux-kernel - Re: sched/debug: Dump end of stack when detected corrupted


Message-ID: <ZtfNINc0hAJUtNRc@feng-clx.sh.intel.com>
Date: Wed, 4 Sep 2024 10:59:44 +0800
From: Feng Tang <feng.tang@...el.com>
To: John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
	<peterz@...radead.org>, <mingo@...hat.com>
CC: <akpm@...ux-foundation.org>, <bristot@...hat.com>, <bsegall@...gle.com>,
	<dietmar.eggemann@....com>, <juri.lelli@...hat.com>,
	<linux-kernel@...r.kernel.org>, <mgorman@...e.de>, <mingo@...hat.com>,
	<peterz@...radead.org>, <rostedt@...dmis.org>, <vbabka@...e.cz>,
	<vincent.guittot@...aro.org>, <vschneid@...hat.com>,
	<sparclinux@...r.kernel.org>
Subject: Re: sched/debug: Dump end of stack when detected corrupted

Hi Adrian, 

On Tue, Sep 03, 2024 at 06:33:55PM +0200, John Paul Adrian Glaubitz wrote:
> Hi Feng,
> 
> > When debugging a kernel hang during suspend/resume, random memory
> > corruptions showed up in different places, for example one detected by
> > the scheduler with the error message:
> > 
> >   "Kernel panic - not syncing: corrupted stack end detected inside scheduler"
> > 
> > Dumping the corrupted memory around the stack end gives more direct
> > hints about how the memory was corrupted:
> > 
> >  "
> >  Corrupted Stack: ff11000122770000: ff ff ff ff ff ff 14 91 82 3b 78 e8 08 00 45 00  .........;x...E.
> >  Corrupted Stack: ff11000122770010: 00 1d 2a ff 40 00 40 11 98 c8 0a ef 30 2c 0a ef  ..*.@.@.....0,..
> >  Corrupted Stack: ff11000122770020: 30 ff a2 00 22 3d 00 09 9a 95 2a 00 00 00 00 00  0..."=....*.....
> >  ...
> >  Kernel panic - not syncing: corrupted stack end detected inside scheduler
> >  "
> > 
> > And with it, the culprit was quickly identified to be an ethernet
> > driver with its DMA operations.
> > 
> > Signed-off-by: Feng Tang <feng.tang@...el.com>
> > ---
> >  kernel/sched/core.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a795e030678c..1280f7012bc5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5949,8 +5949,18 @@ static noinline void __schedule_bug(struct task_struct *prev)
> >  static inline void schedule_debug(struct task_struct *prev, bool preempt)
> >  {
> >  #ifdef CONFIG_SCHED_STACK_END_CHECK
> > -	if (task_stack_end_corrupted(prev))
> > +	if (task_stack_end_corrupted(prev)) {
> > +		unsigned long *ptr = end_of_stack(prev);
> > +
> > +		/* Dump 16 ulong words around the corruption point */
> > +#ifdef CONFIG_STACK_GROWSUP
> > +		ptr -= 15;
> > +#endif
> > +		print_hex_dump(KERN_ERR, "Corrupted Stack: ",
> > +			DUMP_PREFIX_ADDRESS, 16, 1, ptr, 16 * sizeof(*ptr), 1);
> > +
> >  		panic("corrupted stack end detected inside scheduler\n");
> > +	}
> >  
> >  	if (task_scs_end_corrupted(prev))
> >  		panic("corrupted shadow stack detected inside scheduler\n");
> 
> Have you gotten any feedback on this? Would be nice to get this merged as we're
> seeing crashes due to stack corruption on sparc from time to time and having the
> end of the stack dumped in such cases would make debugging here a bit easier.

Thanks for the review and the feedback! So far I haven't received any
response from the maintainers.

Hi Peter and maintainers,

Could you help review this patch, which can help with debugging those naughty
memory corruption issues? Thanks!

There is a v2 that applies to the latest linux-next branch:
https://lore.kernel.org/lkml/20240207143523.438816-1-feng.tang@intel.com/

- Feng
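
To illustrate why such a dump points at a network driver, here is a minimal
user-space sketch (not kernel code and not part of the patch) that reads the
dumped bytes from the quoted example as an Ethernet frame. The interpretation
as Ethernet/IPv4/UDP is an assumption drawn from the byte values themselves;
the message only says the culprit was an ethernet driver doing DMA. The names
corrupted_stack_dump, struct frame_info, be16(), ipv4_header_csum_ok() and
decode_frame() are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The 43 non-zero leading bytes of the "Corrupted Stack" dump quoted above. */
const uint8_t corrupted_stack_dump[] = {
	0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x14, 0x91,
	0x82, 0x3b, 0x78, 0xe8, 0x08, 0x00, 0x45, 0x00,
	0x00, 0x1d, 0x2a, 0xff, 0x40, 0x00, 0x40, 0x11,
	0x98, 0xc8, 0x0a, 0xef, 0x30, 0x2c, 0x0a, 0xef,
	0x30, 0xff, 0xa2, 0x00, 0x22, 0x3d, 0x00, 0x09,
	0x9a, 0x95, 0x2a,
};

/* Hypothetical result structure, for illustration only. */
struct frame_info {
	uint16_t ethertype;	/* bytes 12..13 of the Ethernet header */
	uint8_t  ip_proto;	/* IPv4 protocol field */
	uint16_t ip_total_len;	/* IPv4 total length */
	uint16_t udp_dst_port;	/* UDP destination port */
	uint16_t udp_len;	/* UDP length (header + payload) */
	int      ip_csum_ok;	/* 1 if the IPv4 header checksum verifies */
};

/* Read a big-endian (network order) 16-bit value. */
uint16_t be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* One's-complement sum over the header must fold to 0xffff. */
int ipv4_header_csum_ok(const uint8_t *ip, size_t hdr_len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < hdr_len; i += 2)
		sum += be16(ip + i);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum == 0xffff;
}

/* Returns 0 if buf parses as Ethernet + IPv4 + UDP, -1 otherwise. */
int decode_frame(const uint8_t *buf, size_t len, struct frame_info *out)
{
	const uint8_t *ip, *udp;
	size_t ihl;

	if (len < 14 + 20)
		return -1;
	out->ethertype = be16(buf + 12);
	if (out->ethertype != 0x0800)		/* not IPv4 */
		return -1;

	ip = buf + 14;
	ihl = (size_t)(ip[0] & 0x0f) * 4;	/* header length in bytes */
	if ((ip[0] >> 4) != 4 || ihl < 20 || len < 14 + ihl + 8)
		return -1;
	out->ip_total_len = be16(ip + 2);
	out->ip_proto = ip[9];
	out->ip_csum_ok = ipv4_header_csum_ok(ip, ihl);
	if (out->ip_proto != 17)		/* not UDP */
		return -1;

	udp = ip + ihl;
	out->udp_dst_port = be16(udp + 2);
	out->udp_len = be16(udp + 4);
	return 0;
}
```

Calling decode_frame(corrupted_stack_dump, sizeof(corrupted_stack_dump), &fi)
on these bytes succeeds: the first six bytes are a broadcast destination MAC,
the ethertype is 0x0800, the IPv4 header checksum verifies, and the total
length (29) plus the 14-byte Ethernet header matches the 43 non-zero bytes.
A well-formed packet sitting at the stack end is what suggests a DMA write
landing in the wrong place rather than an ordinary stack overflow.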