Message-ID: <ZtfNINc0hAJUtNRc@feng-clx.sh.intel.com>
Date: Wed, 4 Sep 2024 10:59:44 +0800
From: Feng Tang <feng.tang@...el.com>
To: John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
<peterz@...radead.org>, <mingo@...hat.com>
CC: <akpm@...ux-foundation.org>, <bristot@...hat.com>, <bsegall@...gle.com>,
<dietmar.eggemann@....com>, <juri.lelli@...hat.com>,
<linux-kernel@...r.kernel.org>, <mgorman@...e.de>, <mingo@...hat.com>,
<peterz@...radead.org>, <rostedt@...dmis.org>, <vbabka@...e.cz>,
<vincent.guittot@...aro.org>, <vschneid@...hat.com>,
<sparclinux@...r.kernel.org>
Subject: Re: sched/debug: Dump end of stack when detected corrupted

Hi Adrian,

On Tue, Sep 03, 2024 at 06:33:55PM +0200, John Paul Adrian Glaubitz wrote:
> Hi Feng,
>
> > When debugging a kernel hang during suspend/resume, there are random
> > memory corruptions in different places like being detected by scheduler
> > with error message:
> >
> > "Kernel panic - not syncing: corrupted stack end detected inside scheduler"
> >
> > Dump the corrupted memory around the stack end will give more direct
> > hints about how the memory is corrupted:
> >
> > "
> > Corrupted Stack: ff11000122770000: ff ff ff ff ff ff 14 91 82 3b 78 e8 08 00 45 00 .........;x...E.
> > Corrupted Stack: ff11000122770010: 00 1d 2a ff 40 00 40 11 98 c8 0a ef 30 2c 0a ef ..*.@.@.....0,..
> > Corrupted Stack: ff11000122770020: 30 ff a2 00 22 3d 00 09 9a 95 2a 00 00 00 00 00 0..."=....*.....
> > ...
> > Kernel panic - not syncing: corrupted stack end detected inside scheduler
> > "
> >
> > And with it, the culprit was quickly identified to be an ethernet
> > driver with its DMA operations.
> >
> > Signed-off-by: Feng Tang <feng.tang@...el.com>
> > ---
> > kernel/sched/core.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a795e030678c..1280f7012bc5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5949,8 +5949,18 @@ static noinline void __schedule_bug(struct task_struct *prev)
> > static inline void schedule_debug(struct task_struct *prev, bool preempt)
> > {
> > #ifdef CONFIG_SCHED_STACK_END_CHECK
> > - if (task_stack_end_corrupted(prev))
> > + if (task_stack_end_corrupted(prev)) {
> > + unsigned long *ptr = end_of_stack(prev);
> > +
> > + /* Dump 16 ulong words around the corruption point */
> > +#ifdef CONFIG_STACK_GROWSUP
> > + ptr -= 15;
> > +#endif
> > + print_hex_dump(KERN_ERR, "Corrupted Stack: ",
> > + DUMP_PREFIX_ADDRESS, 16, 1, ptr, 16 * sizeof(*ptr), 1);
> > +
> > panic("corrupted stack end detected inside scheduler\n");
> > + }
> >
> > if (task_scs_end_corrupted(prev))
> > panic("corrupted shadow stack detected inside scheduler\n");
>
> Have you gotten any feedback on this? Would be nice to get this merged as we're
> seeing crashes due to stack corruption on sparc from time to time and having the
> end of the stack dumped in such cases would make debugging here a bit easier.

Thanks for the review and the feedback! So far I haven't received any
response from the maintainers.

Hi Peter and maintainers,

Could you help review this patch? It can help with debugging those naughty
memory corruption issues. Thanks!

There is a v2 which can be applied to the latest linux-next branch:
https://lore.kernel.org/lkml/20240207143523.438816-1-feng.tang@intel.com/

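For anyone who wants to play with the dump window outside the kernel, below
is a rough user-space sketch of what the quoted hunk does: pick 16
unsigned long words starting at the stack end (or ending at it when the
stack grows up) and print them 16 bytes per row with an ASCII column. This
is not the kernel code. The helper names dump_window_start(), format_row()
and dump_stack_end() are made up for illustration, the grows_up flag stands
in for CONFIG_STACK_GROWSUP, and the row layout only approximates what
print_hex_dump() emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DUMP_WORDS 16	/* the patch dumps 16 unsigned long words */
#define ROW_BYTES  16	/* bytes per row, as in the quoted print_hex_dump() call */

/*
 * Hypothetical helper: first word of the window to dump.
 * With a downward-growing stack the end marker is the lowest word, so the
 * window starts there. With an upward-growing stack the marker is the
 * highest word, so step back 15 words to make it the last one dumped.
 */
const unsigned long *dump_window_start(const unsigned long *stack_end,
				       int grows_up)
{
	assert(stack_end != NULL);
	return grows_up ? stack_end - (DUMP_WORDS - 1) : stack_end;
}

/*
 * Hypothetical helper: format up to ROW_BYTES bytes as "xx xx ...  ascii".
 * Non-printable bytes show up as '.' in the ASCII column.
 * Returns the length of the formatted string.
 */
size_t format_row(char *out, size_t outlen, const unsigned char *p, size_t n)
{
	size_t pos = 0;
	size_t i;

	assert(n <= ROW_BYTES && outlen >= ROW_BYTES * 4 + 2);

	for (i = 0; i < n; i++)
		pos += (size_t)snprintf(out + pos, outlen - pos, "%02x ", p[i]);
	out[pos++] = ' ';
	for (i = 0; i < n; i++)
		out[pos++] = (p[i] >= 0x20 && p[i] < 0x7f) ? (char)p[i] : '.';
	out[pos] = '\0';
	return pos;
}

/* Hypothetical helper: print the whole window, one row per line. */
void dump_stack_end(FILE *fp, const unsigned long *stack_end, int grows_up)
{
	const unsigned char *p =
		(const unsigned char *)dump_window_start(stack_end, grows_up);
	size_t total = DUMP_WORDS * sizeof(unsigned long);
	char row[ROW_BYTES * 4 + 2];
	size_t off;

	for (off = 0; off < total; off += ROW_BYTES) {
		size_t n = total - off < ROW_BYTES ? total - off : ROW_BYTES;

		format_row(row, sizeof(row), p + off, n);
		fprintf(fp, "Corrupted Stack: %p: %s\n",
			(const void *)(p + off), row);
	}
}
```

Calling dump_stack_end(stderr, end, 0) on a buffer of at least 16 words
prints the rows starting at end. With grows_up set, end must have at least
15 valid words below it.
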
- Feng