Message-ID: <DB6PR05MB4597EC445D1CA5E1737C095AB2940@DB6PR05MB4597.eurprd05.prod.outlook.com>
Date: Tue, 22 May 2018 14:03:35 +0000
From: "Ofer Levi(SW)" <oferle@...lanox.com>
To: Vineet Gupta <Vineet.Gupta1@...opsys.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Meir Lichtinger" <meirl@...lanox.com>,
arcml <linux-snps-arc@...ts.infradead.org>
Subject: RE: ARC compact700 NPS platform - EZ_MachineCheck exception handler
There are two cases to consider for this exception:
> but others can't so continuing despite it is recipe for disaster. Perhaps your chip
> has some spurious Machine check exceptions ?
1. Except for core 0, which runs the Linux OS, all other cores run
packet processing code in ZOL isolation mode. If any of these cores hits the compact 700
0x20 exception, it is logical to assume that all other cores will hit it too.
It seems that in any case we will eventually have to reset the HW and reboot the system.
Even if it is a disaster for the system, it might be beneficial for the user to try to
collect more info for debugging the issue.
> Hmm, but you have to explain why those machine checks are fine !
2. The ARC compact700 instruction set was extended to support fast DMA
operations to various added HW accelerators and new asm ops to support network
packet processing.
In case of an error, some of these instructions are wired to the 0x20 exception.
There is a HW mechanism that partitions the DDR between the Linux OS and the various
accelerators. This mechanism is unaware of the MMU or virtual memory handling.
When an accelerator accesses memory outside its bounds, this exception is hit,
but there is no risk to system stability. A user signal handler can catch it, allowing
easier debugging.
This is one example.
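To illustrate the user signal handler mentioned above, here is a minimal userspace
sketch in C. It assumes the patch further down is applied, so that a non-TLB machine
check taken in user mode is delivered as SIGBUS with si_code BUS_MCEERR_AR (the values
passed to DO_ERROR_INFO). The names is_machine_check, sigbus_handler and
install_sigbus_handler are hypothetical and only for illustration; they are not part of
the NPS code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Fallback in case the libc headers do not expose the constant. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif

/* Hypothetical helper: does this look like the signal the patched
 * EV_MachineCheck path is assumed to raise (SIGBUS + BUS_MCEERR_AR)? */
static int is_machine_check(int signo, int si_code)
{
    return signo == SIGBUS && si_code == BUS_MCEERR_AR;
}

/* State recorded by the handler so the application can inspect it later. */
static volatile sig_atomic_t last_was_machine_check;
static void *volatile last_fault_addr;

/* SA_SIGINFO-style handler: record what happened and log a short note.
 * What to do next (abort, dump accelerator state, ...) is application
 * specific and not shown here. */
static void sigbus_handler(int signo, siginfo_t *info, void *ucontext)
{
    (void)ucontext;
    last_was_machine_check = is_machine_check(signo, info->si_code);
    last_fault_addr = info->si_addr;

    /* write() is async-signal-safe, unlike printf(). */
    static const char msg[] = "SIGBUS caught\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n;
}

/* Install the handler for SIGBUS; returns 0 on success, -1 on error. */
static int install_sigbus_handler(void (*handler)(int, siginfo_t *, void *))
{
    struct sigaction sa;

    assert(handler != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;   /* needed to receive siginfo_t */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

The application would call install_sigbus_handler(sigbus_handler) once at startup, and
after a fault check last_was_machine_check and last_fault_addr to see which access
triggered it.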
> > 1:
> > FAKE_RET_FROM_EXCPN
>
> You don't need this.
When FAKE_RET_FROM_EXCPN is removed, the first EV_MachineCheck exception
causes the core running that thread to stall.
If it is not removed, multiple exceptions are handled and the system seems healthy.
Please note that the exception is generated by accessing an address of one of the NPS
accelerators that is outside its memory space, so no harm to the system is expected.
> Next time please send a real patch so I know right away what was changed.
My apologies, here is the patch, based on linux-4.16.10:
diff -uprN linux-4.16.10/arch/arc/kernel/entry.S linux/arch/arc/kernel/entry.S
--- linux-4.16.10/arch/arc/kernel/entry.S 2018-05-19 11:19:37.000000000 +0300
+++ linux/arch/arc/kernel/entry.S 2018-05-22 14:12:18.065103918 +0300
@@ -106,13 +106,9 @@ ENTRY(EV_MachineCheck)
b ret_from_exception
1:
- ; DEAD END: can't do much, display Regs and HALT
- SAVE_CALLEE_SAVED_USER
-
- GET_CURR_TASK_FIELD_PTR TASK_THREAD, r10
- st sp, [r10, THREAD_CALLEE_REG]
-
- j do_machine_check_fault
+ FAKE_RET_FROM_EXCPN
+ bl do_machine_check
+ b ret_from_exception
END(EV_MachineCheck)
diff -uprN linux-4.16.10/arch/arc/kernel/traps.c linux/arch/arc/kernel/traps.c
--- linux-4.16.10/arch/arc/kernel/traps.c 2018-05-19 11:19:37.000000000 +0300
+++ linux/arch/arc/kernel/traps.c 2018-05-22 14:13:25.162748373 +0300
@@ -86,6 +86,7 @@ DO_ERROR_INFO(SIGBUS, "Invalid Mem Acces
DO_ERROR_INFO(SIGTRAP, "Breakpoint Set", trap_is_brkpt, TRAP_BRKPT)
DO_ERROR_INFO(SIGBUS, "Misaligned Access", do_misaligned_error, BUS_ADRALN)
DO_ERROR_INFO(SIGSEGV, "gcc generated __builtin_trap", do_trap5_error, 0)
+DO_ERROR_INFO(SIGBUS, "Machine Check", do_machine_check, BUS_MCEERR_AR )
/*
* Entry Point for Misaligned Data access Exception, for emulating in software
> -----Original Message-----
> From: Vineet Gupta [mailto:Vineet.Gupta1@...opsys.com]
> Sent: Monday, May 21, 2018 19:59
> To: Ofer Levi(SW) <oferle@...lanox.com>
> Cc: linux-kernel@...r.kernel.org; Meir Lichtinger <meirl@...lanox.com>;
> arcml <linux-snps-arc@...ts.infradead.org>
> Subject: Re: ARC compact700 NPS platform - EZ_MachineCheck exception
> handler
>
> On 05/21/2018 07:14 AM, Ofer Levi(SW) wrote:
> > Resending, due to typo in LKML mail address.
>
> Also please CC linux-snps-arc@...ts.infradead.org for any ARC Linux related
> posts.
>
> >
> > The EV_MachineCheck exception handler is halting the core for
> exceptions
> > which are not tlb_overlap_fault.
> > Since for the NPS platform each core is running a single thread in ZOL (Zero
> > Overhead Linux) isolation mode, we feel that most of the time it is safe to
> > resume execution instead of halting the core.
>
> Most of the time is not good enough when dealing with OS code :-( A
> Machine check excepting implies something went terribly wrong. Some of
> those cases can be handled gracefully (such as duplicate TLB entry), but
> others can't so continuing despite it is recipe for disaster. Perhaps your chip
> has some spurious Machine check exceptions ?
>
> > I would appreciate it if you could review the change below
>
> Next time please send a real patch so I know right away what was changed.
>
> > and let me know
> > what you think, if this change is valid or if we missed or overlooked
> > something.
> > We are not looking to push this change upstream, but will be used on
> some
> > systems.
>
> Hmm, but you have to explain why those machine checks are fine !
>
> >
> > Please see below our implementation after label 1.
> >
> > Thanks
> > Ofer
> >
> > ENTRY(EV_MachineCheck)
> >
> > EXCEPTION_PROLOGUE
> >
> > ...
> > brne r3, ECR_C_MCHK_DUP_TLB, 1f
> >
> > bl do_tlb_overlap_fault
> > b ret_from_exception
> >
> > 1:
> > FAKE_RET_FROM_EXCPN
>
> You don't need this.
>
> > bl do_machine_check ; using DO_ERROR_INFO macro
>
> We don't have above function in code. There's do_machine_check_fault()
> which calls
> die() -> flag 1 - so it would halt the kernel and would never return here.
> So your patch is broken in implementation as well.
>
> > b ret_from_exception
> >
> > END(EV_MachineCheck)
> >
> >