Message-ID: <aa6ef1c4-5073-4748-0c3c-b2d10c5bcdb7@arm.com>
Date: Thu, 21 Feb 2019 14:19:19 +0000
From: James Morse <james.morse@....com>
To: Will Deacon <will.deacon@....com>,
Dmitry Vyukov <dvyukov@...gle.com>
Cc: Qian Cai <cai@....pw>, Steven Rostedt <rostedt@...dmis.org>,
Ingo Molnar <mingo@...hat.com>,
Catalin Marinas <catalin.marinas@....com>,
Andrey Konovalov <andreyknvl@...gle.com>,
Andrey Ryabinin <aryabinin@...tuozzo.com>,
Linux ARM <linux-arm-kernel@...ts.infradead.org>,
kasan-dev <kasan-dev@...glegroups.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] trace: skip hwasan
Hi!
On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]
Decode is a strong term!
This stuff is printed by Cavium's secure-world software. All I'm doing is spotting the
bits that vary between the outputs we've seen!
> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@....pw> wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
>>>> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@....pw> wrote:
>>>>>
>>>>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>>>>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>>>>> because there is a burst of too much pointer access, and then KASAN will
>>>>> dereference each byte of the shadow address for the tag checking which
>>>>> will kill all the CPUs.
>>>>
>>>> Could you please elaborate what exactly happens and who/why kills
>>>> CPUs? Number of memory accesses should not make any difference.
>>>> With hardware support (MTE) it won't be possible to disable
>>>> instrumentation (loads and stores check tags themselves), so it would
>>>> be useful to keep track of exact reasons we disable instrumentation to
>>>> know how to deal with them with hardware support.
>>>> It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).
> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.
Not quite like this. I've had one choke on some PCIe transaction[0].
This looks like corruption detected in a cache associated with a CPU. 'Write Back' and
'Physical Address' suggest it's the data cache:
>>> Node 0 NBU 0 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 1 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 2 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff80
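The "spot the bits that vary" comparison can be sketched in a few lines of Python. This
is only an illustration: the field names are copied from the output quoted above, while
parse_reports() and varying_fields() are hypothetical helpers, and the layout of the
firmware's output is assumed from these samples alone.

```python
import re

# Field names as they appear in the quoted firmware output.
FIELD = re.compile(
    r"^\s*(Physical Address|Core ID|Thread ID|Agent info|Requ: type)\s*:\s*(.+?)\s*$"
)
HEADER = re.compile(r"Node\s+(\d+)\s+NBU\s+(\d+)\s+Error report")


def parse_reports(text):
    """Split pasted console output into one dict per 'NBU ... Error report'."""
    reports = []
    for raw in text.splitlines():
        # Drop mail quoting ('>>> ') and surrounding whitespace.
        line = raw.lstrip("> ").rstrip()
        header = HEADER.search(line)
        if header:
            reports.append({"node": int(header.group(1)),
                            "nbu": int(header.group(2))})
            continue
        field = FIELD.match(line)
        if field and reports:
            reports[-1][field.group(1)] = field.group(2)
    return reports


def varying_fields(reports):
    """Return the names of the fields that differ between the reports."""
    keys = set().union(*(r.keys() for r in reports))
    return sorted(k for k in keys if len({r.get(k) for r in reports}) > 1)
```

Feeding it two or more reports and calling varying_fields() lists the fields worth
looking at; anything it does not list was identical in every report.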
If you can reproduce it, and it always affects Core:21,Thread:1, I'd suggest offlining
all the threads/CPUs in that core. It may be that one cache is close to some threshold,
in which case you can offline the core that it's part of.
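A minimal sketch of offlining every thread in one core through the sysfs CPU hotplug
interface follows. It assumes the firmware's "Core ID" matches the core_id that Linux
exposes under /sys/devices/system/cpu/cpuN/topology/, which may not be the case, so check
the mapping first. cpus_in_core() and offline_cpus() are hypothetical helper names.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def cpus_in_core(core_id, package_id=0, root=SYSFS_CPU):
    """Return the logical CPU numbers that sit in the given core and package."""
    matches = []
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        topo = cpu_dir / "topology"
        try:
            core = int((topo / "core_id").read_text())
            pkg = int((topo / "physical_package_id").read_text())
        except (OSError, ValueError):
            continue  # e.g. a CPU that is already offline has no topology files
        if core == core_id and pkg == package_id:
            matches.append(int(cpu_dir.name[3:]))
    return sorted(matches)


def offline_cpus(cpus, root=SYSFS_CPU, dry_run=True):
    """Write 0 to each CPU's 'online' file; only print the plan when dry_run."""
    for cpu in cpus:
        ctl = Path(root) / f"cpu{cpu}" / "online"
        if dry_run:
            print(f"would write 0 to {ctl}")
        else:
            ctl.write_text("0")
```

Run as root, offline_cpus(cpus_in_core(21), dry_run=False) would take the matching CPUs
offline; the default dry run only prints what it would do.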
Thanks,
James
[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
# Agent info : IO
# : PCIE0
# Requ: type : 2 : Read