linux-kernel - Re: [PATCH] trace: skip hwasan

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <aa6ef1c4-5073-4748-0c3c-b2d10c5bcdb7@arm.com>
Date:   Thu, 21 Feb 2019 14:19:19 +0000
From:   James Morse <james.morse@....com>
To:     Will Deacon <will.deacon@....com>,
        Dmitry Vyukov <dvyukov@...gle.com>
Cc:     Qian Cai <cai@....pw>, Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...hat.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Andrey Konovalov <andreyknvl@...gle.com>,
        Andrey Ryabinin <aryabinin@...tuozzo.com>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        kasan-dev <kasan-dev@...glegroups.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] trace: skip hwasan

Hi!

On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]

Decode is a strong term!

This stuff is printed by Cavium's secure-world software. All I'm doing is spotting the
bits that vary between the out we've seen!


> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@....pw> wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
>>>> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@....pw> wrote:
>>>>>
>>>>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>>>>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>>>>> because there is a burst of too much pointer access, and then KASAN will
>>>>> dereference each byte of the shadow address for the tag checking which
>>>>> will kill all the CPUs.
>>>>
>>>> Could you please elaborate what exactly happens and who/why kills
>>>> CPUs? Number of memory accesses should not make any difference.
>>>> With hardware support (MTE) it won't be possible to disable
>>>> instrumentation (loads and stores check tags themselves), so it would
>>>> be useful to keep track of exact reasons we disable instrumentation to
>>>> know how to deal with them with hardware support.
>>>> It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).

> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.

Not quite like this. I've had one choke on some PCIe transaction[0].

This looks like corruption detected in a cache associated with a CPU. 'Write back' and
'Physical Address' suggests its the data cache:


>>>  Node 0 NBU 0 Error report :
>>>  NBU BAR Error
[..]
>>>       Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>>         Agent info : CPU
>>>             Core ID : 21
>>>             Thread ID : 1
>>>         Requ: type : 4 : Write Back
>>>  Node 0 NBU 1 Error report :
>>>  NBU BAR Error
[..]
>>>       Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>>         Agent info : CPU
>>>             Core ID : 21
>>>             Thread ID : 1
>>>         Requ: type : 4 : Write Back
>>>  Node 0 NBU 2 Error report :
>>>  NBU BAR Error
[..]
>>>       Physical Address : 0x40011ff80

If you can reproduce it, and it always affects Core:21,Thread:1 I'd suggest offline-ing
all the threads/CPUs in that core. It may be one cache is close to some threshold, and you
can offline the core that its part of.


Thanks,

James


[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
#        Agent info : IO
#                   : PCIE0
#        Requ: type : 2 : Read