linux-kernel - Re: [PATCH 4.18 000/123] 4.18.6-stable review

Message-ID: <ec9bdd81-8dcd-38fa-0d96-cd7b1a550e92@roeck-us.net>
Date:   Sat, 8 Sep 2018 20:58:24 -0700
From:   Guenter Roeck <linux@...ck-us.net>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>, patches@...nelci.org,
        Ben Hutchings <ben.hutchings@...ethink.co.uk>,
        lkft-triage@...ts.linaro.org, stable <stable@...r.kernel.org>
Subject: Re: [PATCH 4.18 000/123] 4.18.6-stable review

On 09/05/2018 10:01 AM, Linus Torvalds wrote:
> On Wed, Sep 5, 2018 at 8:34 AM Guenter Roeck <linux@...ck-us.net> wrote:
>>
>> On 09/05/2018 02:01 AM, Greg Kroah-Hartman wrote:
>>>> ---
>>>> [ 9990.754641] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:1:155]
>>>> [ 9990.762601] RIP: 0010:smp_call_function_many+0x208/0x270
>>>> [ 9990.762601] Code: e8 0d d1 77 00 3b 05 cb f0 24 01 0f 83 86 fe ff ff 48 63 d0 49 8b 0c 24 48 03 0c d5 00 f7 11 a7 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4d d0 4c 89 f2 4c 89 ee 44 89
> 
> It's stuck in this loop:
> 
>     loop:
>          pause
>          mov    0x18(%rcx),%edx
>          and    $0x1,%edx
>          jne    loop
> 
> which is csd_lock_wait().
> 
> Judging by the offset in smp_call_function_many(), it's the final one
> (there's two: the other one is part of "csd_lock()"). But that's just
> a guess.
> 
> Anyway, it means that we're waiting for another CPU to finish
> processing an IPI - either a previous one we sent asynchronously (if
> it's the earlier csd_lock() case) or the TLB IPI we just sent and
> we're waiting for completion of.
> 
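
For reference, the wait loop quoted above can be approximated in user space.
This is a minimal sketch, not the kernel source: the names ending in
"_sketch" and the struct layout are assumptions made for illustration, and
C11 atomics stand in for the kernel's own primitives. Only the lock bit
(the $0x1 mask in the disassembly) and the pause-in-a-loop shape are taken
from the trace.

```c
#include <stdatomic.h>

/* Bit tested by "and $0x1,%edx" in the quoted loop. */
#define CSD_FLAG_LOCK 0x01u

/* Hypothetical stand-in for the kernel's per-call data; layout is assumed. */
struct csd_sketch {
    void (*func)(void *);   /* function the remote CPU is asked to run */
    void *info;             /* argument passed to func */
    atomic_uint flags;      /* CSD_FLAG_LOCK is set while a call is in flight */
};

/* Spin-wait hint; corresponds to the "pause" instruction on x86. */
static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/*
 * Spin until the lock bit is cleared. If the other side never clears it
 * (for example because the IPI is never processed), this never returns,
 * which is what the soft lockup report shows.
 */
static void csd_lock_wait_sketch(struct csd_sketch *csd)
{
    while (atomic_load_explicit(&csd->flags, memory_order_acquire) & CSD_FLAG_LOCK)
        cpu_relax_sketch();
}

/* Sender side: wait for any earlier call to finish, then mark the slot busy. */
static void csd_lock_sketch(struct csd_sketch *csd)
{
    csd_lock_wait_sketch(csd);
    atomic_fetch_or_explicit(&csd->flags, CSD_FLAG_LOCK, memory_order_relaxed);
}

/* Receiver side: clear the lock bit once the requested function has run. */
static void csd_unlock_sketch(struct csd_sketch *csd)
{
    atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK, memory_order_release);
}
```

In this sketch the sender calls csd_lock_sketch(), and whoever handles the
request calls csd_unlock_sketch(); a sender that then calls
csd_lock_wait_sketch() makes progress only after that unlock happens.
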
>> Not tested, but I see it in v4.17.19 and in v4.18.6-rc2. Turns out it is
>> related to heavy load, not to suspend/resume. At this point I suspect that
>> it may be an AMD/Ryzen specific problem - it looks like it disappears if I
>> add "kernel.randomize_va_space = 0" to /etc/sysctl.conf. No idea if it is a
>> CPU bug or some AMD specific code problem. I'll try to analyze it further.
> 
> Ouch. Some IPI sending/receiving problem would be very very painful to
> debug if it's hw related.
> 

It turns out this is a well-known problem with Ryzen CPUs:

https://bugzilla.kernel.org/show_bug.cgi?id=196683

Guenter