linux-kernel - Re: [PATCH 4.18 000/123] 4.18.6-stable review

Message-ID: <CA+55aFzpnXYHNeog==-aWYO=V_67viGrMah1ZSdcDqcRO214-w@mail.gmail.com>
Date:   Wed, 5 Sep 2018 10:01:12 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Guenter Roeck <linux@...ck-us.net>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>, patches@...nelci.org,
        Ben Hutchings <ben.hutchings@...ethink.co.uk>,
        lkft-triage@...ts.linaro.org, stable <stable@...r.kernel.org>
Subject: Re: [PATCH 4.18 000/123] 4.18.6-stable review

On Wed, Sep 5, 2018 at 8:34 AM Guenter Roeck <linux@...ck-us.net> wrote:
>
> On 09/05/2018 02:01 AM, Greg Kroah-Hartman wrote:
> >> ---
> >> [ 9990.754641] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:1:155]
> >> [ 9990.762601] RIP: 0010:smp_call_function_many+0x208/0x270
> >> [ 9990.762601] Code: e8 0d d1 77 00 3b 05 cb f0 24 01 0f 83 86 fe ff ff 48 63 d0 49 8b 0c 24 48 03 0c d5 00 f7 11 a7 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4d d0 4c 89 f2 4c 89 ee 44 89

It's stuck in this loop:

   loop:
        pause
        mov    0x18(%rcx),%edx
        and    $0x1,%edx
        jne    loop

which is csd_lock_wait().
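
A rough user-space C sketch of that wait loop, for readers who want to
map the four instructions back to source. This is not the kernel's
code: the struct layout, the names ending in _sketch, and the use of
C11 atomics are all assumptions chosen so that the flag word lands at
offset 0x18 on a 64-bit build and bit 0 is the "locked" bit, matching
the disassembly above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed: bit 0 of the flag word means "request still in flight". */
#define CSD_FLAG_LOCK_SKETCH 0x01u

/* Hypothetical layout: three pointer-sized fields, then the flags,
 * which puts flags at offset 0x18 when pointers are 8 bytes wide. */
struct csd_sketch {
    void *next;
    void (*func)(void *info);
    void *info;
    atomic_uint flags;
};

/* Stand-in for the "pause" instruction in the loop. */
static inline void cpu_relax_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/* Spin until the lock bit is clear: load flags, mask bit 0, repeat. */
static void csd_lock_wait_sketch(struct csd_sketch *csd)
{
    while (atomic_load_explicit(&csd->flags, memory_order_acquire)
           & CSD_FLAG_LOCK_SKETCH)
        cpu_relax_sketch();
}
```

If nothing ever clears the bit, this loop never exits, which is the
condition the soft lockup watchdog is reporting.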

Judging by the offset in smp_call_function_many(), it's the final wait
(there are two: the other one is part of "csd_lock()"). But that's
just a guess.

Anyway, it means that we're waiting for another CPU to finish
processing an IPI - either a previous one we sent asynchronously (if
it's the earlier csd_lock() case) or the TLB IPI we just sent and
whose completion we're now waiting for.
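
To make the two cases concrete, here is a minimal single-threaded C
sketch of the handshake, under stated assumptions. All names
(ipi_request, sender_prepare, receiver_handle, sender_done, bump) are
hypothetical and this is not the kernel implementation; it only shows
who sets the lock bit and who is expected to clear it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_FLAG_LOCK 0x01u

/* Hypothetical per-request slot shared by sender and receiver. */
struct ipi_request {
    void (*func)(void *info);
    void *info;
    atomic_uint flags;
};

/* Sender: wait out any earlier asynchronous use of the slot (the
 * "earlier" wait), then mark it busy and fill in the callback. */
static void sender_prepare(struct ipi_request *req,
                           void (*func)(void *), void *info)
{
    while (atomic_load(&req->flags) & SKETCH_FLAG_LOCK)
        ; /* previous request still pending on the other CPU */
    atomic_store(&req->flags, SKETCH_FLAG_LOCK);
    req->func = func;
    req->info = info;
}

/* Receiver: run the callback, then clear the lock bit so that the
 * sender's wait can finish. */
static void receiver_handle(struct ipi_request *req)
{
    req->func(req->info);
    atomic_fetch_and(&req->flags, ~SKETCH_FLAG_LOCK);
}

/* Sender: the "final" wait spins until this returns true. */
static bool sender_done(struct ipi_request *req)
{
    return !(atomic_load(&req->flags) & SKETCH_FLAG_LOCK);
}

/* Example callback: count how many times it ran. */
static void bump(void *info)
{
    ++*(int *)info;
}
```

In this picture, a receiver that never runs receiver_handle() leaves
the bit set, and the sender spins in whichever of the two waits it
reaches first.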

> Not tested, but I see it in v4.17.19 and in v4.18.6-rc2. Turns out it is
> related to heavy load, not to suspend/resume. At this point I suspect that
> it may be an AMD/Ryzen specific problem - it looks like it disappears if I
> add "kernel.randomize_va_space = 0" to /etc/sysctl.conf. No idea if it is a
> CPU bug or some AMD specific code problem. I'll try to analyze it further.

Ouch. Some IPI sending/receiving problem would be very very painful to
debug if it's hw related.

              Linus