linux-kernel - Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu

Message-Id: <74b2262b-48e7-b2a5-7d20-dc7f590958d9@linux.vnet.ibm.com>
Date:   Mon, 14 Jan 2019 12:11:44 -0600
From:   Michael Bringmann <mwb@...ux.vnet.ibm.com>
To:     ego@...ux.vnet.ibm.com,
        Thiago Jung Bauermann <bauerman@...ux.ibm.com>
Cc:     linux-kernel@...r.kernel.org, Nicholas Piggin <npiggin@...il.com>,
        Tyrel Datwyler <tyreld@...ux.vnet.ibm.com>,
        linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while
 waiting for rtas-stop

On 1/9/19 12:08 AM, Gautham R Shenoy wrote:

> I did some testing during the holidays. Here are the observations:
> 
> 1) With just your patch (without any additional debug patch), if I run
> DLPAR on/off operations on a system that has SMT=off, I am able to
> see a crash involving RTAS stack corruption within an hour's time.
> 
> 2) With the debug patch (appended below) which has additional debug to
> capture the callers of stop-self, start-cpu, set-power-levels, the
> system is able to perform DLPAR on/off operations on a system with
> SMT=off for three days. And then, it crashed with the dead CPU showing
> a "Bad kernel stack pointer". From this log, I can clearly
> see that there were no concurrent calls to stop-self, start-cpu,
> set-power-levels. The only concurrent RTAS calls were the dying CPU
> calling "stop-self", and the CPU running the DLPAR operation calling
> "query-cpu-stopped-state". The crash signature is appended below as
> well.
> 
> 3) Modifying your patch to remove the udelay and increase the loop
> count from 25 to 1000 doesn't improve the situation. We are still able
> to see the crash.
> 
> 4) With my patch, even without any additional debug, I was able to
> observe the system run the tests successfully for over a week (I
> started the tests before the Christmas weekend, and forgot to turn it
> off!)

So does this mean that the problem is fixed with your patch?

> 
> It appears that there is a narrow race window involving rtas-stop-self
> and rtas-query-cpu-stopped-state calls that can be observed with your
> patch. Adding any printk's seems to reduce the probability of hitting
> this race window. It might be worthwhile to check with the RTAS
> folks whether they suspect something here.

What would the RTAS folks be looking at here?  The 'narrow race window'
is with respect to a patch that it sounds like we should not be using.

Thanks.
Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@...ux.vnet.ibm.com