linux-kernel - Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu

Message-Id: <f93ec711-a7c6-e754-2002-2dad2a893005@linux.vnet.ibm.com>
Date:   Tue, 11 Dec 2018 16:11:28 -0600
From:   Michael Bringmann <mwb@...ux.vnet.ibm.com>
To:     Thiago Jung Bauermann <bauerman@...ux.ibm.com>
Cc:     ego@...ux.vnet.ibm.com, linux-kernel@...r.kernel.org,
        Nicholas Piggin <npiggin@...il.com>,
        Tyrel Datwyler <tyreld@...ux.vnet.ibm.com>,
        linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while
 waiting for rtas-stop

Note from Scott Mayes on latest crash:

Michael,

Since the partition crashed, I was able to get the last 0.2 seconds' worth of RTAS call trace leading up to the crash.

Best I could tell from that bit of trace was that the removal of a processor involved the following steps:
-- Call to stop-self for a given thread
-- Repeated calls to query-cpu-stopped-state (which eventually indicated the thread was stopped)
-- Call to get-sensor-state for the thread to check its entity-state (9003) sensor which returned 'dr-entity-present'
-- Call to set-indicator to set the isolation-state (9001) indicator to ISOLATE state
-- Call to set-indicator to set the allocation-state (9003) indicator to UNUSABLE state

I noticed one example of thread x28 getting through all of these steps just fine, but for thread x20, although the
query-cpu-stopped-state call returned status 0 (STOPPED), a subsequent call to set-indicator to set ISOLATE
failed.  This failure was near the end of the trace, but was not the very last RTAS call made in the trace.
The set-indicator failure reported to Linux was a -9001 (Valid outstanding translation), which was mapped
from a 0x502 (Invalid thread state) return code from PHYP's H_SET_DR_STATE h-call.
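
As a reading aid, the removal sequence and the x20 failure described above can be
modelled with a small, self-contained sketch. It is not kernel or firmware code:
FakeFirmware, remove_thread, the stale_threads knob and the string results are all
hypothetical names made up for illustration. The trace does not say why ISOLATE was
rejected for x20, so the "reported STOPPED but not accepted as stopped" condition is
purely an assumption used to reproduce the observed return code.

```python
# Token and status values are the ones quoted in the note; the Python names are
# illustrative only.
ISOLATION_STATE = 9001          # set-indicator: isolation-state
ALLOCATION_STATE = 9003         # set-indicator: allocation-state
ENTITY_SENSE = 9003             # get-sensor-state token cited in the note
QCSS_STOPPED = 0                # query-cpu-stopped-state: thread is stopped
QCSS_STOPPING = 1               # assumed "still stopping" status for this sketch
RTAS_SUCCESS = 0
RTAS_VALID_TRANSLATION = -9001  # "Valid outstanding translation"
H_INVALID_THREAD_STATE = 0x502  # H_SET_DR_STATE return code seen in the trace


class FakeFirmware:
    """Toy stand-in for RTAS/PHYP that records the calls made to it."""

    def __init__(self, polls_until_stopped=3, stale_threads=()):
        self.polls_until_stopped = polls_until_stopped
        # ASSUMPTION: threads that report STOPPED yet are rejected on ISOLATE.
        self.stale_threads = set(stale_threads)
        self.polls = {}
        self.last_hcall_rc = None
        self.trace = []

    def stop_self(self, thread):
        # On real hardware the dying thread issues this call itself.
        self.trace.append(("stop-self", thread))
        self.polls[thread] = 0

    def query_cpu_stopped_state(self, thread):
        self.trace.append(("query-cpu-stopped-state", thread))
        self.polls[thread] += 1
        if self.polls[thread] >= self.polls_until_stopped:
            return QCSS_STOPPED
        return QCSS_STOPPING

    def get_sensor_state(self, token, thread):
        self.trace.append(("get-sensor-state", thread))
        return "dr-entity-present"

    def set_indicator(self, token, thread, state):
        self.trace.append(("set-indicator", thread, state))
        if state == "ISOLATE" and thread in self.stale_threads:
            # The note says 0x502 from the h-call was mapped to -9001.
            self.last_hcall_rc = H_INVALID_THREAD_STATE
            return RTAS_VALID_TRANSLATION
        return RTAS_SUCCESS


def remove_thread(fw, thread, max_polls=25):
    """Walk the five steps listed in the note and report where they stop."""
    fw.stop_self(thread)
    for _ in range(max_polls):
        if fw.query_cpu_stopped_state(thread) == QCSS_STOPPED:
            break
    else:
        return "timeout"
    if fw.get_sensor_state(ENTITY_SENSE, thread) != "dr-entity-present":
        return "not-present"
    rc = fw.set_indicator(ISOLATION_STATE, thread, "ISOLATE")
    if rc != RTAS_SUCCESS:
        return f"isolate-failed({rc})"
    rc = fw.set_indicator(ALLOCATION_STATE, thread, "UNUSABLE")
    if rc != RTAS_SUCCESS:
        return f"unusable-failed({rc})"
    return "removed"


fw = FakeFirmware(stale_threads={0x20})
print(remove_thread(fw, 0x28))  # thread x28 completes every step
print(remove_thread(fw, 0x20))  # thread x20 fails at ISOLATE
```

In this model x28 completes all five steps, while x20 stops at the ISOLATE call even
though query-cpu-stopped-state already returned 0, which matches the shape of the
trace but says nothing about the underlying cause.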

On 12/10/2018 02:31 PM, Thiago Jung Bauermann wrote:
> 
> Hello Michael,
> 
> Michael Bringmann <mwb@...ux.vnet.ibm.com> writes:
> 
>> I have asked Scott Mayes to take a look at one of these crashes from
>> the phyp side.  I will let you know if he finds anything notable.
> 
> Thanks! It might make sense to test whether booting with
> cede_offline=off makes the bug go away.

Scott is looking at the system.  I will try once he is finished.

> 
> One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
> From what I understand, it is a powerpc-specific CPU state and from the
> perspective of the generic CPU hotplug state machine, inactive CPUs are
> already fully offline. Which means that the locking performed by the
> generic code state machine doesn't apply to transitioning CPUs from
> INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
> one CPU making that transition at the same time? That would cause two
> CPUs to call RTAS stop-self.
> 
> I haven't checked whether this is really possible or not, though. It's
> just a conjecture.

Michael

> 
> --
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@...ux.vnet.ibm.com