lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20181207104311.GA11431@in.ibm.com>
Date:   Fri, 7 Dec 2018 16:13:11 +0530
From:   Gautham R Shenoy <ego@...ux.vnet.ibm.com>
To:     Thiago Jung Bauermann <bauerman@...ux.ibm.com>
Cc:     "Gautham R. Shenoy" <ego@...ux.vnet.ibm.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Nicholas Piggin <npiggin@...il.com>,
        Tyrel Datwyler <tyreld@...ux.vnet.ibm.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>,
        Michael Bringmann <mwb@...ux.vnet.ibm.com>,
        linuxppc-dev@...ts.ozlabs.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while
 waiting for rtas-stop

Hi Thiago,

On Thu, Dec 06, 2018 at 03:28:17PM -0200, Thiago Jung Bauermann wrote:

[..snip..]

> 
> 
> I posted a similar patch last year, but I wasn't able to arrive at a
> root cause analysis like you did:
> 
>
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html

Ah! Nice. So this is a known problem.

> 
> One thing I realized after I posted the patch was that in my case, the
> CPU was crashing inside RTAS. From the NIP and LR in the trace above it
> looks like it's crashing in RTAS in your case as well.
> 
> Michael Ellerman had two comments on my patch:
> 
> 1. Regardless of the underlying bug, the kernel shouldn't crash so we
> need a patch making it more resilient to this failure.
> 
> 2. The wait loop should use udelay() so that the loop will actually take
> a set amount of wall time, rather than just cycles.
> 
> Regarding 1. if the problem is that the kernel is causing RTAS to crash
> because it calls it in a way that's unsupported, then I don't see how we
> can make the kernel more resilient. We have to make the kernel respect
> RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
> gracefuly in these conditions).


I agree that the Kernel has to respect RTAS's restriction. The PAPR
v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:

    "The stop-self service needs to be serialized with calls to the
     stop-self, start-cpu, and set-power-level services. The OS must
     be able to call RTAS services on other processors while the
     processor is stopped or being stopped"

Thus the onus is on the OS to ensure that there are no concurrent rtas
calls with "stop-self" token.

> 
> Regarding 2. I implemented a new version of my patch (posted below) but
> I was never able to test it because I couldn't access a system where the
> problem was reproducible anymore.
> 
> There's also a race between the CPU driving the unplug and the CPU being
> unplugged which I think is not easy for the CPU being unplugged to win,
> which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
> the race in the patch description.
> 
> My solution to make the race less tight is to make the CPU driving the
> unplug to only start the busy loop only after the CPU being unplugged is
> in the CPU_STATE_OFFLINE state. At that point, we know that it either is
> about to call RTAS or it already has.

Ah, yes this is good optimization. Though, I think we ought to
unconditionally wait until the target CPU has woken up from CEDE and
changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
would have caught it in dlpar_offline_cpu() itself.

> 
> Do you think this makes sense? If you do, would you mind testing my
> patch? You can change the timeouts and delays if you want. To be honest
> they're just guesses on my part...

Sure. I will test the patch and report back.

--
Thanks and Regards
gautham.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ