Message-Id: <20190117060317.GA6897@in.ibm.com>
Date: Thu, 17 Jan 2019 11:33:17 +0530
From: Gautham R Shenoy <ego@...ux.vnet.ibm.com>
To: Michael Bringmann <mwb@...ux.vnet.ibm.com>
Cc: ego@...ux.vnet.ibm.com,
Thiago Jung Bauermann <bauerman@...ux.ibm.com>,
linux-kernel@...r.kernel.org, Nicholas Piggin <npiggin@...il.com>,
Tyrel Datwyler <tyreld@...ux.vnet.ibm.com>,
linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while
waiting for rtas-stop
Hello Michael,
On Mon, Jan 14, 2019 at 12:11:44PM -0600, Michael Bringmann wrote:
> On 1/9/19 12:08 AM, Gautham R Shenoy wrote:
>
> > I did some testing during the holidays. Here are the observations:
> >
> > 1) With just your patch (without any additional debug patch), if I run
> > DLPAR on/off operations on a system that has SMT=off, I am able to
> > see a crash involving RTAS stack corruption within an hour's time.
> >
> > 2) With the debug patch (appended below) which has additional debug to
> > capture the callers of stop-self, start-cpu, set-power-levels, the
> > system is able to perform DLPAR on/off operations on a system with
> > SMT=off for three days. And then, it crashed with the dead CPU showing
> > a "Bad kernel stack pointer". From this log, I can clearly
> > see that there were no concurrent calls to stop-self, start-cpu,
> > set-power-levels. The only concurrent RTAS calls were the dying CPU
> > calling "stop-self", and the CPU running the DLPAR operation calling
> > "query-cpu-stopped-state". The crash signature is appended below as
> > well.
> >
> > 3) Modifying your patch to remove the udelay and increase the loop
> > count from 25 to 1000 doesn't improve the situation. We are still able
> > to see the crash.
> >
> > 4) With my patch, even without any additional debug, I was able to
> > observe the system run the tests successfully for over a week (I
> > started the tests before the Christmas weekend, and forgot to turn it
> > off!)
>
> So does this mean that the problem is fixed with your patch?
No. On the contrary, I think my patch is unable to exploit the
possible race window. From a technical point of view, Thiago's patch
does the right things:
- It waits for the target CPU to come out of CEDE and set its state
  to CPU_STATE_OFFLINE.
- Only then does it make the "query-cpu-stopped-state" rtas call in
  a loop, with a sufficient delay between successive queries. This
  reduces unnecessary rtas calls.
In my patch, I don't do any of this, but simply keep making the
"query-cpu-stopped-state" call in a loop for 4000 iterations.
That said, if I incorporate the wait for the target CPU to set its
state to CPU_STATE_OFFLINE into my patch and then make the
"query-cpu-stopped-state" rtas call, then even I am able to get the
crash with the "RTAS CALL BUFFER CORRUPTION" message.
I think that incorporating the wait for the target CPU to set its
state to CPU_STATE_OFFLINE increases the probability that the
"query-cpu-stopped-state" and "stop-self" rtas calls are made more or
less at the same time. However, since concurrent invocations of these
rtas calls are allowed by PAPR, this should not result in an
"RTAS CALL BUFFER CORRUPTION". Am I missing something here?
>
> >
> > It appears that there is a narrow race window involving rtas-stop-self
> > and rtas-query-cpu-stopped-state calls that can be observed with your
> > patch. Adding any printk's seems to reduce the probability of hitting
> > this race window. It might be worthwhile to check with the RTAS
> > folks whether they suspect something here.
>
> What would the RTAS folks be looking at here? The 'narrow race window'
> is with respect to a patch that it sounds like we should not be
> using.
IMHO, if the race window exists, it would be better to confirm and fix
it, given that we have a patch that can exploit it consistently.
>
> Thanks.
> Michael
>
> --
> Michael W. Bringmann
> Linux Technology Center
> IBM Corporation
> Tie-Line 363-5196
> External: (512) 286-5196
> Cell: (512) 466-0650
> mwb@...ux.vnet.ibm.com
--
Thanks and Regards
gautham.