Message-ID: <CAOhMmr4dOvA8O8Y_H7z6D+QPNVwHq1D0z3e=h75QdPb9JR=3Rg@mail.gmail.com>
Date: Wed, 23 Dec 2020 14:10:32 -0600
From: Lijun Pan <lijunp213@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Lijun Pan <ljp@...ux.ibm.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net] ibmvnic: continue fatal error reset after passive init
On Wed, Dec 23, 2020 at 10:50 AM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Wed, 23 Dec 2020 02:21:09 -0600 Lijun Pan wrote:
> > On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:
> > > > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> > > > says "If the passive
> > > > CRQ initialization occurs before the FATAL reset task is processed,
> > > > the FATAL error reset task would try to access a CRQ message queue
> > > > that was freed, causing an oops. The problem may be most likely to
> > > > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > > > process will automatically issue a change MTU request.
> > > > Fix this by not processing fatal error reset if CRQ is passively
> > > > initialized after client-driven CRQ initialization fails."
> > > >
> > > > Even with this commit, we still see similar kernel crashes. In order
> > > > to completely solve this problem, we'd better continue the fatal error
> > > > reset, capture the kernel crash, and try to fix it from that end.
> > >
> > > This basically reverts the quoted fix. Does the quoted fix make things
> > > worse? Otherwise we should leave the code be until proper fix is found.
> >
> > Yes, I think the quoted commit makes things worse. It skips the specific
> > reset condition, but that does not fix the problem it claims to fix.
>
> Okay, let's make sure the commit message explains how it makes things
> worse.
I will reword the commit message.
>
> > The effective fix is upstream SHA 0e435befaea4 and a0faaa27c716. So I
> > think reverting it to the original "else" condition is the right thing to do.
>
> Hm. So the problem is fixed? But the commit message says "we still see
> similar kernel crashes", that's present tense suggesting that crashes
> are seen on current net/master. Are you saying that's not the case and
> after 0e435befaea4 and a0faaa27c716 there are no more crashes?
I wrote this patch before I submitted 0e435befaea4 and a0faaa27c716, which is
why the commit message says "we still see similar kernel crashes". I will
update the commit message before I submit v2 of this patch.
With 0e435befaea4 and a0faaa27c716 applied, I no longer see the crashes
described in the quoted commit, even with the quoted commit reverted.
That is why I am sure the quoted commit does not fix the problem it describes,
and why I want to revert it.