netdev - Re: [PATCH net] ibmvnic: continue fatal error reset after passive init

Message-ID: <20201223122407.5f0b8b47@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Wed, 23 Dec 2020 12:24:07 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Lijun Pan <lijunp213@...il.com>
Cc:     Lijun Pan <ljp@...ux.ibm.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net] ibmvnic: continue fatal error reset after passive init

On Wed, 23 Dec 2020 14:10:32 -0600 Lijun Pan wrote:
> On Wed, Dec 23, 2020 at 10:50 AM Jakub Kicinski <kuba@...nel.org> wrote:
> >
> > On Wed, 23 Dec 2020 02:21:09 -0600 Lijun Pan wrote:  
> > > On Tue, Dec 22, 2020 at 8:48 PM Jakub Kicinski <kuba@...nel.org> wrote:  
> > > > On Sat, 19 Dec 2020 15:40:34 -0600 Lijun Pan wrote:  
> > > > > Commit f9c6cea0b385 ("ibmvnic: Skip fatal error reset after passive init")
> > > > > says "If the passive CRQ initialization occurs before the FATAL
> > > > > reset task is processed,
> > > > > the FATAL error reset task would try to access a CRQ message queue
> > > > > that was freed, causing an oops. The problem may be most likely to
> > > > > occur during DLPAR add vNIC with a non-default MTU, because the DLPAR
> > > > > process will automatically issue a change MTU request.
> > > > > Fix this by not processing fatal error reset if CRQ is passively
> > > > > initialized after client-driven CRQ initialization fails."
> > > > >
> > > > > Even with this commit, we still see similar kernel crashes. In order
> > > > > to completely solve this problem, we'd better continue the fatal error
> > > > > reset, capture the kernel crash, and try to fix it from that end.  
> > > >
> > > > This basically reverts the quoted fix. Does the quoted fix make things
> > > > worse? Otherwise we should leave the code be until a proper fix is found.
> > >
> > > Yes, I think the quoted commit makes things worse. It skips the specific
> > > reset condition, but that does not fix the problem it claims to fix.  
> >
> > Okay, let's make sure the commit message explains how it makes things
> > worse.  
> 
> I will reword the commit message.
> 
> > > The effective fixes are upstream commits 0e435befaea4 and a0faaa27c716. So I
> > > think reverting it to the original "else" condition is the right thing to do.  
> >
> > Hm. So the problem is fixed? But the commit message says "we still see
> > similar kernel crashes", that's present tense suggesting that crashes
> > are seen on current net/master. Are you saying that's not the case and
> > after 0e435befaea4 and a0faaa27c716 there are no more crashes?  
> 
> This patch was formed before I submitted 0e435befaea4 and a0faaa27c716, so
> I used the wording "we still see similar kernel crashes". I will modify
> the commit message before I submit v2 of this patch.
> After 0e435befaea4 and a0faaa27c716, I don't see any of the crashes
> described in the quoted commit, even without the quoted commit applied.
> That's why I am sure this quoted commit does not fix the described problem
> and I want to revert it.

I see, that explains it!