Message-ID: <Z45XFfduLMkp5iga@home.paul.comp>
Date: Mon, 20 Jan 2025 17:00:53 +0300
From: Paul Fertser <fercerpav@...il.com>
To: Eddie James <eajames@...ux.ibm.com>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, horms@...nel.org, pabeni@...hat.com,
edumazet@...gle.com, davem@...emloft.net, sam@...dozajonas.com,
Ivan Mikhaylov <fr0st61te@...il.com>
Subject: Re: [PATCH] net/ncsi: Fix NULL pointer derefence if CIS arrives
before SP
Hi Eddie,

Thank you for testing the patch! More inline.

On Fri, Jan 17, 2025 at 03:05:24PM -0600, Eddie James wrote:
> > > On Fri, 10 Jan 2025 13:41:33 -0600 Eddie James wrote:
> > > > If a Clear Initial State response packet is received before the
> > > > Select Package response, then the channel set up will dereference
> > > > the NULL package pointer. Fix this by setting up the package
> > > > in the CIS handler if it's not found.
> >
> > My current notion is that the responses can't normally be re-ordered
> > (as we are supposed to send the next command only after receiving
> > response for the previous one) and so any surprising event like that
> > signifies that the FSM got out of sync (unfortunately it's written in
> > such a way that it switches to the "next state" based on the quantity
> > of responses the current state expected, not on the actual content of
> > them; that's rather fragile).
> >
> > Sending the "Select Package" command is the first thing that is
> > performed after package discovery is complete so problems in that area
> > suggest that the reason might be lack of processing for the response
> > to the last "Package Deselect" command: receiving it would advance the
> > state machine prematurely. It's not quite clear to me how the SP
> > response can be lost altogether or what else happens there in the
> > failure case, unfortunately it's not reproducible on my system so I
> > can't just add more debugging to see all responses and state
> > transitions as they happen.
> >
> > Eddie, how easy is it to reproduce the issue in your setup? Can you
> > please try if the change in [0] makes a difference?
>
> I am able to reproduce the panic at will, and unfortunately your patch does
> not prevent the issue.
>
> However I suspect this issue may be unique to my set up, so my patch may not
> be necessary. I found that I had some user space issues. Fixing userspace
> prevented this issue.
That's an interesting observation. It sounds like you're probably sending
some NCSI commands via netlink in parallel with the in-kernel
configuration process (this detail wasn't at all obvious from the
commit message), and the two race somehow.

But in any case userspace shouldn't be able to crash the kernel, and
responses to netlink-initiated communication should be going back to
netlink rather than getting handled by the ncsi_rsp_handler_* code.

So there must be some insufficient locking or a logic error somewhere
that is worth fixing, especially since you're able to reproduce it.
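
To illustrate the behaviour I'd expect, here is a minimal user-space
sketch. It is a toy model only: every name in it (route_response,
find_package, the enums and structs) is hypothetical and none of it is
taken from net/ncsi. It assumes each outstanding request remembers who
issued it, so that a response to a netlink request is handed back to
netlink, and a response for the in-kernel state machine is dropped
rather than dereferencing a package that was never set up.

```c
#include <stddef.h>

/* Hypothetical toy model, NOT the kernel's net/ncsi data structures. */
enum req_origin { ORIGIN_KERNEL_FSM, ORIGIN_NETLINK };
enum rsp_verdict { RSP_TO_NETLINK, RSP_HANDLED, RSP_DROPPED };

struct package {
	int id;
};

struct request {
	enum req_origin origin;	/* who issued the command */
	int package_id;		/* package the response refers to */
};

#define MAX_PACKAGES 8

/* Packages known so far; NULL means "not set up yet". */
static struct package *packages[MAX_PACKAGES];

/* Look up a package by id, returning NULL if it is unknown. */
static struct package *find_package(int id)
{
	if (id < 0 || id >= MAX_PACKAGES)
		return NULL;
	return packages[id];
}

/* Decide what to do with a response matched to the request 'req'. */
static enum rsp_verdict route_response(const struct request *req)
{
	struct package *np;

	/* Userspace-initiated commands never touch the in-kernel FSM. */
	if (req->origin == ORIGIN_NETLINK)
		return RSP_TO_NETLINK;

	/* An unexpected response must not lead to a NULL dereference. */
	np = find_package(req->package_id);
	if (!np)
		return RSP_DROPPED;

	return RSP_HANDLED;
}
```

In this model a kernel-originated response for an unknown package is
simply dropped; whether the real code should drop it, log it, or
restart the configuration is a separate question.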
--
Be free, use free (http://www.gnu.org/philosophy/free-sw.html) software!
mailto:fercerpav@...il.com