Message-ID: <170fa0d20803270621k7723ae47n337011beafe87cdb@mail.gmail.com>
Date: Thu, 27 Mar 2008 09:21:23 -0400
From: "Mike Snitzer" <snitzer@...il.com>
To: "Paul Clements" <paul.clements@...eleye.com>
Cc: nbd-general@...ts.sourceforge.net, linux-kernel@...r.kernel.org
Subject: Re: nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while sock_xmit() is working on a receive
On Thu, Mar 27, 2008 at 8:35 AM, Paul Clements
<paul.clements@...eleye.com> wrote:
> Mike Snitzer wrote:
>
> > In practice this looks like:
> >
> > nbd1: NBD_DISCONNECT
> > nbd1: Send control failed (result -32)
> > end_request: I/O error, dev nbd1, sector 0
> > end_request: I/O error, dev nbd1, sector 8032264
> > md: super_written gets error=-5, uptodate=0
> > raid1: Disk failure on nbd1, disabling device.
> > Operation continuing on 1 devices
> > Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
> > [<ffffffff88b1e125>] :nbd:sock_xmit+0x9d/0x301
>
> > The fact that sock_xmit() in receive mode is unprotected seems to be
> > the WHY a NULL pointer is possible; but I'm still trying to identify
> > the HOW.
>
> Do you know who is setting the socket NULL? Is it already NULL when you
> get to this point? Is it the nbd-client -d? Is it the original
> nbd-client/kernel that does it? Figuring that out would help narrow down
> the cause.

I believe that NBD_CLEAR_SOCK from 'nbd-client -d' sets it to NULL.
lo->sock is already NULL on entry to sock_xmit().

So simply checking whether sock_xmit()'s 'sock' is NULL _should_ avoid
any possibility of a NULL pointer Oops, because the local 'sock' can't
become NULL after that check (it is a copy made by the sock = lo->sock
assignment).
That is, unless I'm missing somewhere in the rest of the kernel (not
nbd) that would take action to set a socket to NULL?

The attached patch seems reasonable. I'll be testing today to verify
it fixes the problem.

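To make the idea concrete, here is a minimal userspace sketch of the
check described above. It is not the attached patch: the struct and
function names (fake_socket, fake_nbd_device, sock_xmit_sketch) and the
-EINVAL return value are assumptions made up for illustration. It only
shows the pattern of reading lo->sock once into a local and testing that
local before any dereference.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real nbd structures. */
struct fake_socket {
    int fd;
};

struct fake_nbd_device {
    struct fake_socket *sock;   /* may be cleared by another context */
};

/*
 * Sketch of the guard: read the shared pointer once, then use only the
 * local copy. Returns the fd on success, or -EINVAL (an assumed error
 * code) when the socket has already been cleared.
 */
int sock_xmit_sketch(struct fake_nbd_device *lo)
{
    struct fake_socket *sock = lo->sock;   /* single read of lo->sock */

    if (sock == NULL)
        return -EINVAL;

    /* From here on only 'sock' is dereferenced, never lo->sock again. */
    return sock->fd;
}
```

Note that a check like this only avoids the NULL dereference; it says
nothing about whether the socket object is still valid once another
process has cleared lo->sock.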
> > But for me this begs the question: why isn't the nbd_device's socket
> > always protected during sock_xmit() for both
> > transmits and receives; rather than just transmits (via tx_lock)!?
>
> It would deadlock if we held the lock over both. Generally we don't have
> to worry about receives, since they're always done in the nbd-client
> process, so we have control over when and how it exits and cleans up.
> The odd case, as you've discovered, is when another process (nbd-client
> -d) comes along and starts mucking with the queue and socket. Would
> "kill -9 <nbd-client-pid>" work for you instead? That is what I use to
> break the connection, and it's safe, as it tells the original nbd-client
> to exit (which it does cleanly and safely).

I'm aware tx_lock can't be held over both; I was suggesting maybe
another lock but that feels like overkill.

I use 'nbd-client -d' and then resort to 'kill -9' IFF 'nbd-client -d'
returned non-zero.

But it sounds like simply using 'kill -9' could be a near-term
workaround, I'll try this as well and will report back.

thanks,
Mike

View attachment "nbd_sock_xmit_oops.patch" of type "text/x-patch" (611 bytes)