linux-kernel - Re: nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while sock

Message-ID: <170fa0d20803270621k7723ae47n337011beafe87cdb@mail.gmail.com>
Date:	Thu, 27 Mar 2008 09:21:23 -0400
From:	"Mike Snitzer" <snitzer@...il.com>
To:	"Paul Clements" <paul.clements@...eleye.com>
Cc:	nbd-general@...ts.sourceforge.net, linux-kernel@...r.kernel.org
Subject: Re: nbd: Oops because nbd doesn't prevent NBD_CLEAR_SOCK while sock_xmit() is working on a receive

On Thu, Mar 27, 2008 at 8:35 AM, Paul Clements
<paul.clements@...eleye.com> wrote:
> Mike Snitzer wrote:
>
>  > In practice this looks like:
>  >
>  > nbd1: NBD_DISCONNECT
>  > nbd1: Send control failed (result -32)
>  > end_request: I/O error, dev nbd1, sector 0
>  > end_request: I/O error, dev nbd1, sector 8032264
>  > md: super_written gets error=-5, uptodate=0
>  > raid1: Disk failure on nbd1, disabling device.
>  >         Operation continuing on 1 devices
>  > Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
>  >  [<ffffffff88b1e125>] :nbd:sock_xmit+0x9d/0x301
>
>  > The fact that sock_xmit() in receive mode is unprotected seems to be
>  > the WHY a NULL pointer is possible; but I'm still trying to identify
>  > the HOW.
>
>  Do you know who is setting the socket NULL? Is it already NULL when you
>  get to this point? Is it the nbd-client -d? Is it the original
>  nbd-client/kernel that does it? Figuring that out would help narrow down
>  the cause.

I believe that NBD_CLEAR_SOCK from 'nbd-client -d' sets it to NULL.
lo->sock is already NULL on entry to sock_xmit().

So simply checking whether sock_xmit()'s local 'sock' is NULL _should_
avoid any possibility of a NULL pointer Oops, because the local copy
can't become NULL once it has passed the check (thanks to the
sock = lo->sock assignment).  That is, unless I'm missing some place in
the rest of the kernel (outside nbd) that would set a socket to NULL?
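
To make the idea concrete, here is a minimal userspace sketch of the
check described above.  It is not the attached patch and not the real
nbd code: the struct names, the function name sock_xmit_sketch() and the
-EINVAL return value are all assumptions made up for illustration.  It
only models "read lo->sock once into a local, then test the local"; it
does not model socket lifetime or reference counting.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real nbd types. */
struct fake_socket {
	int fd;
};

struct fake_nbd_device {
	struct fake_socket *sock;	/* cleared by something like NBD_CLEAR_SOCK */
};

/*
 * Models the guard: take a single snapshot of lo->sock and test the
 * snapshot rather than re-reading lo->sock later on.
 */
static int sock_xmit_sketch(struct fake_nbd_device *lo)
{
	struct fake_socket *sock = lo->sock;	/* one read into a local */

	if (sock == NULL)
		return -EINVAL;	/* assumed error code: socket already cleared */

	/*
	 * Only the local 'sock' is used from here on, so a later
	 * lo->sock = NULL cannot turn this pointer into NULL.
	 */
	return sock->fd;
}
```

Calling sock_xmit_sketch() on a device whose sock has been set to NULL
returns the error instead of dereferencing the pointer, which is the
behaviour the check is meant to provide.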

The attached patch seems reasonable.  I'll be testing today to verify
it fixes the problem.

>  > But for me this begs the question:  why isn't the nbd_device's socket
>  > always protected during sock_xmit() for both
>  > transmits and receives; rather than just transmits (via tx_lock)!?
>
>  It would deadlock if we held the lock over both. Generally we don't have
>  to worry about receives, since they're always done in the nbd-client
>  process, so we have control over when and how it exits and cleans up.
>  The odd case, as you've discovered, is when another process (nbd-client
>  -d) comes along and starts mucking with the queue and socket. Would
>  "kill -9 <nbd-client-pid>" work for you instead? That is what I use to
>  break the connection, and it's safe, as it tells the original nbd-client
>  to exit (which it does cleanly and safely).

I'm aware tx_lock can't be held over both; I was suggesting maybe
another lock, but that feels like overkill.

I use 'nbd-client -d' and then resort to 'kill -9' only if
'nbd-client -d' returned non-zero.
But it sounds like simply using 'kill -9' could be a near-term
workaround; I'll try this as well and report back.

thanks,
Mike

View attachment "nbd_sock_xmit_oops.patch" of type "text/x-patch" (611 bytes)