linux-kernel - RE: [PATCH] hv_sock: Fix data loss upon socket close

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <BN6PR21MB0465E3E8C3C386D21647A237C0080@BN6PR21MB0465.namprd21.prod.outlook.com>
Date:   Tue, 14 May 2019 20:40:56 +0000
From:   Sunil Muthuswamy <sunilmut@...rosoft.com>
To:     Dexuan Cui <decui@...rosoft.com>,
        KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        Sasha Levin <sashal@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Michael Kelley <mikelley@...rosoft.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] hv_sock: Fix data loss upon socket close



> -----Original Message-----
> From: Dexuan Cui <decui@...rosoft.com>
> Sent: Friday, May 10, 2019 8:57 PM
> To: Sunil Muthuswamy <sunilmut@...rosoft.com>; KY Srinivasan <kys@...rosoft.com>; Haiyang Zhang <haiyangz@...rosoft.com>;
> Stephen Hemminger <sthemmin@...rosoft.com>; Sasha Levin <sashal@...nel.org>; David S. Miller <davem@...emloft.net>;
> Michael Kelley <mikelley@...rosoft.com>
> Cc: netdev@...r.kernel.org; linux-hyperv@...r.kernel.org; linux-kernel@...r.kernel.org
> Subject: RE: [PATCH] hv_sock: Fix data loss upon socket close
> 
> > From: Sunil Muthuswamy <sunilmut@...rosoft.com>
> > Sent: Wednesday, May 8, 2019 4:11 PM
> >
> > Currently, when a hvsock socket is closed, the socket is shutdown and
> > immediately a RST is sent. There is no wait for the FIN packet to arrive
> > from the other end. This can lead to data loss since the connection is
> > terminated abruptly. This can manifest easily in cases of a fast guest
> > hvsock writer and a much slower host hvsock reader. Essentially hvsock is
> > not following the proper STREAM(TCP) closing handshake mechanism.
> 
> Hi Sunil,
> It looks to me the above description is inaccurate.
> 
> In the upstream Linux kernel, closing a hv_sock file descriptor may hang
> in vmbus_hvsock_device_unregister() -> msleep(), until the host side of
> the connection is closed. This is bad and should be fixed, but I don't think
> the current code can cause data loss: when Linux calls hvs_destruct() ->
> vmbus_hvsock_device_unregister() -> vmbus_device_unregister() -> ...
> -> vmbus_close() to close the channel, Linux knows the host app has
> already called close(), and normally that means the host app has
> received all the data from the connection.
> 
> BTW, technically speaking, in hv_sock there is no RST packet, while there
> is indeed a payload_len==0 packet, which is similar to TCP FIN.
> 
> I think by saying "a RST is sent" you mean Linux VM closes the channel.
> 
> > The fix involves adding support for the delayed close of hvsock, which is
> > in-line with other socket providers such as virtio.
> 
> With this "delayed close" patch, Linux's close() won't hang until the host
> also closes the connection. This is good!
> 
The next version of the patch will only focus on implementing the delayed
(or background) close logic. I will update the title and the description of the
next version patch to more accurately reflect the change. 
> > While closing, the
> > socket waits for a constant timeout, for the FIN packet to arrive from the
> > other end. On timeout, it will terminate the connection (i.e a RST).
> 
> As I mentioned above, I suppose the "RST" means Linux closes the channel.
> 
> When Linux closes a connection, the FIN packet is written into the shared
> guest-to-host channel ringbuffer immediately, so the host is able to see it
> immediately, but the real question is: what if the host kernel and/or host app
> can not (timely) receive the data from the ringbuffer, inclding the FIN?
> 
> Does the host kernel guarantee it *always* timely fetches/caches all the
> data from a connection, even if the host app has not accept()'d the
> conection, or the host app is reading from the connection too slowly?
> 
> If the host doesn't guarantee that, then even with this patch there is still
> a chance Linux can time out, and close the channel before the host
> finishes receiving all the data.
> 
TCP protocol does not guarantee that all the data gets delivered, especially
during close. The applications are required to mitigate this by using a
combination of shutdown() and subsequent read() on both client and server
side.
> I'm curious how Windows guest implements the "async close"?
> Does Windows guest also use the same timeout strategy here? If yes,
> what's the timeout value used?
> 
Windows also implements the delayed close logic in a similar fashion. You can
lookup the MSDN article on 'graceful shutdown'.
> > diff --git a/net/vmw_vsock/hyperv_transport.c
> > b/net/vmw_vsock/hyperv_transport.c
> > index a827547..62b986d 100644
> 
> Sorry, I need more time to review the rest of patch. Will try to reply ASAP.
> 
> > -static int hvs_update_recv_data(struct hvsock *hvs)
> > +static int hvs_update_recv_data(struct vsock_sock *vsk)
> >  {
> >  	struct hvs_recv_buf *recv_buf;
> >  	u32 payload_len;
> > +	struct hvsock *hvs = vsk->trans;
> >
> >  	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
> >  	payload_len = recv_buf->hdr.data_size;
> > @@ -543,8 +591,12 @@ static int hvs_update_recv_data(struct hvsock *hvs)
> >  	if (payload_len > HVS_MTU_SIZE)
> >  		return -EIO;
> >
> > -	if (payload_len == 0)
> > +	/* Peer shutdown */
> > +	if (payload_len == 0) {
> > +		struct sock *sk = sk_vsock(vsk);
> >  		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
> > +		sk->sk_state_change(sk);
> > +	}
> 
> Can you please explain why we need to call this sk->sk_state_change()?
> 
> When we call hvs_update_recv_data(), we hold the lock_sock(sk) lock, and we
> know there is at least one byte to read. Since we hold the lock, the other
> code paths, which normally are also requried to acquire the lock before
> checking vsk->peer_shutdown, can not race with us.
> 
This was to wakeup any waiters on socket state change. Since the updated
patch now only focuses on adding the delayed close logic, I will remove this in
the next version.
Thanks for the review so far.
> Thanks,
> -- Dexuan