Date:	Mon, 30 Jan 2012 18:23:06 +0200
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Ian Campbell <Ian.Campbell@...rix.com>
Cc:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	Neil Brown <neilb@...e.de>,
	"J. Bruce Fields" <bfields@...ldses.org>,
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>
Subject: Re: [PATCH v3 6/6] sunrpc: use SKB fragment destructors to delay
 completion until page is released by network stack.

On Mon, Jan 30, 2012 at 03:51:53PM +0000, Ian Campbell wrote:
> On Thu, 2012-01-26 at 13:11 +0000, Michael S. Tsirkin wrote:
> > On Wed, Jan 25, 2012 at 12:27:14PM +0000, Ian Campbell wrote:
> > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > completes and the userspace process can continue, potentially modifying data
> > > referenced by the retransmission before the retransmission occurs.
> > > 
> > > Signed-off-by: Ian Campbell <ian.campbell@...rix.com>
> > > Acked-by: Trond Myklebust <Trond.Myklebust@...app.com>
> > > Cc: "David S. Miller" <davem@...emloft.net>
> > > Cc: Neil Brown <neilb@...e.de>
> > > Cc: "J. Bruce Fields" <bfields@...ldses.org>
> > > Cc: linux-nfs@...r.kernel.org
> > > Cc: netdev@...r.kernel.org
> > 
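For context, since patches 1-5 of the series are not quoted here: the
completion-delay mechanism rests on a refcounted destructor attached to the
skb's paged fragments. Below is a minimal user-space model of that pattern,
with names mirrored from this patch's usage; the real struct and helpers are
defined earlier in the series and may differ in detail.

	#include <stdatomic.h>
	#include <stdio.h>

	/* Modelled on the fields this patch uses: a refcount plus a
	 * callback invoked once the network stack has released every
	 * paged fragment. */
	struct skb_frag_destructor {
		atomic_int ref;
		int (*destroy)(struct skb_frag_destructor *d);
	};

	static void skb_frag_destructor_unref(struct skb_frag_destructor *d)
	{
		/* Last reference dropped: no skb references the pages any
		 * more, so it is now safe to let write() complete. */
		if (atomic_fetch_sub(&d->ref, 1) == 1)
			d->destroy(d);
	}

	/* Stands in for xprt_complete_skb_pages(): wake the task
	 * sleeping in call_complete(). */
	static int complete_pages(struct skb_frag_destructor *d)
	{
		printf("pages released; waking RPC task\n");
		return 0;
	}

	int main(void)
	{
		struct skb_frag_destructor dtor = { .destroy = complete_pages };

		atomic_init(&dtor.ref, 1);        /* sender's ref, cf. xprt_request_init() */
		atomic_fetch_add(&dtor.ref, 1);   /* stack takes a ref for an in-flight fragment */

		skb_frag_destructor_unref(&dtor); /* fragment released, e.g. after the ACK */
		skb_frag_destructor_unref(&dtor); /* sender's unref in call_transmit()/call_decode() */
		return 0;
	}
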
> > This doesn't include either of the two options you proposed to address
> > the sender-blocked-forever-by-receiver issue with bridged setups and
> > endpoints such as a tap device or a socket on the same box,
> > does it?
> 
> There was never any response to Bruce's question:
> http://thread.gmane.org/gmane.linux.network/210873/focus=44849
> 
>         Stupid question: Is it a requirement that you be safe against DOS by a
>         rogue process with a tap device?  (And if so, does current code satisfy
>         that requirement?)
> 
> IMHO the answer to both questions is no: there are plenty of ways a
> rogue process with a tap device can wreak havoc.

I thought the answer was an obvious yes :(
What are these ways a tap device can wreak havoc?

> > How about patching __skb_queue_tail to do a deep copy?
> > That would seem to handle both tap and socket cases -
> > any other ones left?
> 
> Wouldn't that mean we were frequently (almost always) copying for lots
> of other cases too? That would rather defeat the purpose of being able
to hand pages off to the network stack in a zero-copy fashion.

Yes, but the case of an RPC connection to the same box
is very rare, I think; it's not worth optimizing for.

> > If we do this, I think it would be beneficial to pass a flag
> > to the destructor indicating that a deep copy was done:
> > this would allow senders to detect that and adapt.
> 
> If you were doing a deep copy anyway you might as well create a
> completely new skb and release the old one, thereby causing the
> destructors to fire in the normal way for that SKB. The copy wouldn't have
> destructors because the pages would no longer be owned by the sender.
> 
> Ian.

What I mean is that page pin + deep copy might be more expensive
than copying directly. So the owner of the original skb
cares whether we did a deep copy or whether zero-copy transmit worked.
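To make the proposal concrete, it amounts to something like the sketch
below. This is hypothetical: skb_copy(), consume_skb() and
__skb_queue_tail() are existing kernel APIs, but the destructor accessor
and the 'copied' flag are stand-ins for whatever the series would actually
add.

	static void skb_queue_tail_copy(struct sk_buff_head *list,
					struct sk_buff *skb)
	{
		/* Assumed accessor for the fragment destructor, if any. */
		struct skb_frag_destructor *d = skb_frag_destructor(skb);

		if (d) {
			struct sk_buff *copy = skb_copy(skb, GFP_ATOMIC);

			if (copy) {
				d->copied = true;  /* proposed "deep copy done" flag */
				consume_skb(skb);  /* drops the fragment refs,
						    * firing the destructor with
						    * the flag set */
				skb = copy;
			}
			/* On allocation failure fall through and queue the
			 * original; the sender's pages stay pinned longer. */
		}
		__skb_queue_tail(list, skb);
	}

Note that skb_copy() copies the data into memory the sender never owned, so
the copy carries no destructor at all, which is Ian's point above.
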

> > 
> > > ---
> > >  include/linux/sunrpc/xdr.h  |    2 ++
> > >  include/linux/sunrpc/xprt.h |    5 ++++-
> > >  net/sunrpc/clnt.c           |   27 ++++++++++++++++++++++-----
> > >  net/sunrpc/svcsock.c        |    3 ++-
> > >  net/sunrpc/xprt.c           |   12 ++++++++++++
> > >  net/sunrpc/xprtsock.c       |    3 ++-
> > >  6 files changed, 44 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
> > > index a20970e..172f81e 100644
> > > --- a/include/linux/sunrpc/xdr.h
> > > +++ b/include/linux/sunrpc/xdr.h
> > > @@ -16,6 +16,7 @@
> > >  #include <asm/byteorder.h>
> > >  #include <asm/unaligned.h>
> > >  #include <linux/scatterlist.h>
> > > +#include <linux/skbuff.h>
> > >  
> > >  /*
> > >   * Buffer adjustment
> > > @@ -57,6 +58,7 @@ struct xdr_buf {
> > >  			tail[1];	/* Appended after page data */
> > >  
> > >  	struct page **	pages;		/* Array of contiguous pages */
> > > +	struct skb_frag_destructor *destructor;
> > >  	unsigned int	page_base,	/* Start of page data */
> > >  			page_len,	/* Length of page data */
> > >  			flags;		/* Flags for data disposition */
> > > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> > > index 15518a1..75131eb 100644
> > > --- a/include/linux/sunrpc/xprt.h
> > > +++ b/include/linux/sunrpc/xprt.h
> > > @@ -92,7 +92,10 @@ struct rpc_rqst {
> > >  						/* A cookie used to track the
> > >  						   state of the transport
> > >  						   connection */
> > > -	
> > > +	struct skb_frag_destructor destructor;	/* SKB paged fragment
> > > +						 * destructor for
> > > +						 * transmitted pages*/
> > > +
> > >  	/*
> > >  	 * Partial send handling
> > >  	 */
> > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > > index f0268ea..06f363f 100644
> > > --- a/net/sunrpc/clnt.c
> > > +++ b/net/sunrpc/clnt.c
> > > @@ -61,6 +61,7 @@ static void	call_reserve(struct rpc_task *task);
> > >  static void	call_reserveresult(struct rpc_task *task);
> > >  static void	call_allocate(struct rpc_task *task);
> > >  static void	call_decode(struct rpc_task *task);
> > > +static void	call_complete(struct rpc_task *task);
> > >  static void	call_bind(struct rpc_task *task);
> > >  static void	call_bind_status(struct rpc_task *task);
> > >  static void	call_transmit(struct rpc_task *task);
> > > @@ -1115,6 +1116,8 @@ rpc_xdr_encode(struct rpc_task *task)
> > >  			 (char *)req->rq_buffer + req->rq_callsize,
> > >  			 req->rq_rcvsize);
> > >  
> > > +	req->rq_snd_buf.destructor = &req->destructor;
> > > +
> > >  	p = rpc_encode_header(task);
> > >  	if (p == NULL) {
> > >  		printk(KERN_INFO "RPC: couldn't encode RPC header, exit EIO\n");
> > > @@ -1278,6 +1281,7 @@ call_connect_status(struct rpc_task *task)
> > >  static void
> > >  call_transmit(struct rpc_task *task)
> > >  {
> > > +	struct rpc_rqst *req = task->tk_rqstp;
> > >  	dprint_status(task);
> > >  
> > >  	task->tk_action = call_status;
> > > @@ -1311,8 +1315,8 @@ call_transmit(struct rpc_task *task)
> > >  	call_transmit_status(task);
> > >  	if (rpc_reply_expected(task))
> > >  		return;
> > > -	task->tk_action = rpc_exit_task;
> > > -	rpc_wake_up_queued_task(&task->tk_xprt->pending, task);
> > > +	task->tk_action = call_complete;
> > > +	skb_frag_destructor_unref(&req->destructor);
> > >  }
> > >  
> > >  /*
> > > @@ -1385,7 +1389,8 @@ call_bc_transmit(struct rpc_task *task)
> > >  		return;
> > >  	}
> > >  
> > > -	task->tk_action = rpc_exit_task;
> > > +	task->tk_action = call_complete;
> > > +	skb_frag_destructor_unref(&req->destructor);
> > >  	if (task->tk_status < 0) {
> > >  		printk(KERN_NOTICE "RPC: Could not send backchannel reply "
> > >  			"error: %d\n", task->tk_status);
> > > @@ -1425,7 +1430,6 @@ call_bc_transmit(struct rpc_task *task)
> > >  			"error: %d\n", task->tk_status);
> > >  		break;
> > >  	}
> > > -	rpc_wake_up_queued_task(&req->rq_xprt->pending, task);
> > >  }
> > >  #endif /* CONFIG_SUNRPC_BACKCHANNEL */
> > >  
> > > @@ -1591,12 +1595,14 @@ call_decode(struct rpc_task *task)
> > >  		return;
> > >  	}
> > >  
> > > -	task->tk_action = rpc_exit_task;
> > > +	task->tk_action = call_complete;
> > >  
> > >  	if (decode) {
> > >  		task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
> > >  						      task->tk_msg.rpc_resp);
> > >  	}
> > > +	rpc_sleep_on(&req->rq_xprt->pending, task, NULL);
> > > +	skb_frag_destructor_unref(&req->destructor);
> > >  	dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
> > >  			task->tk_status);
> > >  	return;
> > > @@ -1611,6 +1617,17 @@ out_retry:
> > >  	}
> > >  }
> > >  
> > > +/*
> > > + * 8.	Wait for pages to be released by the network stack.
> > > + */
> > > +static void
> > > +call_complete(struct rpc_task *task)
> > > +{
> > > +	dprintk("RPC: %5u call_complete result %d\n",
> > > +		task->tk_pid, task->tk_status);
> > > +	task->tk_action = rpc_exit_task;
> > > +}
> > > +
> > >  static __be32 *
> > >  rpc_encode_header(struct rpc_task *task)
> > >  {
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index 0e6b919..aa7f108 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -198,7 +198,8 @@ int svc_send_common(struct socket *sock, struct xdr_buf *xdr,
> > >  	while (pglen > 0) {
> > >  		if (slen == size)
> > >  			flags = 0;
> > > -		result = kernel_sendpage(sock, *ppage, NULL, base, size, flags);
> > > +		result = kernel_sendpage(sock, *ppage, xdr->destructor,
> > > +					 base, size, flags);
> > >  		if (result > 0)
> > >  			len += result;
> > >  		if (result != size)
> > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > index c64c0ef..2137eb6 100644
> > > --- a/net/sunrpc/xprt.c
> > > +++ b/net/sunrpc/xprt.c
> > > @@ -1101,6 +1101,16 @@ static inline void xprt_init_xid(struct rpc_xprt *xprt)
> > >  	xprt->xid = net_random();
> > >  }
> > >  
> > > +static int xprt_complete_skb_pages(struct skb_frag_destructor *destroy)
> > > +{
> > > +	struct rpc_rqst	*req =
> > > +		container_of(destroy, struct rpc_rqst, destructor);
> > > +
> > > +	dprintk("RPC: %5u completing skb pages\n", req->rq_task->tk_pid);
> > > +	rpc_wake_up_queued_task(&req->rq_xprt->pending, req->rq_task);
> > 
> > It seems that the sunrpc module can get unloaded
> > at this point. The code for 'return 0' can then get
> > overwritten. Right?
> > 
> > Not sure how important making module unloading robust
> > is - if we wanted to fix this we'd need to move
> > the callback code into core (hopefully generalizing
> > it somewhat).
> > 
> > > +	return 0;
> > > +}
> > > +
> > >  static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> > >  {
> > >  	struct rpc_rqst	*req = task->tk_rqstp;
> > > @@ -1113,6 +1123,8 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> > >  	req->rq_xid     = xprt_alloc_xid(xprt);
> > >  	req->rq_release_snd_buf = NULL;
> > >  	xprt_reset_majortimeo(req);
> > > +	atomic_set(&req->destructor.ref, 1);
> > > +	req->destructor.destroy = &xprt_complete_skb_pages;
> > >  	dprintk("RPC: %5u reserved req %p xid %08x\n", task->tk_pid,
> > >  			req, ntohl(req->rq_xid));
> > >  }
> > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > index 38b2fec..5406977 100644
> > > --- a/net/sunrpc/xprtsock.c
> > > +++ b/net/sunrpc/xprtsock.c
> > > @@ -408,7 +408,8 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
> > >  		remainder -= len;
> > >  		if (remainder != 0 || more)
> > >  			flags |= MSG_MORE;
> > > -		err = sock->ops->sendpage(sock, *ppage, NULL, base, len, flags);
> > > +		err = sock->ops->sendpage(sock, *ppage, xdr->destructor,
> > > +					  base, len, flags);
> > >  		if (remainder == 0 || err != len)
> > >  			break;
> > >  		sent += err;
> > > -- 
> > > 1.7.2.5
> > > 
> 
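Returning to the module-unload race noted inline above (sunrpc's 'return 0'
being freed out from under the callback): one hedged sketch of the fix
Michael hints at, with the unref path moved into core so the final
module_put() runs outside sunrpc. The ->owner field is an assumption, not
part of this series; __module_get(), module_put() and atomic_dec_and_test()
are the standard kernel APIs for this pattern.

	/* Assumed extension of the destructor from patches 1-5: */
	struct skb_frag_destructor {
		atomic_t ref;
		struct module *owner;	/* assumed new field */
		int (*destroy)(struct skb_frag_destructor *d);
	};

	/* In core networking (e.g. net/core/skbuff.c), not in sunrpc: */
	void skb_frag_destructor_unref(struct skb_frag_destructor *d)
	{
		if (!atomic_dec_and_test(&d->ref))
			return;
		d->destroy(d);
		/* The put runs in core code after destroy() has returned,
		 * so the module providing the callback cannot be unloaded
		 * while its 'return 0' is still executing. */
		module_put(d->owner);
	}

	/* sunrpc side, e.g. in xprt_request_init(): pin the module while
	 * a callback can still be invoked. */
	static void xprt_arm_destructor(struct rpc_rqst *req)
	{
		__module_get(THIS_MODULE);
		req->destructor.owner = THIS_MODULE;
		atomic_set(&req->destructor.ref, 1);
		req->destructor.destroy = &xprt_complete_skb_pages;
	}
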