linux-kernel - Re: [PATCH] Persistent grant maps for xen blk drivers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120919141138.GC12367@phenom.dumpdata.com>
Date:	Wed, 19 Sep 2012 10:11:38 -0400
From:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
To:	Oliver Chick <oliver.chick@...rix.com>
Cc:	xen-devel@...ts.xen.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Persistent grant maps for xen blk drivers

On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This
> allows the I/O performance to scale better when a large number of VMs
> are performing I/O.
> 
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O and wrote
> directly into the grant-mapped memory and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales
> badly with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
> (at which point 650,000 IOPS are being performed in total). If more
> than 5 guests are used, the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
> 
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.
> 
> On startup, blk{front,back} use xenbus to communicate their ability to
> perform 'feature-persistent'. Iff both ends have this ability,
> persistent mode is used.
> 
> To perform a read, in persistent mode, blkfront uses a separate pool
> of pages that it maps to dom0. When a request comes in, blkfront
> transmutes the request so that blkback will write into one of these
> free pages. Blkback keeps note of which grefs it has already
> mapped. When a new ring request comes to blkback, it looks to see if
> it has already mapped that page. If so, it will not map it again. If
> the page hasn't been previously mapped, it is mapped now, and a record
> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> notified that blkback has completed a request, it memcpy's from the
> shared memory, into the bvec supplied. A record that the {gref, page}
> tuple is mapped, and not inflight is kept.
> 
> Writes are similar, except that the memcpy is peformed from the
> supplied bvecs, into the shared pages, before the request is put onto
> the ring.
> 
> Blkback has to store a mapping of grefs=>{page mapped to by gref} in
> an array. As the grefs are not known apriori, and provide no
> guarantees on their ordering, we have to perform a linear search
> through this array to find the page, for every gref we receive. The
> overhead of this is low, however future work might want to use a more
> efficient data structure to reduce this O(n) operation.
> 
> We (ijc, and myself) have introduced a new constant,
> BLKIF_MAX_PERS_REQUESTS_PER_DEV. This is to prevent a malicious guest
> from attempting a DoS, by supplying fresh grefs, causing the Dom0
> kernel from to map excessively. This is currently set to 256---the
> maximum number of entires in the ring. As the number of inflight
> requests <= size of ring, 256 is also the maximum sensible size. This
> introduces a maximum overhead of 11MB of mapped memory, per block
> device. In practice, we don't typically use about 60 of these. If the
> guest exceeds the 256 limit, it is either buggy or malicious. We treat
> this in one of two ways:
> 1) If we have mapped <
> BLKIF_MAX_SEGMENTS_PER_REQUEST * BLKIF_MAX_PERS_REQUESTS_PER_DEV
> pages, we will persistently map the grefs. This can occur is previous
> requests have not used all BLKIF_MAX_SEGMENTS_PER_REQUEST segments.
> 2) Otherwise, we revert to non-persistent grants for all future grefs.
> 
> In writing this patch, the question arrises as to if the additional
> cost of performing memcpys in the guest (to/from the pool of granted
> pages) outweigh the gains of not performing TLB shootdowns. The answer
> to that question is `no'. There appears to be very little, if any
> additional cost to the guest of using persistent grants. There is
> perhaps a small saving, from the reduced number of hypercalls
> performed in granting, and ending foreign access.
> 
> 
> Signed-off-by: Oliver Chick <oliver.chick@...rix.com>
> ---
>  drivers/block/xen-blkback/blkback.c |  149 +++++++++++++++++++++++++---
>  drivers/block/xen-blkback/common.h  |   16 +++
>  drivers/block/xen-blkback/xenbus.c  |   21 +++-
>  drivers/block/xen-blkfront.c        |  186 ++++++++++++++++++++++++++++++-----
>  include/xen/interface/io/blkif.h    |    9 ++
>  5 files changed, 338 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 73f196c..f95dee9 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -78,6 +78,7 @@ struct pending_req {
>  	unsigned short		operation;
>  	int			status;
>  	struct list_head	free_list;
> +	u8			is_pers;

Just do what you did in the blkfront.c:

	unsigned int feature_persistent:1

or:
	unsigned int flag_persistent:1

or:
	unsigned int flag_pers_gnt:1

>  };
>  
>  #define BLKBACK_INVALID_HANDLE (~0)
> @@ -128,6 +129,24 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  static void make_response(struct xen_blkif *blkif, u64 id,
>  			  unsigned short op, int st);
>  
> +static void add_pers_gnt(struct pers_gnt *pers_gnt, struct xen_blkif *blkif)
> +{
> +	BUG_ON(blkif->pers_gnt_c >= BLKIF_MAX_PERS_REQUESTS_PER_DEV *
> +	       BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +	blkif->pers_gnts[blkif->pers_gnt_c++] = pers_gnt;
> +}
> +
> +static struct pers_gnt *get_pers_gnt(struct xen_blkif *blkif,
> +				     grant_ref_t gref)
> +{
> +	int i;
> +
> +	for (i = 0; i < blkif->pers_gnt_c; i++)
> +		if (gref == blkif->pers_gnts[i]->gnt)
> +			return blkif->pers_gnts[i];
> +	return NULL;
> +}
> +
>  /*
>   * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>   */
> @@ -274,6 +293,12 @@ int xen_blkif_schedule(void *arg)
>  {
>  	struct xen_blkif *blkif = arg;
>  	struct xen_vbd *vbd = &blkif->vbd;
> +	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct pers_gnt *pers_gnt;
> +	int i;
> +	int ret = 0;
> +	int segs_to_unmap;
>  
>  	xen_blkif_get(blkif);
>  
> @@ -301,6 +326,29 @@ int xen_blkif_schedule(void *arg)
>  			print_stats(blkif);
>  	}
>  
> +	/* Free all persistent grant pages */
> +
> +	while (segs_to_unmap = min(BLKIF_MAX_SEGMENTS_PER_REQUEST,
> +				 blkif->pers_gnt_c)) {
> +
> +		for (i = 0; i < segs_to_unmap; i++) {
> +			pers_gnt = blkif->pers_gnts[blkif->pers_gnt_c - i - 1];
> +
> +			gnttab_set_unmap_op(&unmap[i],
> +					    pfn_to_kaddr(page_to_pfn(
> +							     pers_gnt->page)),
> +					    GNTMAP_host_map,
> +					    pers_gnt->handle);
> +
> +			pages[i] = pers_gnt->page;
> +		}
> +
> +		ret = gnttab_unmap_refs(unmap, pages, segs_to_unmap, false);
> +		BUG_ON(ret);
> +
> +		blkif->pers_gnt_c -= segs_to_unmap;
> +
> +	}
> +
>  	if (log_stats)
>  		print_stats(blkif);
>  
> @@ -343,13 +391,28 @@ static void xen_blkbk_unmap(struct pending_req *req)
>  
>  static int xen_blkbk_map(struct blkif_request *req,
>  			 struct pending_req *pending_req,
> -			 struct seg_buf seg[])
> +			 struct seg_buf seg[],
> +			 struct page *pages[])
>  {
>  	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct pers_gnt *new_pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct pers_gnt *pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct pers_gnt *pers_gnt;
> +	phys_addr_t addr;
>  	int i;
> +	int new_map;
>  	int nseg = req->u.rw.nr_segments;
> +	int segs_to_init = 0;

segs_to_map is probably a better name.

>  	int ret = 0;
> +	int use_pers_gnts;
>  
> +	use_pers_gnts = (pending_req->blkif->can_grant_persist &&
> +			 pending_req->blkif->pers_gnt_c <
> +			 BLKIF_MAX_SEGMENTS_PER_REQUEST *
> +			 BLKIF_MAX_PERS_REQUESTS_PER_DEV);
> +
> +	pending_req->is_pers = use_pers_gnts;
>  	/*
>  	 * Fill out preq.nr_sects with proper amount of sectors, and setup
>  	 * assign map[..] with the PFN of the page in our domain with the
> @@ -358,36 +421,87 @@ static int xen_blkbk_map(struct blkif_request *req,
>  	for (i = 0; i < nseg; i++) {
>  		uint32_t flags;
>  
> +		if (use_pers_gnts) {
> +			pers_gnt = get_pers_gnt(pending_req->blkif,
> +						req->u.rw.seg[i].gref);
> +			if (!pers_gnt) {
> +				new_map = 1;
> +				pers_gnt = kmalloc(sizeof(struct pers_gnt),
> +						   GFP_KERNEL);
> +				if (!pers_gnt)
> +					return -ENOMEM;
> +				pers_gnt->page = alloc_page(GFP_KERNEL);
> +				if (!pers_gnt->page)
> +					return -ENOMEM;

You want to kfree pers_gnt here.

> +				pers_gnt->gnt = req->u.rw.seg[i].gref;
> +
> +				pages_to_gnt[segs_to_init] = pers_gnt->page;
> +				new_pers_gnts[segs_to_init] = pers_gnt;
> +
> +				add_pers_gnt(pers_gnt, pending_req->blkif);
> +
> +			} else {
> +				new_map = 0;
> +			}
> +			pages[i] = pers_gnt->page;
> +			addr = (unsigned long) pfn_to_kaddr(
> +				page_to_pfn(pers_gnt->page));

Would it make sense to cache this in the 'pers_gnt' structure?

> +			pers_gnts[i] = pers_gnt;
> +		} else {
> +			new_map = 1;
> +			pages[i] = blkbk->pending_page(pending_req, i);
> +			addr = vaddr(pending_req, i);
> +			pages_to_gnt[i] = blkbk->pending_page(pending_req, i);
> +		}
> +
>  		flags = GNTMAP_host_map;
> -		if (pending_req->operation != BLKIF_OP_READ)
> +		if (!use_pers_gnts && (pending_req->operation != BLKIF_OP_READ))
>  			flags |= GNTMAP_readonly;
> -		gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
> -				  req->u.rw.seg[i].gref,
> -				  pending_req->blkif->domid);
> +		if (new_map) {
> +			gnttab_set_map_op(&map[segs_to_init++], addr,
> +					  flags, req->u.rw.seg[i].gref,
> +					  pending_req->blkif->domid);
> +		}
>  	}
>  
> -	ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
> -	BUG_ON(ret);
> +	if (segs_to_init) {
> +		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_init);
> +		BUG_ON(ret);
> +	}
>  
>  	/*
>  	 * Now swizzle the MFN in our domain with the MFN from the other domain
>  	 * so that when we access vaddr(pending_req,i) it has the contents of
>  	 * the page from the other domain.
>  	 */
> -	for (i = 0; i < nseg; i++) {
> +	for (i = 0; i < segs_to_init; i++) {
>  		if (unlikely(map[i].status != 0)) {
>  			pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>  			map[i].handle = BLKBACK_INVALID_HANDLE;
>  			ret |= 1;
>  		}
>  
> -		pending_handle(pending_req, i) = map[i].handle;
> +		if (use_pers_gnts) {
> +			/* store the `out' values from map */
> +		    pending_req->blkif->pers_gnts
> +			[pending_req->blkif->pers_gnt_c - segs_to_init +
> +			 i]->handle = map[i].handle;
> +			new_pers_gnts[i]->dev_bus_addr = map[i].dev_bus_addr;
> +		}
>  
>  		if (ret)
>  			continue;
> -
> -		seg[i].buf  = map[i].dev_bus_addr |
> -			(req->u.rw.seg[i].first_sect << 9);
> +	}
> +	for (i = 0; i < nseg; i++) {
> +		if (use_pers_gnts) {
> +			pending_handle(pending_req, i) = pers_gnts[i]->handle;
> +			seg[i].buf = pers_gnts[i]->dev_bus_addr |
> +				(req->u.rw.seg[i].first_sect << 9);
> +		} else {
> +			pending_handle(pending_req, i) = map[i].handle;
> +			seg[i].buf = map[i].dev_bus_addr |
> +				(req->u.rw.seg[i].first_sect << 9);
> +		}
>  	}
>  	return ret;
>  }
> @@ -468,7 +582,8 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
>  	 * the proper response on the ring.
>  	 */
>  	if (atomic_dec_and_test(&pending_req->pendcnt)) {
> -		xen_blkbk_unmap(pending_req);
> +		if (!pending_req->is_pers)
> +			xen_blkbk_unmap(pending_req);
>  		make_response(pending_req->blkif, pending_req->id,
>  			      pending_req->operation, pending_req->status);
>  		xen_blkif_put(pending_req->blkif);
> @@ -590,6 +705,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	int operation;
>  	struct blk_plug plug;
>  	bool drain = false;
> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>  
>  	switch (req->operation) {
>  	case BLKIF_OP_READ:
> @@ -676,7 +792,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	 * the hypercall to unmap the grants - that is all done in
>  	 * xen_blkbk_unmap.
>  	 */
> -	if (xen_blkbk_map(req, pending_req, seg))
> +	if (xen_blkbk_map(req, pending_req, seg, pages))
>  		goto fail_flush;
>  
>  	/*
> @@ -688,7 +804,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	for (i = 0; i < nseg; i++) {
>  		while ((bio == NULL) ||
>  		       (bio_add_page(bio,
> -				     blkbk->pending_page(pending_req, i),

Can we get rid of pending_page macro?

> +				     pages[i],
>  				     seg[i].nsec << 9,
>  				     seg[i].buf & ~PAGE_MASK) == 0)) {
>  
> @@ -743,7 +859,8 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  	return 0;
>  
>   fail_flush:
> -	xen_blkbk_unmap(pending_req);
> +	if (!blkif->can_grant_persist)
> +		xen_blkbk_unmap(pending_req);
>   fail_response:
>  	/* Haven't submitted any bio's yet. */
>  	make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index 9ad3b5e..1ecb709 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -47,6 +47,8 @@
>  #define DPRINTK(fmt, args...)				\
>  	pr_debug(DRV_PFX "(%s:%d) " fmt ".\n",		\
>  		 __func__, __LINE__, ##args)
> +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256

You need a comment explaining why this number.

> +
>  
>  
>  /* Not a real protocol.  Used to generate ring structs which contain
> @@ -164,6 +166,14 @@ struct xen_vbd {
>  
>  struct backend_info;
>  
> +
> +struct pers_gnt {
> +	struct page *page;
> +	grant_ref_t gnt;
> +	uint32_t handle;

Not grant_handle_t ?

> +	uint64_t dev_bus_addr;
> +};
> +
>  struct xen_blkif {
>  	/* Unique identifier for this interface. */
>  	domid_t			domid;
> @@ -190,6 +200,12 @@ struct xen_blkif {
>  	struct task_struct	*xenblkd;
>  	unsigned int		waiting_reqs;
>  
> +	/* frontend feature information */
> +	u8			can_grant_persist:1;

Pls move that to the vbd structure, and if you feel
especially adventourous, you could modify the

bool flush_support
bool discard_secure

to be 'unsigned int feature_flush:1' ,etc.. as its own
seperate patch and then introduce the

unsigned int feature_grnt_pers:1

flag.
> +	struct pers_gnt		*pers_gnts[BLKIF_MAX_PERS_REQUESTS_PER_DEV *
> +					  BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	unsigned int		pers_gnt_c;
> +
>  	/* statistics */
>  	unsigned long		st_print;
>  	int			st_rd_req;
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 4f66171..dc58cc4 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -681,6 +681,13 @@ again:
>  		goto abort;
>  	}
>  
> +	err = xenbus_printf(xbt, dev->nodename, "feature-persistent-grants",
> +			    "%d", 1);
> +	if (err) {
> +		xenbus_dev_fatal(dev, err, "writing persistent capability");

It is not fatal. Just do dev_warn, like the xen_blkbk_discard does.

> +		goto abort;
> +	}
> +
>  	/* FIXME: use a typename instead */
>  	err = xenbus_printf(xbt, dev->nodename, "info", "%u",
>  			    be->blkif->vbd.type |
> @@ -721,6 +728,7 @@ static int connect_ring(struct backend_info *be)
>  	struct xenbus_device *dev = be->dev;
>  	unsigned long ring_ref;
>  	unsigned int evtchn;
> +	u8 pers_grants;
>  	char protocol[64] = "";
>  	int err;
>  
> @@ -750,8 +758,17 @@ static int connect_ring(struct backend_info *be)
>  		xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>  		return -1;
>  	}
> -	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
> -		ring_ref, evtchn, be->blkif->blk_protocol, protocol);
> +	err = xenbus_gather(XBT_NIL, dev->otherend,
> +			    "feature-persistent-grants", "%d",
> +			    &pers_grants, NULL);
> +	if (err)
> +		pers_grants = 0;
> +
> +	be->blkif->can_grant_persist = pers_grants;

We should also have a patch for the Xen blkif.h to document this feature.. but that
can be done later.

> +
> +	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
> +		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> +		pers_grants);
>  
>  	/* Map the shared frame, irq etc. */
>  	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index e4fb337..c1cc5fe 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -68,6 +68,13 @@ struct blk_shadow {
>  	struct blkif_request req;
>  	struct request *request;
>  	unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct gnt_list *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +};
> +
> +struct gnt_list {
> +	grant_ref_t gref;
> +	unsigned long frame;

Pls mention what 'frame' is and what the other ones do.

> +	struct gnt_list *tail;
>  };

How did the compiler actually compile this? You are using it in 'struct blk_shadow'
but it is declared later?
>  
>  static DEFINE_MUTEX(blkfront_mutex);
> @@ -97,11 +104,14 @@ struct blkfront_info
>  	struct work_struct work;
>  	struct gnttab_free_callback callback;
>  	struct blk_shadow shadow[BLK_RING_SIZE];
> +	struct gnt_list *persistent_grants_front;

I don't think you need the 'front' in there. Its obvious you are
in the frontend code.

> +	unsigned int persistent_grants_c;
>  	unsigned long shadow_free;
>  	unsigned int feature_flush;
>  	unsigned int flush_op;
>  	unsigned int feature_discard:1;
>  	unsigned int feature_secdiscard:1;
> +	unsigned int feature_persistent:1;
>  	unsigned int discard_granularity;
>  	unsigned int discard_alignment;
>  	int is_ready;
> @@ -286,22 +296,37 @@ static int blkif_queue_request(struct request *req)
>  	struct blkif_request *ring_req;
>  	unsigned long id;
>  	unsigned int fsect, lsect;
> -	int i, ref;
> +	int i, ref, use_pers_gnts, new_pers_gnts;

use_pers_gnts and new_pers_gnt? Can you document what the difference
is pls?

Perhaps 'new_pers_gnts' should be called:
'got_grant'? or 'using_grant' ?

>  	grant_ref_t gref_head;
> +	struct bio_vec *bvecs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> +	struct bio_vec *bvec;
> +	struct req_iterator iter;
> +	char *bvec_data;
> +	void *shared_data;
> +	unsigned long flags;

What kind of flags?
> +	struct page *shared_page;

It is not really shared_page. It is a temporary page. Perhaps
tmp_page ?

> +	struct gnt_list *gnt_list_entry;
>  	struct scatterlist *sg;
>  
>  	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>  		return 1;
>  
> -	if (gnttab_alloc_grant_references(
> -		BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> -		gnttab_request_free_callback(
> -			&info->callback,
> -			blkif_restart_queue_callback,
> -			info,
> -			BLKIF_MAX_SEGMENTS_PER_REQUEST);
> -		return 1;
> +	use_pers_gnts = info->feature_persistent;
> +
> +	if (info->persistent_grants_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> +		new_pers_gnts = 1;
> +		if (gnttab_alloc_grant_references(
> +			    BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> +			gnttab_request_free_callback(
> +				&info->callback,
> +				blkif_restart_queue_callback,
> +				info,
> +				BLKIF_MAX_SEGMENTS_PER_REQUEST);
> +			return 1;
> +		}
>  	} else
> +		new_pers_gnts = 0;
>  
>  	/* Fill out a communications ring structure. */
>  	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> @@ -341,20 +366,66 @@ static int blkif_queue_request(struct request *req)
>  		       BLKIF_MAX_SEGMENTS_PER_REQUEST);
>  
>  		for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> -			buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
>  			fsect = sg->offset >> 9;
>  			lsect = fsect + (sg->length >> 9) - 1;
> -			/* install a grant reference. */
> -			ref = gnttab_claim_grant_reference(&gref_head);
> -			BUG_ON(ref == -ENOSPC);
>  
> -			gnttab_grant_foreign_access_ref(
> +			if (use_pers_gnts && info->persistent_grants_c) {
> +				gnt_list_entry = info->persistent_grants_front;
> +
> +				info->persistent_grants_front =
> +					info->persistent_grants_front->tail;
> +				ref = gnt_list_entry->gref;
> +				buffer_mfn = pfn_to_mfn(gnt_list_entry->frame);

Ah, so frame is a PFN! why don't you just call it that?

> +				info->persistent_grants_c--;
> +			} else {
> +				ref = gnttab_claim_grant_reference(&gref_head);
> +				BUG_ON(ref == -ENOSPC);
> +
> +				if (use_pers_gnts) {
> +					gnt_list_entry =
> +						kmalloc(sizeof(struct gnt_list),
> +								 GFP_ATOMIC);
> +					if (!gnt_list_entry)
> +						return -ENOMEM;
> +
> +					shared_page = alloc_page(GFP_ATOMIC);
> +					if (!shared_page)
> +						return -ENOMEM;

Make sure to kfree gnt_list_entry.

> +
> +					gnt_list_entry->frame =
> +						page_to_pfn(shared_page);
> +					gnt_list_entry->gref = ref;
> +				} else
> +					shared_page = sg_page(sg);
> +
> +				buffer_mfn = pfn_to_mfn(page_to_pfn(
> +								shared_page));
> +				gnttab_grant_foreign_access_ref(
>  					ref,
>  					info->xbdev->otherend_id,
>  					buffer_mfn,
> -					rq_data_dir(req));
> +					!use_pers_gnts && rq_data_dir(req));
> +			}
> +
> +			if (use_pers_gnts)
> +				info->shadow[id].grants_used[i] =
> +					gnt_list_entry;
> +
> +			if (use_pers_gnts && rq_data_dir(req)) {

You can declare the 'shared_data' here:

				void *shared_data;

and also bvec_data.

> +				shared_data = kmap_atomic(
> +					pfn_to_page(gnt_list_entry->frame));
> +				bvec_data = kmap_atomic(sg_page(sg));
> +
> +				memcpy(shared_data + sg->offset,
> +				       bvec_data   + sg->offset,
> +				       sg->length);

Do we need to worry about it spilling over a page? Should we check that
sg>offset + sg->length < PAGE_SIZE ?

Also this would imply that based on the offset (so say it is 3999) that the old data (0->3998)
might be still there - don't know how important that is?
> +
> +				kunmap_atomic(bvec_data);
> +				kunmap_atomic(shared_data);
> +			}
>  
>  			info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> +
>  			ring_req->u.rw.seg[i] =
>  					(struct blkif_request_segment) {
>  						.gref       = ref,
> @@ -368,7 +439,8 @@ static int blkif_queue_request(struct request *req)
>  	/* Keep a private copy so we can reissue requests when recovering. */
>  	info->shadow[id].req = *ring_req;
>  
> -	gnttab_free_grant_references(gref_head);
> +	if (new_pers_gnts)
> +		gnttab_free_grant_references(gref_head);
>  
>  	return 0;
>  }
> @@ -480,12 +552,13 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  static void xlvbd_flush(struct blkfront_info *info)
>  {
>  	blk_queue_flush(info->rq, info->feature_flush);
> -	printk(KERN_INFO "blkfront: %s: %s: %s\n",
> +	printk(KERN_INFO "blkfront: %s: %s: %s. Persistent=%d\n",

Just use %s like the rest
>  	       info->gd->disk_name,
>  	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>  		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>  		"flush diskcache" : "barrier or flush"),
> -	       info->feature_flush ? "enabled" : "disabled");
> +	       info->feature_flush ? "enabled" : "disabled",
> +	    info->feature_persistent);

and this can be 'info->feature_persistent ? "persitent" : "".

>  }
>  
>  static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> @@ -707,6 +780,8 @@ static void blkif_restart_queue(struct work_struct *work)
>  
>  static void blkif_free(struct blkfront_info *info, int suspend)
>  {
> +	struct gnt_list *pers_gnt;
> +
>  	/* Prevent new requests being issued until we fix things up. */
>  	spin_lock_irq(&info->io_lock);
>  	info->connected = suspend ?
> @@ -714,6 +789,15 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  	/* No more blkif_request(). */
>  	if (info->rq)
>  		blk_stop_queue(info->rq);
> +
> +	/* Remove all persistent grants */
> +	while (info->persistent_grants_front) {
> +		pers_gnt = info->persistent_grants_front;
> +		info->persistent_grants_front = pers_gnt->tail;
> +		gnttab_end_foreign_access(pers_gnt->gref, 0, 0UL);
> +		kfree(pers_gnt);
> +	}
> +
>  	/* No more gnttab callback work. */
>  	gnttab_cancel_free_callback(&info->callback);
>  	spin_unlock_irq(&info->io_lock);
> @@ -734,13 +818,44 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  
>  }
>  
> -static void blkif_completion(struct blk_shadow *s)
> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> +			     struct blkif_response *bret)
>  {
>  	int i;
> -	/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> -	 * flag. */
> -	for (i = 0; i < s->req.u.rw.nr_segments; i++)
> -		gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> +	struct gnt_list *new_gnt_list_entry;
> +	struct bio_vec *bvec;
> +	struct req_iterator iter;
> +	unsigned long flags;
> +	char *bvec_data;
> +	void *shared_data;
> +
> +	i = 0;
> +	if (info->feature_persistent == 0) {
> +		/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same
> +		 * place flag. */
> +		for (i = 0; i < s->req.u.rw.nr_segments; i++)
> +			gnttab_end_foreign_access(s->req.u.rw.seg[i].gref,
> +						  0, 0UL);
> +		return;
> +	}
> +

Move the 'i = 0' to here.

> +	if (bret->operation == BLKIF_OP_READ)
> +		rq_for_each_segment(bvec, s->request, iter) {
> +			shared_data = kmap_atomic
> +				(pfn_to_page(s->grants_used[i++]->frame));
> +			bvec_data = bvec_kmap_irq(bvec, &flags);
> +			memcpy(bvec_data, shared_data + bvec->bv_offset,
> +			       bvec->bv_len);
> +			bvec_kunmap_irq(bvec_data, &flags);
> +			kunmap_atomic(shared_data);
> +		}
> +	/* Add the persistent grant into the list of free grants */
> +	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> +		new_gnt_list_entry = s->grants_used[i];
> +		new_gnt_list_entry->tail = info->persistent_grants_front;
> +		info->persistent_grants_front = new_gnt_list_entry;
> +		info->persistent_grants_c++;
> +	}
>  }
>  
>  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> @@ -783,7 +898,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  		req  = info->shadow[id].request;
>  
>  		if (bret->operation != BLKIF_OP_DISCARD)
> -			blkif_completion(&info->shadow[id]);
> +			blkif_completion(&info->shadow[id], info, bret);
>  
>  		if (add_id_to_freelist(info, id)) {
>  			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> @@ -943,6 +1058,12 @@ again:
>  		message = "writing protocol";
>  		goto abort_transaction;
>  	}
> +	err = xenbus_printf(xbt, dev->nodename,
> +			    "feature-persistent-grants", "%d", 1);
> +	if (err) {
> +		message = "Writing persistent grants front feature";
> +		goto abort_transaction;

I wouldn't abort. Just continue on using the non-grant feature.

> +	}
>  
>  	err = xenbus_transaction_end(xbt, 0);
>  	if (err) {
> @@ -1030,6 +1151,7 @@ static int blkfront_probe(struct xenbus_device *dev,
>  	spin_lock_init(&info->io_lock);
>  	info->xbdev = dev;
>  	info->vdevice = vdevice;
> +	info->persistent_grants_c = 0;
>  	info->connected = BLKIF_STATE_DISCONNECTED;
>  	INIT_WORK(&info->work, blkif_restart_queue);
>  
> @@ -1094,6 +1216,7 @@ static int blkif_recover(struct blkfront_info *info)
>  					req->u.rw.seg[j].gref,
>  					info->xbdev->otherend_id,
>  					pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> +					!info->feature_persistent &&
>  					rq_data_dir(info->shadow[req->u.rw.id].request));
>  		}
>  		info->shadow[req->u.rw.id].req = *req;
> @@ -1226,7 +1349,7 @@ static void blkfront_connect(struct blkfront_info *info)
>  	unsigned long sector_size;
>  	unsigned int binfo;
>  	int err;
> -	int barrier, flush, discard;
> +	int barrier, flush, discard, persistent;
>  
>  	switch (info->connected) {
>  	case BLKIF_STATE_CONNECTED:
> @@ -1297,6 +1420,19 @@ static void blkfront_connect(struct blkfront_info *info)
>  		info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
>  	}
>  
> +	/*
> +	 * Are we dealing with an old blkback that will unmap
> +	 * all grefs?
> +	 */
> +	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> +			    "feature-persistent-grants", "%d", &persistent,
> +			    NULL);
> +
> +	if (err)
> +		info->feature_persistent = 0;
> +	else
> +		info->feature_persistent = persistent;
> +
>  	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
>  			    "feature-discard", "%d", &discard,
>  			    NULL);
> diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
> index ee338bf..cd267a9b 100644
> --- a/include/xen/interface/io/blkif.h
> +++ b/include/xen/interface/io/blkif.h
> @@ -109,6 +109,15 @@ typedef uint64_t blkif_sector_t;
>   */
>  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
>  
> +/*
> + * Maximum number of persistent grants that can be mapped by Dom0 for each

by blkback
> + * interface. This is set to be the size of the ring, as this is a limit on

Size of the ring? You sure?
> + * the number of requests that can be inflight at any one time. 256 imposes
> + * an overhead of 11MB of mapped kernel space per interface.
> + */
> +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
> +
> +
>  struct blkif_request_rw {
>  	uint8_t        nr_segments;  /* number of segments                   */
>  	blkif_vdev_t   handle;       /* only for read/write requests         */
> -- 
> 1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/