Message-ID: <20121023172008.GB11787@phenom.dumpdata.com>
Date: Tue, 23 Oct 2012 13:20:08 -0400
From: Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
To: Roger Pau Monné <roger.pau@...rix.com>
Cc: Konrad Rzeszutek Wilk <konrad@...nel.org>,
"xen-devel@...ts.xen.org" <xen-devel@...ts.xen.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers
On Tue, Oct 23, 2012 at 06:07:36PM +0200, Roger Pau Monné wrote:
> On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> > On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
> >> This patch implements persistent grants for the xen-blk{front,back}
> >> mechanism. The effect of this change is to reduce the number of unmap
> >> operations performed, since they cause a (costly) TLB shootdown. This
> >> allows the I/O performance to scale better when a large number of VMs
> >> are performing I/O.
> >>
> >> Previously, the blkfront driver was supplied a bvec[] from the request
> >> queue. This was granted to dom0; dom0 performed the I/O and wrote
> >> directly into the grant-mapped memory and unmapped it; blkfront then
> >> removed foreign access for that grant. The cost of unmapping scales
> >> badly with the number of CPUs in Dom0. An experiment showed that when
> >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> >> ramdisk, the IPIs from performing unmaps become a bottleneck at 5
> >> guests (at which point 650,000 IOPS are being performed in total). If more
> >> than 5 guests are used, the performance declines. By 10 guests, only
> >> 400,000 IOPS are being performed.
> >>
> >> This patch improves performance by only unmapping when the connection
> >> between blkfront and back is broken.
> >>
> >> On startup blkfront notifies blkback that it is using persistent
> >> grants, and blkback will do the same. If blkback is not capable of
> >> persistent mapping, blkfront will still use the same grants, since
> >> this is compatible with the previous protocol and simplifies the
> >> code in blkfront.
> >>
> >> To perform a read, in persistent mode, blkfront uses a separate pool
> >> of pages that it maps to dom0. When a request comes in, blkfront
> >> transmutes the request so that blkback will write into one of these
> >> free pages. Blkback keeps note of which grefs it has already
> >> mapped. When a new ring request comes to blkback, it looks to see if
> >> it has already mapped that page. If so, it will not map it again. If
> >> the page hasn't been previously mapped, it is mapped now, and a record
> >> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> >> notified that blkback has completed a request, it memcpys from the
> >> shared memory into the supplied bvec. A record is kept that the
> >> {gref, page} tuple is mapped but not in flight.
> >>
> >> Writes are similar, except that the memcpy is performed from the
> >> supplied bvecs, into the shared pages, before the request is put onto
> >> the ring.
> >>
> >> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> >> a red-black tree. As the grefs are not known a priori and provide no
> >> guarantees on their ordering, we have to perform a search through
> >> this tree to find the page for every gref we receive. This operation
> >> takes O(log n) time in the worst case.
> >
> > Might want to mention how blkfront stores it as well. It looks
> > to be using 'llist' instead of 'list'? Any particular reason?
>
> Since we are just pushing and popping grant references, I went for
> what I think is the simplest one, a singly linked list (list is a
> doubly linked list). In the previous version Oliver was using
> something similar, but custom made. I think it's best to use the data
> structures provided by the kernel itself.
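>
> For reference, the push/pop pattern with llist boils down to the
> following (a minimal sketch, not the exact driver code):
>
>     /* push a grant back onto the free pool */
>     llist_add(&gnt_list_entry->node, &info->persistent_gnts);
>
>     /* pop a grant off the free pool */
>     gnt_list_entry = llist_entry(
>             llist_del_first(&info->persistent_gnts),
>             struct grant, node);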
>
> >
> >>
> >> The maximum number of grants that blkback will persistently map is
> >> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> >> prevent a malicious guest from attempting a DoS by supplying fresh
> >> grefs, causing the Dom0 kernel to map excessively. If a guest
> >> is using persistent grants and exceeds the maximum number of grants
> >> to map persistently, the newly passed grefs will be mapped and
> >> unmapped. Using this approach, we can have requests that mix
> >> persistent and non-persistent grants, and we need to handle them
> >> correctly. This allows us to set the maximum number of persistent
> >> grants to a lower value than RING_SIZE *
> >> BLKIF_MAX_SEGMENTS_PER_REQUEST, although doing so will lead to
> >> unpredictable performance.
> >>
> >> In writing this patch, the question arises as to whether the
> >> additional cost of performing memcpys in the guest (to/from the pool
> >> of granted pages) outweighs the gains of not performing TLB
> >> shootdowns. The answer to that question is `no'. There appears to be
> >> very little, if any, additional cost to the guest of using
> >> persistent grants. There is perhaps a small saving from the reduced
> >> number of hypercalls performed when granting and ending foreign
> >> access.
> >>
> >> Signed-off-by: Oliver Chick <oliver.chick@...rix.com>
> >> Signed-off-by: Roger Pau Monne <roger.pau@...rix.com>
> >> Cc: <konrad.wilk@...cle.com>
> >> Cc: <linux-kernel@...r.kernel.org>
> >> ---
> >> Benchmarks showing the impact of this patch in blk performance can be
> >> found at:
> >>
> >> http://xenbits.xensource.com/people/royger/persistent_grants/
> >> ---
> >> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
> >> drivers/block/xen-blkback/common.h | 17 ++
> >> drivers/block/xen-blkback/xenbus.c | 16 ++-
> >> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
> >> 4 files changed, 442 insertions(+), 53 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >> index c6decb9..2b982b2 100644
> >> --- a/drivers/block/xen-blkback/blkback.c
> >> +++ b/drivers/block/xen-blkback/blkback.c
> >> @@ -78,6 +78,7 @@ struct pending_req {
> >> unsigned short operation;
> >> int status;
> >> struct list_head free_list;
> >> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> };
> >>
> >> #define BLKBACK_INVALID_HANDLE (~0)
> >> @@ -98,6 +99,30 @@ struct xen_blkbk {
> >> static struct xen_blkbk *blkbk;
> >>
> >> /*
> >> + * Maximum number of grant pages that can be mapped in blkback.
> >> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> >> + * pages that blkback will persistently map.
> >> + */
> >> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
> >> +{
> >> + switch (protocol) {
> >> + case BLKIF_PROTOCOL_NATIVE:
> >> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> + case BLKIF_PROTOCOL_X86_32:
> >> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> + case BLKIF_PROTOCOL_X86_64:
> >> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >
> > Could you include in the comments what size (in bytes) you expect it to be?
> > If that would require you re-doing some tests - then don't worry - but
> > I figured you have some notes scribbled away that have the exact values
> > down.
>
> As far as I know and remember (I've checked the ring size in the past),
> all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is
> always 11, and sizeof(struct persistent_gnt) is 48, so that's 32 * 11 *
> 48 = 16896 bytes. I will add a comment with this calculation.
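>
> Something along these lines (a sketch; the exact wording may differ):
>
>     /*
>      * Each ring holds up to 32 requests of at most
>      * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) segments, so blkback maps
>      * at most 32 * 11 = 352 grants per device. With
>      * sizeof(struct persistent_gnt) == 48 that is 352 * 48 = 16896
>      * bytes of tracking metadata.
>      */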
>
> >
> >> + default:
> >> + BUG();
> >> + }
> >> + return 0;
> >> +}
> >> +
> >> +
> >> +/*
> >> * Little helpful macro to figure out the index and virtual address of the
> >> * pending_pages[..]. For each 'pending_req' we have have up to
> >> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
> >> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> static void make_response(struct xen_blkif *blkif, u64 id,
> >> unsigned short op, int st);
> >>
> >> +#define foreach_grant(pos, rbtree, node) \
> >> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
> >> + &(pos)->node != NULL; \
> >> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
> >> +
> >> +
> >> +static void add_persistent_gnt(struct rb_root *root,
> >> + struct persistent_gnt *persistent_gnt)
> >> +{
> >> + struct rb_node **new = &(root->rb_node), *parent = NULL;
> >> + struct persistent_gnt *this;
> >> +
> >> + /* Figure out where to put new node */
> >> + while (*new) {
> >> + this = container_of(*new, struct persistent_gnt, node);
> >> +
> >> + parent = *new;
> >> + if (persistent_gnt->gnt < this->gnt)
> >> + new = &((*new)->rb_left);
> >> + else if (persistent_gnt->gnt > this->gnt)
> >> + new = &((*new)->rb_right);
> >> + else {
> >> + pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
> >> + BUG();
> >> + }
> >> + }
> >> +
> >> + /* Add new node and rebalance tree. */
> >> + rb_link_node(&(persistent_gnt->node), parent, new);
> >> + rb_insert_color(&(persistent_gnt->node), root);
> >> +}
> >> +
> >> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
> >> + grant_ref_t gref)
> >> +{
> >> + struct persistent_gnt *data;
> >> + struct rb_node *node = root->rb_node;
> >> +
> >> + while (node) {
> >> + data = container_of(node, struct persistent_gnt, node);
> >> +
> >> + if (gref < data->gnt)
> >> + node = node->rb_left;
> >> + else if (gref > data->gnt)
> >> + node = node->rb_right;
> >> + else
> >> + return data;
> >> + }
> >> + return NULL;
> >> +}
> >> +
> >> /*
> >> * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
> >> */
> >> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
> >> {
> >> struct xen_blkif *blkif = arg;
> >> struct xen_vbd *vbd = &blkif->vbd;
> >> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct persistent_gnt *persistent_gnt;
> >> + int ret = 0;
> >> + int segs_to_unmap = 0;
> >>
> >> xen_blkif_get(blkif);
> >>
> >> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
> >> print_stats(blkif);
> >> }
> >>
> >> + /* Free all persistent grant pages */
> >> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
> >
> > Just for sanity - you did check this with blkfronts that did not have
> > persistent grants enabled, right?
>
> Yes, it doesn't crash, but looking at foreach_grant it seems like it
> should. I've added a check before trying to iterate over the tree.
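>
> Roughly (a sketch of the added check, assuming it sits in
> xen_blkif_schedule around the hunk below):
>
>     /* Free all persistent grant pages, if any were mapped */
>     if (!RB_EMPTY_ROOT(&blkif->persistent_gnts)) {
>             foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
>                     /* unmap and free as before */
>                     ...
>             }
>     }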
>
> >
> >> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
> >> + gnttab_set_unmap_op(&unmap[segs_to_unmap],
> >> + (unsigned long) pfn_to_kaddr(page_to_pfn(
> >> + persistent_gnt->page)),
> >> + GNTMAP_host_map,
> >> + persistent_gnt->handle);
> >> +
> >> + pages[segs_to_unmap] = persistent_gnt->page;
> >> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
> >> + kfree(persistent_gnt);
> >> + blkif->persistent_gnt_c--;
> >> +
> >> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
> >> + !rb_next(&persistent_gnt->node)) {
> >> + ret = gnttab_unmap_refs(unmap, NULL, pages,
> >> + segs_to_unmap);
> >> + BUG_ON(ret);
> >> + segs_to_unmap = 0;
> >> + }
> >> + }
> >> +
> >> + BUG_ON(blkif->persistent_gnt_c != 0);
> >> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
> >> +
> >> if (log_stats)
> >> print_stats(blkif);
> >>
> >> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
> >> int ret;
> >>
> >> for (i = 0; i < req->nr_pages; i++) {
> >> + if (!req->unmap_seg[i])
> >> + continue;
> >
> > Perhaps there should be a #define for that array..
>
> Do you mean something like:
>
> #define unmap(req, i) req->unmap_seg[i]
I was thinking that you could just check req->unmap_seg[i] for a
non-zero value. But since that array is just used as a check to see
whether the functionality is enabled (or not), you might want to
declare the right values, so:
#define UNMAP_SG_ON 1
#define UNMAP_SG_OFF 0
or so.
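
The call sites in xen_blkbk_map/xen_blkbk_unmap would then read
something like:

    pending_req->unmap_seg[i] = UNMAP_SG_OFF; /* persistent, skip unmap */
    pending_req->unmap_seg[i] = UNMAP_SG_ON;  /* regular, unmap later */
    ...
    if (req->unmap_seg[i] == UNMAP_SG_OFF)
            continue;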
>
> >> handle = pending_handle(req, i);
> >> if (handle == BLKBACK_INVALID_HANDLE)
> >> continue;
> >> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
> >>
> >> static int xen_blkbk_map(struct blkif_request *req,
> >> struct pending_req *pending_req,
> >> - struct seg_buf seg[])
> >> + struct seg_buf seg[],
> >> + struct page *pages[])
> >> {
> >> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> - int i;
> >> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct persistent_gnt *persistent_gnt = NULL;
> >> + struct xen_blkif *blkif = pending_req->blkif;
> >> + phys_addr_t addr = 0;
> >> + int i, j;
> >> + int new_map;
> >
> > Just use a bool for this.
>
> Done
>
> >> int nseg = req->u.rw.nr_segments;
> >> + int segs_to_map = 0;
> >> int ret = 0;
> >> + int use_persistent_gnts;
> >> +
> >> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
> >> +
> >> + BUG_ON(blkif->persistent_gnt_c >
> >> + max_mapped_grant_pages(pending_req->blkif->blk_protocol));
> >>
> >> /*
> >> * Fill out preq.nr_sects with proper amount of sectors, and setup
> >> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
> >> for (i = 0; i < nseg; i++) {
> >> uint32_t flags;
> >>
> >> - flags = GNTMAP_host_map;
> >> - if (pending_req->operation != BLKIF_OP_READ)
> >> - flags |= GNTMAP_readonly;
> >> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
> >> - req->u.rw.seg[i].gref,
> >> - pending_req->blkif->domid);
> >> + if (use_persistent_gnts)
> >> + persistent_gnt = get_persistent_gnt(
> >> + &blkif->persistent_gnts,
> >> + req->u.rw.seg[i].gref);
> >> +
> >> + if (persistent_gnt) {
> >> + /*
> >> + * We are using persistent grants and
> >> + * the grant is already mapped
> >> + */
> >> + new_map = 0;
> >> + } else if (use_persistent_gnts &&
> >> + blkif->persistent_gnt_c <
> >> + max_mapped_grant_pages(blkif->blk_protocol)) {
> >> + /*
> >> + * We are using persistent grants, the grant is
> >> + * not mapped but we have room for it
> >> + */
> >> + new_map = 1;
> >> + persistent_gnt = kzalloc(
> >> + sizeof(struct persistent_gnt),
> >> + GFP_KERNEL);
> >> + if (!persistent_gnt)
> >> + return -ENOMEM;
> >> + persistent_gnt->page = alloc_page(GFP_KERNEL);
> >> + if (!persistent_gnt->page) {
> >> + kfree(persistent_gnt);
> >> + return -ENOMEM;
> >> + }
> >> + persistent_gnt->gnt = req->u.rw.seg[i].gref;
> >> +
> >> + pages_to_gnt[segs_to_map] =
> >> + persistent_gnt->page;
> >> + addr = (unsigned long) pfn_to_kaddr(
> >> + page_to_pfn(persistent_gnt->page));
> >> +
> >> + add_persistent_gnt(&blkif->persistent_gnts,
> >> + persistent_gnt);
> >> + blkif->persistent_gnt_c++;
> >> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
> >> + persistent_gnt->gnt, blkif->persistent_gnt_c,
> >> + max_mapped_grant_pages(blkif->blk_protocol));
> >> + } else {
> >> + /*
> >> + * We are either using persistent grants and
> >> + * hit the maximum limit of grants mapped,
> >> + * or we are not using persistent grants.
> >> + */
> >> + if (use_persistent_gnts &&
> >> + !blkif->vbd.overflow_max_grants) {
> >> + blkif->vbd.overflow_max_grants = 1;
> >> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> >> + blkif->domid, blkif->vbd.handle);
> >> + }
> >> + new_map = 1;
> >> + pages[i] = blkbk->pending_page(pending_req, i);
> >> + addr = vaddr(pending_req, i);
> >> + pages_to_gnt[segs_to_map] =
> >> + blkbk->pending_page(pending_req, i);
> >> + }
> >> +
> >> + if (persistent_gnt) {
> >> + pages[i] = persistent_gnt->page;
> >> + persistent_gnts[i] = persistent_gnt;
> >> + } else {
> >> + persistent_gnts[i] = NULL;
> >> + }
> >> +
> >> + if (new_map) {
> >> + flags = GNTMAP_host_map;
> >> + if (!persistent_gnt &&
> >> + (pending_req->operation != BLKIF_OP_READ))
> >> + flags |= GNTMAP_readonly;
> >> + gnttab_set_map_op(&map[segs_to_map++], addr,
> >> + flags, req->u.rw.seg[i].gref,
> >> + blkif->domid);
> >> + }
> >> }
> >>
> >> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
> >> - BUG_ON(ret);
> >> + if (segs_to_map) {
> >> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
> >> + BUG_ON(ret);
> >> + }
> >>
> >> /*
> >> * Now swizzle the MFN in our domain with the MFN from the other domain
> >> * so that when we access vaddr(pending_req,i) it has the contents of
> >> * the page from the other domain.
> >> */
> >> - for (i = 0; i < nseg; i++) {
> >> - if (unlikely(map[i].status != 0)) {
> >> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> >> - map[i].handle = BLKBACK_INVALID_HANDLE;
> >> - ret |= 1;
> >> + for (i = 0, j = 0; i < nseg; i++) {
> >> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
> >> + /* This is a newly mapped grant */
> >> + BUG_ON(j >= segs_to_map);
> >> + if (unlikely(map[j].status != 0)) {
> >
> > I am not seeing j being incremented anywhere. Should it be?
>
> Yes, it should be incremented, but not here. See the comment below to
> see what I've changed.
>
> >
> >> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> >> + map[j].handle = BLKBACK_INVALID_HANDLE;
> >> + ret |= 1;
> >> + if (persistent_gnts[i]) {
> >> + rb_erase(&persistent_gnts[i]->node,
> >> + &blkif->persistent_gnts);
> >> + blkif->persistent_gnt_c--;
> >> + kfree(persistent_gnts[i]);
> >> + persistent_gnts[i] = NULL;
> >> + }
> >> + }
> >> + }
> >> + if (persistent_gnts[i]) {
> >> + if (!persistent_gnts[i]->handle) {
> >> + /*
> >> + * If this is a new persistent grant
> >> + * save the handler
> >> + */
> >> + persistent_gnts[i]->handle = map[j].handle;
> >> + persistent_gnts[i]->dev_bus_addr =
> >> + map[j++].dev_bus_addr;
> >> + }
> >> + pending_handle(pending_req, i) =
> >> + persistent_gnts[i]->handle;
> >> + pending_req->unmap_seg[i] = 0;
> >
> > Could we have a #define for that?
>
> Sure.
>
> >> +
> >> + if (ret)
> >> + continue;
>
> This should be
>
> if (ret) {
> j++;
> continue;
> }
<nods>
>
> >> +
> >> + seg[i].buf = persistent_gnts[i]->dev_bus_addr |
> >> + (req->u.rw.seg[i].first_sect << 9);
> >> + } else {
> >> + pending_handle(pending_req, i) = map[j].handle;
> >> + pending_req->unmap_seg[i] = 1;
> >
> > And here as well?
>
> Done.
>
> >> +
> >> + if (ret)
> >> + continue;
> >> +
> >> + seg[i].buf = map[j++].dev_bus_addr |
> >> + (req->u.rw.seg[i].first_sect << 9);
> >> }
> >> -
> >> - pending_handle(pending_req, i) = map[i].handle;
> >> -
> >> - if (ret)
> >> - continue;
> >> -
> >> - seg[i].buf = map[i].dev_bus_addr |
> >> - (req->u.rw.seg[i].first_sect << 9);
> >> }
> >> return ret;
> >> }
> >> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> int operation;
> >> struct blk_plug plug;
> >> bool drain = false;
> >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >>
> >> switch (req->operation) {
> >> case BLKIF_OP_READ:
> >> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> * the hypercall to unmap the grants - that is all done in
> >> * xen_blkbk_unmap.
> >> */
> >> - if (xen_blkbk_map(req, pending_req, seg))
> >> + if (xen_blkbk_map(req, pending_req, seg, pages))
> >> goto fail_flush;
> >>
> >> /*
> >> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> for (i = 0; i < nseg; i++) {
> >> while ((bio == NULL) ||
> >> (bio_add_page(bio,
> >> - blkbk->pending_page(pending_req, i),
> >> + pages[i],
> >> seg[i].nsec << 9,
> >> seg[i].buf & ~PAGE_MASK) == 0)) {
> >>
> >> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> >> index 9ad3b5e..6c08ee9 100644
> >> --- a/drivers/block/xen-blkback/common.h
> >> +++ b/drivers/block/xen-blkback/common.h
> >> @@ -34,6 +34,7 @@
> >> #include <linux/vmalloc.h>
> >> #include <linux/wait.h>
> >> #include <linux/io.h>
> >> +#include <linux/rbtree.h>
> >> #include <asm/setup.h>
> >> #include <asm/pgalloc.h>
> >> #include <asm/hypervisor.h>
> >> @@ -160,10 +161,22 @@ struct xen_vbd {
> >> sector_t size;
> >> bool flush_support;
> >> bool discard_secure;
> >> +
> >> + unsigned int feature_gnt_persistent:1;
> >> + unsigned int overflow_max_grants:1;
> >
> > I think v3.7-rc1 has this structure changed just a tiny bit, so you
> > might want to rebase it on top of that.
>
> I've done the rebase on top of your linux-next branch, commit
> ad502612c173fff239250c9fe9bdfaaef70b9901.
Thx
>
> >
> >> };
> >>
> >> struct backend_info;
> >>
> >> +
> >> +struct persistent_gnt {
> >> + struct page *page;
> >> + grant_ref_t gnt;
> >> + grant_handle_t handle;
> >> + uint64_t dev_bus_addr;
> >> + struct rb_node node;
> >> +};
> >> +
> >> struct xen_blkif {
> >> /* Unique identifier for this interface. */
> >> domid_t domid;
> >> @@ -190,6 +203,10 @@ struct xen_blkif {
> >> struct task_struct *xenblkd;
> >> unsigned int waiting_reqs;
> >>
> >> + /* frontend feature information */
> >
> > Huh?
>
> Changed it to:
>
> /* tree to store persistent grants */
>
> >> + struct rb_root persistent_gnts;
> >> + unsigned int persistent_gnt_c;
> >> +
> >> /* statistics */
> >> unsigned long st_print;
> >> int st_rd_req;
> >> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> >> index 4f66171..9f88b4e 100644
> >> --- a/drivers/block/xen-blkback/xenbus.c
> >> +++ b/drivers/block/xen-blkback/xenbus.c
> >> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
> >> atomic_set(&blkif->drain, 0);
> >> blkif->st_print = jiffies;
> >> init_waitqueue_head(&blkif->waiting_to_free);
> >> + blkif->persistent_gnts.rb_node = NULL;
> >>
> >> return blkif;
> >> }
> >> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
> >> struct xenbus_device *dev = be->dev;
> >> unsigned long ring_ref;
> >> unsigned int evtchn;
> >> + unsigned int pers_grants;
> >> char protocol[64] = "";
> >> int err;
> >>
> >> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
> >> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
> >> return -1;
> >> }
> >> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
> >> - ring_ref, evtchn, be->blkif->blk_protocol, protocol);
> >> + err = xenbus_gather(XBT_NIL, dev->otherend,
> >> + "feature-persistent-grants", "%u",
> >> + &pers_grants, NULL);
> >> + if (err)
> >> + pers_grants = 0;
> >> +
> >> + be->blkif->vbd.feature_gnt_persistent = pers_grants;
> >> + be->blkif->vbd.overflow_max_grants = 0;
> >> +
> >> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
> >> + ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> >> + pers_grants);
> >
> > Can you make that a string? So it is
> > pers_grants ? "persistent grants" : ""
> >
> > instead of a zero or one value pls?
>
> Yes, done.
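>
> i.e. something like:
>
>     pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
>             ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>             pers_grants ? "persistent grants" : "");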
>
> >>
> >> /* Map the shared frame, irq etc. */
> >> err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >> index 2c2d2e5..206d422 100644
> >> --- a/drivers/block/xen-blkfront.c
> >> +++ b/drivers/block/xen-blkfront.c
> >> @@ -44,6 +44,7 @@
> >> #include <linux/mutex.h>
> >> #include <linux/scatterlist.h>
> >> #include <linux/bitmap.h>
> >> +#include <linux/llist.h>
> >>
> >> #include <xen/xen.h>
> >> #include <xen/xenbus.h>
> >> @@ -64,10 +65,17 @@ enum blkif_state {
> >> BLKIF_STATE_SUSPENDED,
> >> };
> >>
> >> +struct grant {
> >> + grant_ref_t gref;
> >> + unsigned long pfn;
> >> + struct llist_node node;
> >> +};
> >> +
> >> struct blk_shadow {
> >> struct blkif_request req;
> >> struct request *request;
> >> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> };
> >>
> >> static DEFINE_MUTEX(blkfront_mutex);
> >> @@ -97,6 +105,8 @@ struct blkfront_info
> >> struct work_struct work;
> >> struct gnttab_free_callback callback;
> >> struct blk_shadow shadow[BLK_RING_SIZE];
> >> + struct llist_head persistent_gnts;
> >> + unsigned int persistent_gnts_c;
> >> unsigned long shadow_free;
> >> unsigned int feature_flush;
> >> unsigned int flush_op;
> >> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
> >> unsigned long id;
> >> unsigned int fsect, lsect;
> >> int i, ref;
> >> +
> >> + /*
> >> + * Used to store if we are able to queue the request by just using
> >> + * existing persistent grants (0), or if we have to get new grants,
> >
> > What does the zero mean?
>
> Frankly, no idea. I guess it was in Oliver's patch and I missed it.
>
> >> + * as there are not sufficiently many free.
> >> + */
> >> + int new_persistent_gnts;
> >
> > I think this can be a bool?
>
> I agree.
>
> >> grant_ref_t gref_head;
> >> + struct page *granted_page;
> >> + struct grant *gnt_list_entry = NULL;
> >> struct scatterlist *sg;
> >>
> >> if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> >> return 1;
> >>
> >> - if (gnttab_alloc_grant_references(
> >> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> >> - gnttab_request_free_callback(
> >> - &info->callback,
> >> - blkif_restart_queue_callback,
> >> - info,
> >> - BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >> - return 1;
> >> - }
> >> +	/* Check if we have enough grants to allocate a request */
> >> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> >> + new_persistent_gnts = 1;
> >> + if (gnttab_alloc_grant_references(
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
> >> + &gref_head) < 0) {
> >> + gnttab_request_free_callback(
> >> + &info->callback,
> >> + blkif_restart_queue_callback,
> >> + info,
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >> + return 1;
> >> + }
> >> + } else
> >> + new_persistent_gnts = 0;
> >>
> >> /* Fill out a communications ring structure. */
> >> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> >> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
> >> BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >>
> >> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> >> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
> >> fsect = sg->offset >> 9;
> >> lsect = fsect + (sg->length >> 9) - 1;
> >> - /* install a grant reference. */
> >> - ref = gnttab_claim_grant_reference(&gref_head);
> >> - BUG_ON(ref == -ENOSPC);
> >>
> >> - gnttab_grant_foreign_access_ref(
> >> - ref,
> >> + if (info->persistent_gnts_c) {
> >> + BUG_ON(llist_empty(&info->persistent_gnts));
> >> + gnt_list_entry = llist_entry(
> >> + llist_del_first(&info->persistent_gnts),
> >> + struct grant, node);
> >> +
> >> + ref = gnt_list_entry->gref;
> >> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> >> + info->persistent_gnts_c--;
> >> + } else {
> >> + ref = gnttab_claim_grant_reference(&gref_head);
> >> + BUG_ON(ref == -ENOSPC);
> >> +
> >> + gnt_list_entry =
> >> + kmalloc(sizeof(struct grant),
> >> + GFP_ATOMIC);
> >> + if (!gnt_list_entry)
> >> + return -ENOMEM;
> >> +
> >> + granted_page = alloc_page(GFP_ATOMIC);
> >> + if (!granted_page) {
> >> + kfree(gnt_list_entry);
> >> + return -ENOMEM;
> >> + }
> >> +
> >> + gnt_list_entry->pfn =
> >> + page_to_pfn(granted_page);
> >> + gnt_list_entry->gref = ref;
> >> +
> >> + buffer_mfn = pfn_to_mfn(page_to_pfn(
> >> + granted_page));
> >> + gnttab_grant_foreign_access_ref(ref,
> >> info->xbdev->otherend_id,
> >> - buffer_mfn,
> >> - rq_data_dir(req));
> >> + buffer_mfn, 0);
> >> + }
> >> +
> >> + info->shadow[id].grants_used[i] = gnt_list_entry;
> >> +
> >> + if (rq_data_dir(req)) {
> >> + char *bvec_data;
> >> + void *shared_data;
> >> +
> >> + BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> >> +
> >> + shared_data = kmap_atomic(
> >> + pfn_to_page(gnt_list_entry->pfn));
> >> + bvec_data = kmap_atomic(sg_page(sg));
> >> +
> >> + /*
> >> + * this does not wipe data stored outside the
> >> + * range sg->offset..sg->offset+sg->length.
> >> + * Therefore, blkback *could* see data from
> >> + * previous requests. This is OK as long as
> >> + * persistent grants are shared with just one
> >> + * domain. It may need refactoring if This
> > .. this (lowercase it pls)
>
> Done.
>
> >
> >> + * changes
> >> + */
> >> + memcpy(shared_data + sg->offset,
> >> + bvec_data + sg->offset,
> >> + sg->length);
> >> +
> >> + kunmap_atomic(bvec_data);
> >> + kunmap_atomic(shared_data);
> >> + }
> >>
> >> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> >> ring_req->u.rw.seg[i] =
> >> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
> >> /* Keep a private copy so we can reissue requests when recovering. */
> >> info->shadow[id].req = *ring_req;
> >>
> >> - gnttab_free_grant_references(gref_head);
> >> + if (new_persistent_gnts)
> >> + gnttab_free_grant_references(gref_head);
> >>
> >> return 0;
> >> }
> >> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> >> static void xlvbd_flush(struct blkfront_info *info)
> >> {
> >> blk_queue_flush(info->rq, info->feature_flush);
> >> - printk(KERN_INFO "blkfront: %s: %s: %s\n",
> >> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
> >
> > HA! By default, eh?
>
> Yes, you caught me, there's a paragraph in the commit message that
> explains that we are using persistent grants in the frontend
> unconditionally, since the protocol is compatible (you can have a
> persistent blkfront and a non-persistent blkback). It simplifies the
> logic in blkfront. Are you OK with it?
It is OK, but you should be checking whether the backend supports it.
I don't see it checking info->feature_persistent_grant before printing
that.
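
Something like this sketch (the field name is hypothetical; use
whatever blkfront ends up storing the negotiated feature in):

    printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
           info->gd->disk_name, ...,
           info->feature_persistent ? "using persistent grants" : "");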
>
> >> info->gd->disk_name,
> >> info->flush_op == BLKIF_OP_WRITE_BARRIER ?
> >> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
> >> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
> >>
> >> static void blkif_free(struct blkfront_info *info, int suspend)
> >> {
> >> + struct llist_node *all_gnts;
> >> + struct grant *persistent_gnt;
> >> +
> >> /* Prevent new requests being issued until we fix things up. */
> >> spin_lock_irq(&info->io_lock);
> >> info->connected = suspend ?
> >> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >> /* No more blkif_request(). */
> >> if (info->rq)
> >> blk_stop_queue(info->rq);
> >> +
> >> + /* Remove all persistent grants */
> >> + if (info->persistent_gnts_c) {
> >> + all_gnts = llist_del_all(&info->persistent_gnts);
> >> + llist_for_each_entry(persistent_gnt, all_gnts, node) {
> >> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> >> + kfree(persistent_gnt);
> >> + }
> >> + info->persistent_gnts_c = 0;
> >> + }
> >> +
> >> /* No more gnttab callback work. */
> >> gnttab_cancel_free_callback(&info->callback);
> >> spin_unlock_irq(&info->io_lock);
> >> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >>
> >> }
> >>
> >> -static void blkif_completion(struct blk_shadow *s)
> >> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> >> + struct blkif_response *bret)
> >> {
> >> int i;
> >> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> >> - * flag. */
> >> - for (i = 0; i < s->req.u.rw.nr_segments; i++)
> >> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> >> + struct bio_vec *bvec;
> >> + struct req_iterator iter;
> >> + unsigned long flags;
> >> + char *bvec_data;
> >> + void *shared_data;
> >> + unsigned int offset = 0;
> >> +
> >> + if (bret->operation == BLKIF_OP_READ) {
> >> + /*
> >> + * Copy the data received from the backend into the bvec.
> >> + * Since bv_len can be different from PAGE_SIZE, we need to
> >> + * be sure we are actually copying the data from the right
> >> + * shared page.
> >> + */
> >> + rq_for_each_segment(bvec, s->request, iter) {
> >> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
> >> + i = offset >> PAGE_SHIFT;
> >
> > Could you also include a comment about the bug you found here, pls?
>
> There's a comment before the rq_for_each_segment loop that tries to
> explain that. Do you think the following is better?
>
> /*
> * Copy the data received from the backend into the bvec.
> * Since bv_offset can be different than 0, and bv_len different
> * than PAGE_SIZE, we have to keep track of the current offset,
> * to be sure we are copying the data from the right shared page.
Yes. That is good.
> */
>
> >> + shared_data = kmap_atomic(
> >> + pfn_to_page(s->grants_used[i]->pfn));
> >> + bvec_data = bvec_kmap_irq(bvec, &flags);
> >> + memcpy(bvec_data, shared_data + bvec->bv_offset,
> >> + bvec->bv_len);
> >> + bvec_kunmap_irq(bvec_data, &flags);
> >> + kunmap_atomic(shared_data);
> >> + offset += bvec->bv_len;
> >> + }
> >> + }
> >> + /* Add the persistent grant into the list of free grants */
> >> + for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> >> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
> >> + info->persistent_gnts_c++;
> >> + }
> >> }
> >>
> >> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >> req = info->shadow[id].request;
> >>
> >> if (bret->operation != BLKIF_OP_DISCARD)
> >> - blkif_completion(&info->shadow[id]);
> >> + blkif_completion(&info->shadow[id], info, bret);
> >>
> >> if (add_id_to_freelist(info, id)) {
> >> WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> >> @@ -942,6 +1066,11 @@ again:
> >> message = "writing protocol";
> >> goto abort_transaction;
> >> }
> >> + err = xenbus_printf(xbt, dev->nodename,
> >> + "feature-persistent-grants", "%d", 1);
> >
> > So it's %u in blkback, but %d here? Which one should it be?
>
> %u in both places.
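>
> i.e. the frontend write becomes:
>
>     err = xenbus_printf(xbt, dev->nodename,
>                         "feature-persistent-grants", "%u", 1);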
>
> >> + if (err)
> >> + dev_warn(&dev->dev,
> >> + "writing persistent grants feature to xenbus");
> >>
> >> err = xenbus_transaction_end(xbt, 0);
> >> if (err) {
> >> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
> >> spin_lock_init(&info->io_lock);
> >> info->xbdev = dev;
> >> info->vdevice = vdevice;
> >> + init_llist_head(&info->persistent_gnts);
> >> + info->persistent_gnts_c = 0;
> >> info->connected = BLKIF_STATE_DISCONNECTED;
> >> INIT_WORK(&info->work, blkif_restart_queue);
> >>
> >> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
> >> req->u.rw.seg[j].gref,
> >> info->xbdev->otherend_id,
> >> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> >> - rq_data_dir(info->shadow[req->u.rw.id].request));
> >> + 0);
> >> }
> >> info->shadow[req->u.rw.id].req = *req;
> >>
> >> --
> >> 1.7.7.5 (Apple Git-26)
> >>