Message-ID: <5612ADEB.40002@citrix.com>
Date:	Mon, 5 Oct 2015 18:05:47 +0100
From:	Julien Grall <julien.grall@...rix.com>
To:	Roger Pau Monné <roger.pau@...rix.com>,
	<xen-devel@...ts.xenproject.org>
CC:	<ian.campbell@...rix.com>, <stefano.stabellini@...citrix.com>,
	<linux-kernel@...r.kernel.org>,
	David Vrabel <david.vrabel@...rix.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>
Subject: Re: [Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect
 grant with 64KB pages

On 05/10/15 17:01, Roger Pau Monné wrote:
> On 11/09/15 at 21:32, Julien Grall wrote:
>> The minimal size of a request in the block framework is always PAGE_SIZE.
>> It means that when a 64KB guest is supported, the request will be at
>> least 64KB.
>>
>> However, if the backend doesn't support indirect grants (such as QDISK
>> in QEMU), a ring request is only able to accommodate 11 segments of 4KB
>> (i.e. 44KB).
>>
>> The current frontend is assuming that an I/O request will always fit in
>> a ring request. This is not true any more when using 64KB page
>> granularity and will therefore crash during the boot.
>                                        ^ during boot.
>>
>> On ARM64, the ABI is completely neutral to the page granularity used by
>> the domU. The guest has the choice between the different page
>> granularities supported by the processor (for instance on ARM64: 4KB,
>> 16KB, 64KB). This can't be enforced by the hypervisor and therefore
>> it's possible to run guests using different page granularities.
>>
>> So we can't mandate the block backend to support indirect grants when
>> the frontend is using 64KB page granularity, and we have to fix it
>> properly in the frontend.
>>
>> The solution exposed below is based on directly modifying the frontend
>> guest rather than asking the block framework to support smaller sizes
>> (i.e. < PAGE_SIZE). This is because the changes in the block framework
>> are not trivial, as everything seems to rely on a struct *page (see [1]).
>> Although, it may be possible that someone succeed to do it in the future
>> and we would therefore be able to use advantage.
>                                        ^ it. (no advantage IMHO)
>>
>> Given that a block request may not fit in a single ring request, a
>> second request is introduced for the data that cannot fit in the first
>> one. This means that the second request should never be used on Linux
>> configuration using a page granularity < 44KB.
>   ^ if the page size is smaller than 44KB.
>>
>> Note that the parameters blk_queue_max_* helpers haven't been updated.
>> The block code will set mimimum size supported and we may be able  to
>                           ^ the minimum                extra space ^
>> support directly any change in the block framework that lower down the
>> mimimal size of a request.
>   ^ minimal
> 
> I have a concern regarding the splitting done in this patch.
> 
> What happens with FUA requests when split? For example, if the frontend
> receives a FUA request with 64KB of data and this is split into two
> different requests on the ring, is this going to cause trouble in the
> higher layers if, for example, the first request is completed but the
> second is not? Could we leave the disk in a bad state as a consequence
> of this?

If a block request is split into two ring requests, we will wait for
both responses before reporting the completion to the higher layers
(see blkif_interrupt and blkif_completion).

Furthermore, the second ring request will always use the same operation
as the first one. Note that this means flushing twice, which is not
nice but could be improved.
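
For the record, a rough sketch of the idea, using the status and
associated_id fields added by this patch (the real blkif_completion
logic differs in the details, e.g. REQ_FAIL handling):

/* Sketch only: mark this shadow as answered and report whether the
 * whole block request (i.e. both ring requests, if it was split) can
 * now be completed towards the higher layers. */
static bool blkif_sketch_request_done(struct blkfront_info *info,
				      unsigned long id)
{
	struct blk_shadow *s = &info->shadow[id];

	s->status = REQ_DONE;

	/* Not split: complete straight away */
	if (s->associated_id == NO_ASSOCIATED_ID)
		return true;

	/* Split: only complete once the sibling has answered too */
	return info->shadow[s->associated_id].status != REQ_WAITING;
}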

> 
>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>
>> Signed-off-by: Julien Grall <julien.grall@...rix.com>
>>
>> ---
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
>> Cc: "Roger Pau Monné" <roger.pau@...rix.com>
>> Cc: Boris Ostrovsky <boris.ostrovsky@...cle.com>
>> Cc: David Vrabel <david.vrabel@...rix.com>
>> ---
>>  drivers/block/xen-blkfront.c | 199 +++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 183 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index f9d55c3..03772c9 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -60,6 +60,20 @@
>>  
>>  #include <asm/xen/hypervisor.h>
>>  
>> +/*
>> + * The block framework is always working on segment of PAGE_SIZE minimum.
> 
> The above sentence needs to be reworded.

What about:

"The mininal size of the segment supported by the block framework is
PAGE_SIZE."

> 
>> + * When Linux is using a different page size than xen, it may not be possible
>> + * to put all the data in a single segment.
>> + * This can happen when the backend doesn't support indirect grant and
>                                      indirect requests ^
>> + * therefore the maximum amount of data that a request can carry is
>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
>> + *
>> + * Note that we only support one extra request. So the Linux page size
>> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
>> + * 88KB.
>> + */
>> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
> 
> Since you are already introducing a preprocessor define, I would like
> the code added to be protected by it.

Most of the code is protected by an if (HAS_EXTRA_REQ && ...) or by a
variable set based on HAS_EXTRA_REQ. Furthermore, there is extra
protection with some BUG_ONs.

I would rather avoid using the preprocessor, to avoid ending up with:

#ifdef HAS_EXTRA_REQ
/* Code */
#else
/* Code */
#endif

and potentially missing updates to the ARM64 64KB-page path, because
currently most people developing PV drivers are working on x86.
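
In other words, the preference is for something like the following,
so that the 64KB-page path is still compiled and type-checked on x86
builds:

	/* HAS_EXTRA_REQ is a compile-time constant: on 4KB-page x86
	 * this folds to if (0) and the body is dropped by the
	 * optimizer, but unlike an #ifdef'd block it is still parsed
	 * and type-checked on every build. */
	if (HAS_EXTRA_REQ && require_extra_req) {
		/* 64KB-page specific handling */
	}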

>>  enum blkif_state {
>>  	BLKIF_STATE_DISCONNECTED,
>>  	BLKIF_STATE_CONNECTED,
>> @@ -79,6 +93,19 @@ struct blk_shadow {
>>  	struct grant **indirect_grants;
>>  	struct scatterlist *sg;
>>  	unsigned int num_sg;
> 
> #if HAS_EXTRA_REQ
> 
>> +	enum
>> +	{
>> +		REQ_WAITING,
>> +		REQ_DONE,
>> +		REQ_FAIL
>> +	} status;
>> +
>> +	#define NO_ASSOCIATED_ID ~0UL
>> +	/*
>> +	 * Id of the sibling if we ever need 2 requests when handling a
>> +	 * block I/O request
>> +	 */
>> +	unsigned long associated_id;
> 
> #endif

See my remark above.

> 
>>  };
>>  
>>  struct split_bio {
>> @@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
>>  
>>  	id = get_id_from_freelist(info);
>>  	info->shadow[id].request = req;
>> +	info->shadow[id].status = REQ_WAITING;
>> +	info->shadow[id].associated_id = NO_ASSOCIATED_ID;
>>  
>>  	(*ring_req)->u.rw.id = id;
>>  
>> @@ -508,6 +537,9 @@ struct setup_rw_req {
>>  	bool need_copy;
>>  	unsigned int bvec_off;
>>  	char *bvec_data;
>> +
>> +	bool require_extra_req;
>> +	struct blkif_request *ring_req2;
> 
> extra_ring_req?

Will do.


>>  };
>>  
>>  static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>> @@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>  	unsigned int grant_idx = setup->grant_idx;
>>  	struct blkif_request *ring_req = setup->ring_req;
>>  	struct blkfront_info *info = setup->info;
>> +	/*
>> +	 * We always use the shadow of the first request to store the list
>> +	 * of grant associated to the block I/O request. This made the
>               ^ grants                                        ^ makes
>> +	 * completion more easy to handle even if the block I/O request is
>> +	 * split.
>> +	 */
>>  	struct blk_shadow *shadow = &info->shadow[setup->id];
>>  
>> +	if (unlikely(setup->require_extra_req &&
>> +		     grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
>> +		/*
>> +		 * We are using the second request, setup grant_idx
>> +		 * to be the index of the segment array
>> +		 */
>> +		grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +		ring_req = setup->ring_req2;
>> +	}
>> +
>>  	if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
>>  	    (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
>>  		if (setup->segments)
>> @@ -537,7 +585,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>  
>>  	gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
>>  	ref = gnt_list_entry->gref;
>> -	shadow->grants_used[grant_idx] = gnt_list_entry;
>> +	/*
>> +	 * All the grants are stored in the shadow of the first
>> +	 * request. Therefore we have to use the global index
>> +	 */
>> +	shadow->grants_used[setup->grant_idx] = gnt_list_entry;
>>  
>>  	if (setup->need_copy) {
>>  		void *shared_data;
>> @@ -579,11 +631,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>  	(setup->grant_idx)++;
>>  }
>>  
>> +static void blkif_setup_extra_req(struct blkif_request *first,
>> +				  struct blkif_request *second)
>> +{
>> +	uint16_t nr_segments = first->u.rw.nr_segments;
>> +
>> +	/*
>> +	 * The second request is only present when the first request uses
>> +	 * all its segments. It's always the continuity of the first one
>> +	 */
>> +	first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +
>> +	second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +	second->u.rw.sector_number = first->u.rw.sector_number +
>> +		(BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
> 
> Does this need to take into account if sectors have been skipped in the
> previous requests due to empty data?

I'm not sure I understand this question.

AFAIU, the data is always contiguous in a block segment, even though it
can be split across multiple Linux pages. The number of segments is
calculated from the amount of data (see "Calculate the number of grant
used" in the code). The second request will only be created if this
number is greater than BLKIF_MAX_SEGMENTS_PER_REQUEST.

So the second request will always be a continuation of the first one.
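
A concrete example, assuming XEN_PAGE_SIZE = 4KB,
BLKIF_MAX_SEGMENTS_PER_REQUEST = 11 and a 64KB block request:

	num_grant = 64KB / 4KB = 16 > 11, so an extra request is needed:
	  first ring request:  segments 0..10 (11 * 4KB = 44KB)
	  second ring request: segments 11..15 (5 * 4KB = 20KB)
	  second sector_number = first sector_number
	                         + (11 * XEN_PAGE_SIZE) / 512
	                       = first sector_number + 88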

> Also I'm not sure how correct it is to hardcode 512 here.

Well, we are hardcoding it in blkfront and the block framework does the
same (see blk_limits_max_hw_sectors).
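
For illustration, the block layer addresses devices in 512-byte sectors
regardless of the logical block or page size; a hypothetical helper
(the names here are illustrative, not taken from the patch) would be:

/* Illustration only: 512 is often written as "<< 9" / ">> 9" in the
 * block layer, since 9 = log2(512). */
#define SECTOR_SHIFT	9

static inline sector_t bytes_to_sectors(unsigned long bytes)
{
	return bytes >> SECTOR_SHIFT;
}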

> 
>> +
>> +	second->u.rw.handle = first->u.rw.handle;
>> +	second->operation = first->operation;
>> +}
>> +
>>  static int blkif_queue_rw_req(struct request *req)
>>  {
>>  	struct blkfront_info *info = req->rq_disk->private_data;
>> -	struct blkif_request *ring_req;
>> -	unsigned long id;
>> +	struct blkif_request *ring_req, *ring_req2 = NULL;
>> +	unsigned long id, id2 = NO_ASSOCIATED_ID;
>> +	bool require_extra_req = false;
>>  	int i;
>>  	struct setup_rw_req setup = {
>>  		.grant_idx = 0,
>> @@ -628,19 +700,19 @@ static int blkif_queue_rw_req(struct request *req)
>>  	/* Fill out a communications ring structure. */
>>  	id = blkif_ring_get_request(info, req, &ring_req);
>>  
>> -	BUG_ON(info->max_indirect_segments == 0 &&
>> -	       GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> -	BUG_ON(info->max_indirect_segments &&
>> -	       GREFS(req->nr_phys_segments) > info->max_indirect_segments);
>> -
>>  	num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
>>  	num_grant = 0;
>>  	/* Calculate the number of grant used */
>>  	for_each_sg(info->shadow[id].sg, sg, num_sg, i)
>>  	       num_grant += gnttab_count_grant(sg->offset, sg->length);
>>  
>> +	require_extra_req = info->max_indirect_segments == 0 &&
>> +		num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +	BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
>> +
>>  	info->shadow[id].num_sg = num_sg;
>> -	if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>> +	if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
>> +	    likely(!require_extra_req)) {
>>  		/*
>>  		 * The indirect operation can only be a BLKIF_OP_READ or
>>  		 * BLKIF_OP_WRITE
>> @@ -680,10 +752,29 @@ static int blkif_queue_rw_req(struct request *req)
>>  			}
>>  		}
>>  		ring_req->u.rw.nr_segments = num_grant;
>> +		if (unlikely(require_extra_req)) {
>> +			id2 = blkif_ring_get_request(info, req, &ring_req2);
> 
> How can you guarantee that there's always going to be another free
> request? AFAICT blkif_queue_rq checks for RING_FULL, but you don't
> actually know if there's only one slot or more than one available.

Because the depth of the queue is divided by 2 when the extra request is
used (see xlvbd_init_blk_queue).
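
I.e. something along these lines (illustrative only; the field names
may not match the actual xlvbd_init_blk_queue changes): since a block
request may consume up to two ring slots, halving the advertised queue
depth guarantees that a non-full ring always has room for both.

	/* Illustrative: each block request may need up to 2 ring slots
	 * when HAS_EXTRA_REQ, so advertise half the ring as queue depth
	 * to guarantee blkif_queue_rq never runs out mid-request. */
	info->tag_set.queue_depth = HAS_EXTRA_REQ ? BLK_RING_SIZE / 2
						  : BLK_RING_SIZE;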

Regards,

-- 
Julien Grall