linux-kernel - Re: [Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <561396DF.9040406@citrix.com>
Date:	Tue, 6 Oct 2015 11:39:43 +0200
From:	Roger Pau Monné <roger.pau@...rix.com>
To:	Julien Grall <julien.grall@...rix.com>,
	<xen-devel@...ts.xenproject.org>
CC:	<ian.campbell@...rix.com>, <stefano.stabellini@...citrix.com>,
	<linux-kernel@...r.kernel.org>,
	David Vrabel <david.vrabel@...rix.com>,
	"Boris Ostrovsky" <boris.ostrovsky@...cle.com>
Subject: Re: [Xen-devel] [PATCH 2/2] block/xen-blkfront: Handle non-indirect
 grant with 64KB pages

El 05/10/15 a les 19.05, Julien Grall ha escrit:
> On 05/10/15 17:01, Roger Pau Monné wrote:
>> El 11/09/15 a les 21.32, Julien Grall ha escrit:
>>> The minimal size of request in the block framework is always PAGE_SIZE.
>>> It means that when 64KB guest is support, the request will at least be
>>> 64KB.
>>>
>>> Although, if the backend doesn't support indirect grant (such as QDISK
>>> in QEMU), a ring request is only able to accomodate 11 segments of 4KB
>>> (i.e 44KB).
>>>
>>> The current frontend is assuming that an I/O request will always fit in
>>> a ring request. This is not true any more when using 64KB page
>>> granularity and will therefore crash during the boot.
>>                                        ^ during boot.
>>>
>>> On ARM64, the ABI is completely neutral to the page granularity used by
>>> the domU. The guest has the choice between different page granularity
>>> supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
>>> This can't be enforced by the hypervisor and therefore it's possible to
>>> run guests using different page granularity.
>>>
>>> So we can't mandate the block backend to support non-indirect grant
>>> when the frontend is using 64KB page granularity and have to fix it
>>> properly in the frontend.
>>>
>>> The solution exposed below is based on modifying directly the frontend
>>> guest rather than asking the block framework to support smaller size
>>> (i.e < PAGE_SIZE). This is because the change is the block framework are
>>> not trivial as everything seems to relying on a struct *page (see [1]).
>>> Although, it may be possible that someone succeed to do it in the future
>>> and we would therefore be able to use advantage.
>>                                        ^ it. (no advantage IMHO)
>>>
>>> Given that a block request may not fit in a single ring request, a
>>> second request is introduced for the data that cannot fit in the first
>>> one. This means that the second request should never be used on Linux
>>> configuration using a page granularity < 44KB.
>>   ^ if the page size is smaller than 44KB.
>>>
>>> Note that the parameters blk_queue_max_* helpers haven't been updated.
>>> The block code will set mimimum size supported and we may be able  to
>>                           ^ the minimum                extra space ^
>>> support directly any change in the block framework that lower down the
>>> mimimal size of a request.
>>   ^ minimal
>>
>> I have a concern regarding the splitting done in this patch.
>>
>> What happens with FUA requests when split? For example the frontend
>> receives a FUA requests with 64KB of data, and this is split into two
>> different requests on the ring, is this going to cause trouble in the
>> higher layers if for example the first request is completed but the
>> second is not? Could we leave the disk in a bad state as a consequence
>> of this?
> 
> If a block request is split into two ring requests, we will wait the two
> responses before reporting the completion to the higher layers (see
> blkif_interrupt and blkif_completion).
> 
> Furthermore, the second ring request will always use the same operation
> as the first one. Note that you will flush twice which is not nice but
> could be improved.
> 
>>
>>> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>>>
>>> Signed-off-by: Julien Grall <julien.grall@...rix.com>
>>>
>>> ---
>>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
>>> Cc: "Roger Pau Monné" <roger.pau@...rix.com>
>>> Cc: Boris Ostrovsky <boris.ostrovsky@...cle.com>
>>> Cc: David Vrabel <david.vrabel@...rix.com>
>>> ---
>>>  drivers/block/xen-blkfront.c | 199 +++++++++++++++++++++++++++++++++++++++----
>>>  1 file changed, 183 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>> index f9d55c3..03772c9 100644
>>> --- a/drivers/block/xen-blkfront.c
>>> +++ b/drivers/block/xen-blkfront.c
>>> @@ -60,6 +60,20 @@
>>>  
>>>  #include <asm/xen/hypervisor.h>
>>>  
>>> +/*
>>> + * The block framework is always working on segment of PAGE_SIZE minimum.
>>
>> The above sentence needs to be reworded.
> 
> What about:
> 
> "The mininal size of the segment supported by the block framework is
> PAGE_SIZE."

That sounds fine.

> 
>>
>>> + * When Linux is using a different page size than xen, it may not be possible
>>> + * to put all the data in a single segment.
>>> + * This can happen when the backend doesn't support indirect grant and
>>                                      indirect requests ^
>>> + * therefore the maximum amount of data that a request can carry is
>>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
>>> + *
>>> + * Note that we only support one extra request. So the Linux page size
>>> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
>>> + * 88KB.
>>> + */
>>> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
>>
>> Since you are already introducing a preprocessor define, I would like
>> the code added to be protected by it.
> 
> Most of the code is protected by an if (HAS_REQ_EXTRA && ...) or
> variable setting based on HAS_REQ_EXTRA. Furthermore there is extra
> protection with some BUG_ON.
> 
> I would rather avoid to use the preprocessor to avoid ending up with:
> 
> #ifdef HAS_EXTRA_REQ
> /* Code */
> #else
> /* Code */
> #endif
> 
> and potentially miss update on the ARM64 with 64KB pages because
> currently most of people are developing PV drivers on x86.

Ok, I was also thinking whether a compile time assert would be fine:

BUILD_BUG_ON(PAGE_SIZE > 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST *
XEN_PAGE_SIZE);

But then if indirect descriptors are available the assert is no longer true.

>>>  enum blkif_state {
>>>  	BLKIF_STATE_DISCONNECTED,
>>>  	BLKIF_STATE_CONNECTED,
>>> @@ -79,6 +93,19 @@ struct blk_shadow {
>>>  	struct grant **indirect_grants;
>>>  	struct scatterlist *sg;
>>>  	unsigned int num_sg;
>>
>> #if HAS_EXTRA_REQ
>>
>>> +	enum
>>> +	{
>>> +		REQ_WAITING,
>>> +		REQ_DONE,
>>> +		REQ_FAIL
>>> +	} status;
>>> +
>>> +	#define NO_ASSOCIATED_ID ~0UL
>>> +	/*
>>> +	 * Id of the sibling if we ever need 2 requests when handling a
>>> +	 * block I/O request
>>> +	 */
>>> +	unsigned long associated_id;
>>
>> #endif
> 
> See my remark above.
> 
>>
>>>  };
>>>  
>>>  struct split_bio {
>>> @@ -467,6 +494,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
>>>  
>>>  	id = get_id_from_freelist(info);
>>>  	info->shadow[id].request = req;
>>> +	info->shadow[id].status = REQ_WAITING;
>>> +	info->shadow[id].associated_id = NO_ASSOCIATED_ID;
>>>  
>>>  	(*ring_req)->u.rw.id = id;
>>>  
>>> @@ -508,6 +537,9 @@ struct setup_rw_req {
>>>  	bool need_copy;
>>>  	unsigned int bvec_off;
>>>  	char *bvec_data;
>>> +
>>> +	bool require_extra_req;
>>> +	struct blkif_request *ring_req2;
>>
>> extra_ring_req?
> 
> Will do.
> 
> 
>>>  };
>>>  
>>>  static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>> @@ -521,8 +553,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>>  	unsigned int grant_idx = setup->grant_idx;
>>>  	struct blkif_request *ring_req = setup->ring_req;
>>>  	struct blkfront_info *info = setup->info;
>>> +	/*
>>> +	 * We always use the shadow of the first request to store the list
>>> +	 * of grant associated to the block I/O request. This made the
>>               ^ grants                                        ^ makes
>>> +	 * completion more easy to handle even if the block I/O request is
>>> +	 * split.
>>> +	 */
>>>  	struct blk_shadow *shadow = &info->shadow[setup->id];
>>>  
>>> +	if (unlikely(setup->require_extra_req &&
>>> +		     grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
>>> +		/*
>>> +		 * We are using the second request, setup grant_idx
>>> +		 * to be the index of the segment array
>>> +		 */
>>> +		grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +		ring_req = setup->ring_req2;
>>> +	}
>>> +
>>>  	if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
>>>  	    (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
>>>  		if (setup->segments)
>>> @@ -537,7 +585,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>>  
>>>  	gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
>>>  	ref = gnt_list_entry->gref;
>>> -	shadow->grants_used[grant_idx] = gnt_list_entry;
>>> +	/*
>>> +	 * All the grants are stored in the shadow of the first
>>> +	 * request. Therefore we have to use the global index
>>> +	 */
>>> +	shadow->grants_used[setup->grant_idx] = gnt_list_entry;
>>>  
>>>  	if (setup->need_copy) {
>>>  		void *shared_data;
>>> @@ -579,11 +631,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>>>  	(setup->grant_idx)++;
>>>  }
>>>  
>>> +static void blkif_setup_extra_req(struct blkif_request *first,
>>> +				  struct blkif_request *second)
>>> +{
>>> +	uint16_t nr_segments = first->u.rw.nr_segments;
>>> +
>>> +	/*
>>> +	 * The second request is only present when the first request uses
>>> +	 * all its segments. It's always the continuity of the first one
>>> +	 */
>>> +	first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +
>>> +	second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +	second->u.rw.sector_number = first->u.rw.sector_number +
>>> +		(BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
>>
>> Does this need to take into account if sectors have been skipped in the
>> previous requests due to empty data?
> 
> I'm not sure to understand this question.
> 
> AFAIU, the data is always contiguous in a block segment even though it
> can be split accross multiple Linux page. The number of segments is
> calculated from the amount of data see "Calculate the number of grant
> used" the code. The second request will only be created if this number
> is greater than BLKIF_MAX_SEGMENTS_PER_REQUEST.
> 
> So the second request will always be the continuity of the first one.

Ack.

>> Also I'm not sure how correct it is to hardcode 512 here.
> 
> Well, we are hardcoding it in blkfront and the block framework does the
> same (see blk_limits_max_hw_sectors).

Right, seems like the whole block system is always working with 512 as
the basic unit.

>>
>>> +
>>> +	second->u.rw.handle = first->u.rw.handle;
>>> +	second->operation = first->operation;
>>> +}
>>> +
>>>  static int blkif_queue_rw_req(struct request *req)
>>>  {
>>>  	struct blkfront_info *info = req->rq_disk->private_data;
>>> -	struct blkif_request *ring_req;
>>> -	unsigned long id;
>>> +	struct blkif_request *ring_req, *ring_req2 = NULL;
>>> +	unsigned long id, id2 = NO_ASSOCIATED_ID;
>>> +	bool require_extra_req = false;
>>>  	int i;
>>>  	struct setup_rw_req setup = {
>>>  		.grant_idx = 0,
>>> @@ -628,19 +700,19 @@ static int blkif_queue_rw_req(struct request *req)
>>>  	/* Fill out a communications ring structure. */
>>>  	id = blkif_ring_get_request(info, req, &ring_req);
>>>  
>>> -	BUG_ON(info->max_indirect_segments == 0 &&
>>> -	       GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>> -	BUG_ON(info->max_indirect_segments &&
>>> -	       GREFS(req->nr_phys_segments) > info->max_indirect_segments);
>>> -
>>>  	num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
>>>  	num_grant = 0;
>>>  	/* Calculate the number of grant used */
>>>  	for_each_sg(info->shadow[id].sg, sg, num_sg, i)
>>>  	       num_grant += gnttab_count_grant(sg->offset, sg->length);
>>>  
>>> +	require_extra_req = info->max_indirect_segments == 0 &&
>>> +		num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
>>> +	BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
>>> +
>>>  	info->shadow[id].num_sg = num_sg;
>>> -	if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>>> +	if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
>>> +	    likely(!require_extra_req)) {
>>>  		/*
>>>  		 * The indirect operation can only be a BLKIF_OP_READ or
>>>  		 * BLKIF_OP_WRITE
>>> @@ -680,10 +752,29 @@ static int blkif_queue_rw_req(struct request *req)
>>>  			}
>>>  		}
>>>  		ring_req->u.rw.nr_segments = num_grant;
>>> +		if (unlikely(require_extra_req)) {
>>> +			id2 = blkif_ring_get_request(info, req, &ring_req2);
>>
>> How can you guarantee that there's always going to be another free
>> request? AFAICT blkif_queue_rq checks for RING_FULL, but you don't
>> actually know if there's only one slot or more than one available.
> 
> Because the depth of the queue is divided by 2 when the extra request is
> used (see xlvbd_init_blk_queue).

I see, that's quite restrictive but I guess it's better than introducing
a new ring macro in order to figure out if there are at least two free
slots.

Roger.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/