Date:	Tue, 16 Jun 2015 19:57:24 +0800
From:	juncheng bai <baijuncheng@...tedstack.com>
To:	Ilya Dryomov <idryomov@...il.com>
CC:	idryomov@...hat.com, Alex Elder <elder@...aro.org>,
	Josh Durgin <josh.durgin@...tank.com>,
	Guangliang Zhao <lucienchao@...il.com>, jeff@...zik.org,
	yehuda@...newdream.net, Sage Weil <sage@...dream.net>,
	elder@...tank.com,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Ceph Development <ceph-devel@...r.kernel.org>
Subject: Re: [PATCH RFC] storage:rbd: make the size of a request equal to
 the size of the object



On 2015/6/16 16:37, Ilya Dryomov wrote:
> On Tue, Jun 16, 2015 at 6:28 AM, juncheng bai
> <baijuncheng@...tedstack.com> wrote:
>>
>>
>> On 2015/6/15 22:27, Ilya Dryomov wrote:
>>>
>>> On Mon, Jun 15, 2015 at 4:23 PM, juncheng bai
>>> <baijuncheng@...tedstack.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2015/6/15 21:03, Ilya Dryomov wrote:
>>>>>
>>>>>
>>>>> On Mon, Jun 15, 2015 at 2:18 PM, juncheng bai
>>>>> <baijuncheng@...tedstack.com> wrote:
>>>>>>
>>>>>>
>>>>>>    From 6213215bd19926d1063d4e01a248107dab8a899b Mon Sep 17 00:00:00
>>>>>> 2001
>>>>>> From: juncheng bai <baijuncheng@...tedstack.com>
>>>>>> Date: Mon, 15 Jun 2015 18:34:00 +0800
>>>>>> Subject: [PATCH] storage:rbd: make the size of a request equal to the
>>>>>>     size of the object
>>>>>>
>>>>>> Ensure that a merged request can grow to the size of the object.
>>>>>> When a bio is merged into a request, or one request into another,
>>>>>> the sum of the segment count of the current request and the
>>>>>> segment count of the bio must not exceed the queue's max segments,
>>>>>> so the maximum request size is 512k when the max segments of the
>>>>>> request is BLK_MAX_SEGMENTS.
>>>>>>
>>>>>> Signed-off-by: juncheng bai <baijuncheng@...tedstack.com>
>>>>>> ---
>>>>>>     drivers/block/rbd.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>> index 0a54c58..dec6045 100644
>>>>>> --- a/drivers/block/rbd.c
>>>>>> +++ b/drivers/block/rbd.c
>>>>>> @@ -3757,6 +3757,8 @@ static int rbd_init_disk(struct rbd_device
>>>>>> *rbd_dev)
>>>>>>            segment_size = rbd_obj_bytes(&rbd_dev->header);
>>>>>>            blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
>>>>>>            blk_queue_max_segment_size(q, segment_size);
>>>>>> +       if (segment_size > BLK_MAX_SEGMENTS * PAGE_SIZE)
>>>>>> +               blk_queue_max_segments(q, segment_size / PAGE_SIZE);
>>>>>>            blk_queue_io_min(q, segment_size);
>>>>>>            blk_queue_io_opt(q, segment_size);
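(A minimal userspace sketch of the 512k limit described in the patch
message above; BLK_MAX_SEGMENTS = 128, 4 KiB pages and one page per
segment are assumptions for illustration, not values read from a
running kernel:)

#include <stdio.h>

int main(void)
{
    /* Assumed values, for illustration only. */
    const unsigned long max_segments = 128;  /* BLK_MAX_SEGMENTS default */
    const unsigned long page_size    = 4096; /* 4 KiB pages */

    /* A bio is only merged into a request while the combined segment
     * count stays within max_segments, so with one page per segment
     * (the worst case) a request stops growing at 128 * 4 KiB. */
    printf("max merged request: %lu KiB\n",
           max_segments * page_size / 1024); /* prints 512 */
    return 0;
}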
>>>>>
>>>>>
>>>>>
>>>>> I made a similar patch on Friday while investigating a blk-mq plugging
>>>>> issue reported by Nick.  My patch sets it to BIO_MAX_PAGES unconditionally -
>>>>> AFAIU there is no point in setting it to anything bigger, since the bios
>>>>> will be clipped to that number of vecs.  Given that BIO_MAX_PAGES is
>>>>> 256, this gives us 1M direct I/Os.
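(The 1M figure follows from the bio vector limit; a minimal sketch,
assuming BIO_MAX_PAGES = 256 and 4 KiB pages:)

#include <stdio.h>

int main(void)
{
    /* Assumed values, for illustration only. */
    const unsigned long bio_max_pages = 256;  /* BIO_MAX_PAGES */
    const unsigned long page_size     = 4096; /* 4 KiB pages */

    /* A single bio carries at most BIO_MAX_PAGES page-sized vecs,
     * so one direct I/O bio tops out at 256 * 4 KiB = 1 MiB. */
    printf("max single-bio I/O: %lu KiB\n",
           bio_max_pages * page_size / 1024); /* prints 1024 */
    return 0;
}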
>>>>
>>>>
>>>> Hi. For a single bio, the max number of bio_vecs is BIO_MAX_PAGES, but a
>>>> request can be merged from multiple bios; see functions such as
>>>> ll_back_merge_fn and ll_front_merge_fn.
>>>> I tested this patch on kernel 3.18 after doing:
>>>> echo 4096 > /sys/block/rbd0/queue/max_sectors_kb
>>>> We used systemtap to trace the request size; it goes up to 4M.
>>>
>>>
>>> Kernel 3.18 predates the rbd blk-mq transition, which happened in 4.0.  You
>>> should test whatever patches you have with at least 4.0.
>>>
>>> Putting that aside, I must be missing something.  You'll get 4M
>>> requests on 3.18 both with your patch and without it; the only
>>> difference would be the size of the bios being merged - 512k vs 1M.  Can
>>> you describe your test workload and provide before and after traces?
>>>
>> Hi. I updated the kernel to 4.0.5. The test information is shown below:
>> The base information:
>> 03:28:13-root@...ver-186:~$uname -r
>> 4.0.5
>>
>> My simple systemtap script:
>> probe module("rbd").function("rbd_img_request_create")
>> {
>>      printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3));
>> }
>>
>> I use dd to execute the test case:
>> dd if=/dev/zero  of=/dev/rbd0 bs=4M count=1 oflag=direct
>>
>> Case one: Without patch
>> 03:30:23-root@...ver-186:~$cat /sys/block/rbd0/queue/max_sectors_kb
>> 4096
>> 03:30:35-root@...ver-186:~$cat /sys/block/rbd0/queue/max_segments
>> 128
>>
>> The output of systemtap for normal data:
>> offset:0 length:524288
>> offset:524288 length:524288
>> offset:1048576 length:524288
>> offset:1572864 length:524288
>> offset:2097152 length:524288
>> offset:2621440 length:524288
>> offset:3145728 length:524288
>> offset:3670016 length:524288
>>
>> Case two: With patch
>> cat /sys/block/rbd0/queue/max_sectors_kb
>> 4096
>> 03:49:14-root@...ver-186:linux-4.0.5$cat /sys/block/rbd0/queue/max_segments
>> 1024
>> The output of systemtap for normal data:
>> offset:0 length:1048576
>> offset:1048576 length:1048576
>> offset:2097152 length:1048576
>> offset:3145728 length:1048576
>>
>> According to the test, you are right,
>> because blk-mq doesn't use any scheduling policy:
>> 03:52:13-root@...ver-186:linux-4.0.5$cat /sys/block/rbd0/queue/scheduler
>> none
>>
>> In kernel versions before 4.0, rbd used the default
>> scheduler, cfq.
>>
>> So I think blk-mq needs to do more?
>
> There is no scheduler support in blk-mq as of now but your numbers
> don't have anything to do with that.  The current behaviour is a result
> of a bug in blk-mq.  It's fixed by [1]; if you apply it you should see
> 4M requests with your stap script.
>
> [1] http://article.gmane.org/gmane.linux.kernel/1941750
>
Hi.
First, let's look at the result on kernel 3.18.
The function blk_limits_max_hw_sectors has a different implementation in
3.18 than in 4.0+, so we need to do:
echo 4094 >/sys/block/rbd0/queue/max_sectors_kb

The rbd device information:
11:13:18-root@...ver-186:~$cat /sys/block/rbd0/queue/max_sectors_kb
4094
11:15:28-root@...ver-186:~$cat /sys/block/rbd0/queue/max_segments
1024

The test command:
dd if=/dev/zero of=/dev/rbd0 bs=4M count=1

The simple stap script:
probe module("rbd").function("rbd_img_request_create")
{
     printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3));
}

The output from stap:
offset:0 length:4190208
offset:21474770944 length:4096

Second, thanks for your patch [1].
I applied the patch [1] and recompiled the kernel.
The test information is shown below:
12:26:12-root@...ver-186:$cat /sys/block/rbd0/queue/max_segments
1024
12:26:23-root@...ver-186:$cat /sys/block/rbd0/queue/max_sectors_kb
4096

The test command:
dd if=/dev/zero  of=/dev/rbd0 bs=4M count=2 oflag=direct

The simple systemtap script:
probe module("rbd").function("rbd_img_request_create")
{
     printf("offset:%lu length:%lu\n", ulong_arg(2), ulong_arg(3));
}

The output of systemtap for normal data:
offset:0 length:4194304
offset:4194304 length:4194304
offset:21474770944 length:4096

So I think the queue's max_segments should be the object size divided
by PAGE_SIZE.
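(A minimal sketch of that rule, assuming a 4 MiB rbd object size and
4 KiB pages; the numbers match the max_segments value and request
sizes observed above:)

#include <stdio.h>

int main(void)
{
    /* Assumed values, for illustration only. */
    const unsigned long object_size = 4UL << 20; /* 4 MiB rbd object */
    const unsigned long page_size   = 4096;      /* 4 KiB pages */

    /* Proposed rule: max_segments = object size / page size. */
    const unsigned long max_segments = object_size / page_size;

    printf("max_segments: %lu\n", max_segments);      /* prints 1024 */
    printf("max request:  %lu bytes\n",
           max_segments * page_size);                 /* prints 4194304 (4 MiB) */
    return 0;
}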

[1] http://article.gmane.org/gmane.linux.kernel/1941750

Thanks.
----
juncheng bai

> Thanks,
>
>                  Ilya
