[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <50160E8B.7020508@redhat.com>
Date: Mon, 30 Jul 2012 12:33:15 +0800
From: Asias He <asias@...hat.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
CC: linux-kernel@...r.kernel.org,
Rusty Russell <rusty@...tcorp.com.au>,
virtualization@...ts.linux-foundation.org, kvm@...r.kernel.org,
Christoph Hellwig <hch@....de>,
Minchan Kim <minchan.kim@...il.com>
Subject: Re: [PATCH V4 3/3] virtio-blk: Add bio-based IO path for virtio-blk
On 07/29/2012 07:11 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 28, 2012 at 10:21:05AM +0800, Asias He wrote:
>> This patch introduces bio-based IO path for virtio-blk.
>>
>> Compared to request-based IO path, bio-based IO path uses driver
>> provided ->make_request_fn() method to bypasses the IO scheduler. It
>> handles the bio to device directly without allocating a request in block
>> layer. This reduces the IO path in guest kernel to achieve high IOPS
>> and lower latency. The downside is that guest can not use the IO
>> scheduler to merge and sort requests. However, this is not a big problem
>> if the backend disk in host side uses faster disk device.
>
> If this optimization depends on the host, then it
> should be reported to the guest using a feature bit,
> as opposed to being guest driven.
>
>> When the bio-based IO path is not enabled, virtio-blk still uses the
>> original request-based IO path, no performance difference is observed.
>>
>> Performance evaluation:
>> -----------------------------
>> 1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
>> kvm tool.
>>
>> Short version:
>> With bio-based IO path, sequential read/write, random read/write
>> IOPS boost : 28%, 24%, 21%, 16%
>> Latency improvement: 32%, 17%, 21%, 16%
>>
>> Long version:
>> With bio-based IO path:
>> seq-read : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
>> seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
>> rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
>> rand-write: io=3095.7MB, bw=96198KB/s, iops=192396 , runt= 32952msec
>> clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
>> clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
>> clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
>> clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
>> cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
>> cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
>> cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
>> cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
>> With request-based IO path:
>> seq-read : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
>> seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
>> rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
>> rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
>> clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
>> clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
>> clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
>> clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
>> cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
>> cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
>> cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
>> cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985
>
>
> So bio based causes cpu to jump up by some 30%?
> What happens if you divide IOPS/CPU?
> Any improvement that comes from increasing the cpu share
> of the given guest on the host will not scale well on
> a typical overcommitted host.
if you add sys and usr time up, the jump is that much.
For ramdisk based device,
bio-based
------------------
>>> 74.08 + 703.84
777.9200000000001
>>> 70.92 + 702.81
773.7299999999999
>>> 72.23 + 695.37
767.6
>>> 69.69 + 654.13
723.8199999999999
>>> 53.28 + 724.19
777.47
req-based
------------------
>>> 53.28 + 724.19
777.47
>>> 49.03 + 633.20
682.23
>>> 55.78 + 722.40
778.18
>>> 56.55 + 690.83
747.38
And for real device(fusion io), the cpu time is smaller with bio path.
bio-based
------------------
>>> 56.79 + 421.70
478.49
>>> 61.81 + 455.53
517.3399999999999
>>> 63.10+455.38
518.48
>>> 62.04 + 453.58
515.62
req-based
------------------
>>> 44.08 + 590.71
634.7900000000001
>>> 48.73 + 610.78
659.51
>>> 45.58 + 581.16
626.74
>>> 48.40 + 599.65
648.05
>
>> 2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
>> kvm tool.
>>
>> Short version:
>> With bio-based IO path, sequential read/write, random read/write
>> IOPS boost : 11%, 11%, 13%, 10%
>> Latency improvement: 10%, 10%, 12%, 10%
>> Long Version:
>> With bio-based IO path:
>> read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
>> write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
>> read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
>> write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
>> clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
>> clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
>> clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
>> clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
>> cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
>> cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
>> cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
>> cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
>> With request-based IO path:
>> read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
>> write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
>> read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
>> write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
>> clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
>> clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
>> clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
>> clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
>> cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
>> cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
>> cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
>> cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409
>
>
> Is this host or guest cpu? We should probably measure host cpu
> as this includes device overhead which could vary by load.
It's guest cpu. Yes, host cpu is also interesting.
>
>> How to use:
>> -----------------------------
>> Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
>> use_bio=1' to enable ->make_request_fn() based I/O path.
>>
>> Cc: Rusty Russell <rusty@...tcorp.com.au>
>> Cc: "Michael S. Tsirkin" <mst@...hat.com>
>> Cc: virtualization@...ts.linux-foundation.org
>> Cc: kvm@...r.kernel.org
>> Cc: linux-kernel@...r.kernel.org
>> Acked-by: Rusty Russell <rusty@...tcorp.com.au>
>> Signed-off-by: Christoph Hellwig <hch@....de>
>> Signed-off-by: Minchan Kim <minchan.kim@...il.com>
>> Signed-off-by: Asias He <asias@...hat.com>
>> ---
>> drivers/block/virtio_blk.c | 203 +++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 163 insertions(+), 40 deletions(-)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index c0bbeb4..95cfeed 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -14,6 +14,9 @@
>>
>> #define PART_BITS 4
>>
>> +static bool use_bio;
>> +module_param(use_bio, bool, S_IRUGO);
>> +
>> static int major;
>> static DEFINE_IDA(vd_index_ida);
>>
>> @@ -23,6 +26,7 @@ struct virtio_blk
>> {
>> struct virtio_device *vdev;
>> struct virtqueue *vq;
>> + wait_queue_head_t queue_wait;
>>
>> /* The disk structure for the kernel. */
>> struct gendisk *disk;
>> @@ -51,53 +55,87 @@ struct virtio_blk
>> struct virtblk_req
>> {
>> struct request *req;
>> + struct bio *bio;
>> struct virtio_blk_outhdr out_hdr;
>> struct virtio_scsi_inhdr in_hdr;
>> u8 status;
>> + struct scatterlist sg[];
>> };
>>
>> -static void blk_done(struct virtqueue *vq)
>> +static inline int virtblk_result(struct virtblk_req *vbr)
>> +{
>> + switch (vbr->status) {
>> + case VIRTIO_BLK_S_OK:
>> + return 0;
>> + case VIRTIO_BLK_S_UNSUPP:
>> + return -ENOTTY;
>> + default:
>> + return -EIO;
>> + }
>> +}
>> +
>> +static inline void virtblk_request_done(struct virtio_blk *vblk,
>> + struct virtblk_req *vbr)
>> +{
>> + struct request *req = vbr->req;
>> + int error = virtblk_result(vbr);
>> +
>> + if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
>> + req->resid_len = vbr->in_hdr.residual;
>> + req->sense_len = vbr->in_hdr.sense_len;
>> + req->errors = vbr->in_hdr.errors;
>> + } else if (req->cmd_type == REQ_TYPE_SPECIAL) {
>> + req->errors = (error != 0);
>> + }
>> +
>> + __blk_end_request_all(req, error);
>> + mempool_free(vbr, vblk->pool);
>> +}
>> +
>> +static inline void virtblk_bio_done(struct virtio_blk *vblk,
>> + struct virtblk_req *vbr)
>> +{
>> + bio_endio(vbr->bio, virtblk_result(vbr));
>> + mempool_free(vbr, vblk->pool);
>> +}
>> +
>> +static void virtblk_done(struct virtqueue *vq)
>> {
>> struct virtio_blk *vblk = vq->vdev->priv;
>> + unsigned long bio_done = 0, req_done = 0;
>> struct virtblk_req *vbr;
>> - unsigned int len;
>> unsigned long flags;
>> + unsigned int len;
>>
>> spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
>> while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
>> - int error;
>> -
>> - switch (vbr->status) {
>> - case VIRTIO_BLK_S_OK:
>> - error = 0;
>> - break;
>> - case VIRTIO_BLK_S_UNSUPP:
>> - error = -ENOTTY;
>> - break;
>> - default:
>> - error = -EIO;
>> - break;
>> - }
>> -
>> - switch (vbr->req->cmd_type) {
>> - case REQ_TYPE_BLOCK_PC:
>> - vbr->req->resid_len = vbr->in_hdr.residual;
>> - vbr->req->sense_len = vbr->in_hdr.sense_len;
>> - vbr->req->errors = vbr->in_hdr.errors;
>> - break;
>> - case REQ_TYPE_SPECIAL:
>> - vbr->req->errors = (error != 0);
>> - break;
>> - default:
>> - break;
>> + if (vbr->bio) {
>> + virtblk_bio_done(vblk, vbr);
>> + bio_done++;
>> + } else {
>> + virtblk_request_done(vblk, vbr);
>> + req_done++;
>> }
>> -
>> - __blk_end_request_all(vbr->req, error);
>> - mempool_free(vbr, vblk->pool);
>> }
>> /* In case queue is stopped waiting for more buffers. */
>> - blk_start_queue(vblk->disk->queue);
>> + if (req_done)
>> + blk_start_queue(vblk->disk->queue);
>> spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
>> +
>> + if (bio_done)
>> + wake_up(&vblk->queue_wait);
>> +}
>> +
>> +static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
>> + gfp_t gfp_mask)
>> +{
>> + struct virtblk_req *vbr;
>> +
>> + vbr = mempool_alloc(vblk->pool, gfp_mask);
>> + if (vbr && use_bio)
>> + sg_init_table(vbr->sg, vblk->sg_elems);
>> +
>> + return vbr;
>> }
>>
>> static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>> @@ -106,13 +144,13 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>> unsigned long num, out = 0, in = 0;
>> struct virtblk_req *vbr;
>>
>> - vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
>> + vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
>> if (!vbr)
>> /* When another request finishes we'll try again. */
>> return false;
>>
>> vbr->req = req;
>> -
>> + vbr->bio = NULL;
>> if (req->cmd_flags & REQ_FLUSH) {
>> vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
>> vbr->out_hdr.sector = 0;
>> @@ -172,7 +210,8 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>> }
>> }
>>
>> - if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
>> + if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
>> + GFP_ATOMIC) < 0) {
>> mempool_free(vbr, vblk->pool);
>> return false;
>> }
>> @@ -180,7 +219,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>> return true;
>> }
>>
>> -static void do_virtblk_request(struct request_queue *q)
>> +static void virtblk_request(struct request_queue *q)
>> {
>> struct virtio_blk *vblk = q->queuedata;
>> struct request *req;
>> @@ -203,6 +242,82 @@ static void do_virtblk_request(struct request_queue *q)
>> virtqueue_kick(vblk->vq);
>> }
>>
>> +static void virtblk_add_buf_wait(struct virtio_blk *vblk,
>> + struct virtblk_req *vbr,
>> + unsigned long out,
>> + unsigned long in)
>> +{
>> + DEFINE_WAIT(wait);
>> +
>> + for (;;) {
>> + prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
>> + TASK_UNINTERRUPTIBLE);
>> +
>> + spin_lock_irq(vblk->disk->queue->queue_lock);
>> + if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
>> + GFP_ATOMIC) < 0) {
>> + spin_unlock_irq(vblk->disk->queue->queue_lock);
>> + io_schedule();
>> + } else {
>> + virtqueue_kick(vblk->vq);
>> + spin_unlock_irq(vblk->disk->queue->queue_lock);
>> + break;
>> + }
>> +
>> + }
>> +
>> + finish_wait(&vblk->queue_wait, &wait);
>> +}
>> +
>> +static void virtblk_make_request(struct request_queue *q, struct bio *bio)
>> +{
>> + struct virtio_blk *vblk = q->queuedata;
>> + unsigned int num, out = 0, in = 0;
>> + struct virtblk_req *vbr;
>> +
>> + BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
>> + BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
>> +
>> + vbr = virtblk_alloc_req(vblk, GFP_NOIO);
>> + if (!vbr) {
>> + bio_endio(bio, -ENOMEM);
>> + return;
>> + }
>> +
>> + vbr->bio = bio;
>> + vbr->req = NULL;
>> + vbr->out_hdr.type = 0;
>> + vbr->out_hdr.sector = bio->bi_sector;
>> + vbr->out_hdr.ioprio = bio_prio(bio);
>> +
>> + sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
>> +
>> + num = blk_bio_map_sg(q, bio, vbr->sg + out);
>> +
>> + sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
>> + sizeof(vbr->status));
>> +
>> + if (num) {
>> + if (bio->bi_rw & REQ_WRITE) {
>> + vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
>> + out += num;
>> + } else {
>> + vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
>> + in += num;
>> + }
>> + }
>> +
>> + spin_lock_irq(vblk->disk->queue->queue_lock);
>> + if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
>> + GFP_ATOMIC) < 0)) {
>> + spin_unlock_irq(vblk->disk->queue->queue_lock);
>> + virtblk_add_buf_wait(vblk, vbr, out, in);
>> + return;
>> + }
>> + virtqueue_kick(vblk->vq);
>> + spin_unlock_irq(vblk->disk->queue->queue_lock);
>> +}
>> +
>> /* return id (s/n) string for *disk to *id_str
>> */
>> static int virtblk_get_id(struct gendisk *disk, char *id_str)
>> @@ -360,7 +475,7 @@ static int init_vq(struct virtio_blk *vblk)
>> int err = 0;
>>
>> /* We expect one virtqueue, for output. */
>> - vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
>> + vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
>> if (IS_ERR(vblk->vq))
>> err = PTR_ERR(vblk->vq);
>>
>> @@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
>> u8 writeback = virtblk_get_cache_mode(vdev);
>> struct virtio_blk *vblk = vdev->priv;
>>
>> - if (writeback)
>> + if (writeback && !use_bio)
>> blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
>> else
>> blk_queue_flush(vblk->disk->queue, 0);
>> @@ -477,6 +592,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>> struct virtio_blk *vblk;
>> struct request_queue *q;
>> int err, index;
>> + int pool_size;
>> +
>> u64 cap;
>> u32 v, blk_size, sg_elems, opt_io_size;
>> u16 min_io_size;
>> @@ -506,10 +623,12 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>> goto out_free_index;
>> }
>>
>> + init_waitqueue_head(&vblk->queue_wait);
>> vblk->vdev = vdev;
>> vblk->sg_elems = sg_elems;
>> sg_init_table(vblk->sg, vblk->sg_elems);
>> mutex_init(&vblk->config_lock);
>> +
>> INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
>> vblk->config_enable = true;
>>
>> @@ -517,7 +636,10 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>> if (err)
>> goto out_free_vblk;
>>
>> - vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
>> + pool_size = sizeof(struct virtblk_req);
>> + if (use_bio)
>> + pool_size += sizeof(struct scatterlist) * sg_elems;
>> + vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
>> if (!vblk->pool) {
>> err = -ENOMEM;
>> goto out_free_vq;
>> @@ -530,12 +652,14 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>> goto out_mempool;
>> }
>>
>> - q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
>> + q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
>> if (!q) {
>> err = -ENOMEM;
>> goto out_put_disk;
>> }
>>
>> + if (use_bio)
>> + blk_queue_make_request(q, virtblk_make_request);
>> q->queuedata = vblk;
>>
>> virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
>> @@ -620,7 +744,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>> if (!err && opt_io_size)
>> blk_queue_io_opt(q, blk_size * opt_io_size);
>>
>> -
>> add_disk(vblk->disk);
>> err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
>> if (err)
>> --
>> 1.7.10.4
--
Asias
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists