Message-ID: <20121121115734.GC31911@redhat.com>
Date:	Wed, 21 Nov 2012 13:57:34 +0200
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Asias He <asias@...hat.com>
Cc:	Rusty Russell <rusty@...tcorp.com.au>,
	Christoph Hellwig <hch@...radead.org>, kvm@...r.kernel.org,
	virtualization@...ts.linux-foundation.org, axboe@...nel.dk,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] vhost-blk: Add vhost-blk support v5

On Wed, Nov 21, 2012 at 12:24:55PM +0800, Asias He wrote:
> On 11/20/2012 09:37 PM, Michael S. Tsirkin wrote:
> > On Tue, Nov 20, 2012 at 02:39:40PM +0800, Asias He wrote:
> >> On 11/20/2012 04:26 AM, Michael S. Tsirkin wrote:
> >>> On Mon, Nov 19, 2012 at 04:53:42PM +0800, Asias He wrote:
> >>>> vhost-blk is an in-kernel virtio-blk device accelerator.
> >>>>
> >>>> Due to the lack of a proper in-kernel AIO interface, this version
> >>>> converts the guest's I/O requests to bios and uses submit_bio() to
> >>>> submit I/O directly. So this version only supports raw block devices
> >>>> as the guest's disk image, e.g. /dev/sda, /dev/ram0. We can add file
> >>>> based image support to vhost-blk once we have an in-kernel AIO
> >>>> interface. There is work in progress on an in-kernel AIO interface
> >>>> from Dave Kleikamp and Zach Brown:
> >>>>
> >>>>    http://marc.info/?l=linux-fsdevel&m=133312234313122
> >>>>
> >>>> Performance evaluation:
> >>>> -----------------------------
> >>>> 1) LKVM
> >>>> Fio with libaio ioengine on Fusion IO device using kvm tool
> >>>> IOPS(k)    Before       After   Improvement
> >>>> seq-read   107          121     +13.0%
> >>>> seq-write  130          179     +37.6%
> >>>> rnd-read   102          122     +19.6%
> >>>> rnd-write  125          159     +27.0%
> >>>>
> >>>> 2) QEMU
> >>>> Fio with libaio ioengine on Fusion IO device using QEMU
> >>>> IOPS(k)    Before       After   Improvement
> >>>> seq-read   76           123     +61.8%
> >>>> seq-write  139          173     +24.4%
> >>>> rnd-read   73           120     +64.3%
> >>>> rnd-write  75           156     +108.0%
> >>>
> >>> Could you compare with dataplane qemu as well please?
> >>
> >>
> >> Well, I will try to collect it.
> >>
> >>>
> >>>>
> >>>> Userspace bits:
> >>>> -----------------------------
> >>>> 1) LKVM
> >>>> The latest vhost-blk userspace bits for kvm tool can be found here:
> >>>> git@...hub.com:asias/linux-kvm.git blk.vhost-blk
> >>>>
> >>>> 2) QEMU
> >>>> The latest vhost-blk userspace prototype for QEMU can be found here:
> >>>> git@...hub.com:asias/qemu.git blk.vhost-blk
> >>>>
> >>>> Changes in v5:
> >>>> - Do not assume the buffer layout
> >>>> - Fix wakeup race
> >>>>
> >>>> Changes in v4:
> >>>> - Mark req->status as userspace pointer
> >>>> - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
> >>>> - Add if (need_resched()) schedule() in blk thread
> >>>> - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
> >>>> - Use vq_err() instead of pr_warn()
> >>>> - Fail on unsupported requests
> >>>> - Add flush in vhost_blk_set_features()
> >>>>
> >>>> Changes in v3:
> >>>> - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
> >>>> - Check file passed by user is a raw block device file
> >>>>
> >>>> Signed-off-by: Asias He <asias@...hat.com>
> >>>
> >>> Since there are files shared by this and vhost net
> >>> it's easiest for me to merge this all through the
> >>> vhost tree.
> >>>
> >>> Jens, could you ack this and the bio usage in this driver
> >>> please?
> >>>
> >>>> ---
> >>>>  drivers/vhost/Kconfig     |   1 +
> >>>>  drivers/vhost/Kconfig.blk |  10 +
> >>>>  drivers/vhost/Makefile    |   2 +
> >>>>  drivers/vhost/blk.c       | 697 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  drivers/vhost/blk.h       |   8 +
> >>>>  5 files changed, 718 insertions(+)
> >>>>  create mode 100644 drivers/vhost/Kconfig.blk
> >>>>  create mode 100644 drivers/vhost/blk.c
> >>>>  create mode 100644 drivers/vhost/blk.h
> >>>>
> >>>> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> >>>> index 202bba6..acd8038 100644
> >>>> --- a/drivers/vhost/Kconfig
> >>>> +++ b/drivers/vhost/Kconfig
> >>>> @@ -11,4 +11,5 @@ config VHOST_NET
> >>>>  
> >>>>  if STAGING
> >>>>  source "drivers/vhost/Kconfig.tcm"
> >>>> +source "drivers/vhost/Kconfig.blk"
> >>>>  endif
> >>>> diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
> >>>> new file mode 100644
> >>>> index 0000000..ff8ab76
> >>>> --- /dev/null
> >>>> +++ b/drivers/vhost/Kconfig.blk
> >>>> @@ -0,0 +1,10 @@
> >>>> +config VHOST_BLK
> >>>> +	tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
> >>>> +	depends on BLOCK &&  EXPERIMENTAL && m
> >>>> +	---help---
> >>>> +	  This kernel module can be loaded in host kernel to accelerate
> >>>> +	  guest block with virtio_blk. Not to be confused with virtio_blk
> >>>> +	  module itself which needs to be loaded in guest kernel.
> >>>> +
> >>>> +	  To compile this driver as a module, choose M here: the module will
> >>>> +	  be called vhost_blk.
> >>>> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> >>>> index a27b053..1a8a4a5 100644
> >>>> --- a/drivers/vhost/Makefile
> >>>> +++ b/drivers/vhost/Makefile
> >>>> @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
> >>>>  vhost_net-y := vhost.o net.o
> >>>>  
> >>>>  obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
> >>>> +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
> >>>> +vhost_blk-y := blk.o
> >>>> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
> >>>> new file mode 100644
> >>>> index 0000000..f0f118a
> >>>> --- /dev/null
> >>>> +++ b/drivers/vhost/blk.c
> >>>> @@ -0,0 +1,697 @@
> >>>> +/*
> >>>> + * Copyright (C) 2011 Taobao, Inc.
> >>>> + * Author: Liu Yuan <tailai.ly@...bao.com>
> >>>> + *
> >>>> + * Copyright (C) 2012 Red Hat, Inc.
> >>>> + * Author: Asias He <asias@...hat.com>
> >>>> + *
> >>>> + * This work is licensed under the terms of the GNU GPL, version 2.
> >>>> + *
> >>>> + * virtio-blk server in host kernel.
> >>>> + */
> >>>> +
> >>>> +#include <linux/miscdevice.h>
> >>>> +#include <linux/module.h>
> >>>> +#include <linux/vhost.h>
> >>>> +#include <linux/virtio_blk.h>
> >>>> +#include <linux/mutex.h>
> >>>> +#include <linux/file.h>
> >>>> +#include <linux/kthread.h>
> >>>> +#include <linux/blkdev.h>
> >>>> +#include <linux/llist.h>
> >>>> +
> >>>> +#include "vhost.c"
> >>>> +#include "vhost.h"
> >>>> +#include "blk.h"
> >>>> +
> >>>> +static DEFINE_IDA(vhost_blk_index_ida);
> >>>> +
> >>>> +enum {
> >>>> +	VHOST_BLK_VQ_REQ = 0,
> >>>> +	VHOST_BLK_VQ_MAX = 1,
> >>>> +};
> >>>> +
> >>>> +struct req_page_list {
> >>>> +	struct page **pages;
> >>>> +	int pages_nr;
> >>>> +};
> >>>> +
> >>>> +struct vhost_blk_req {
> >>>> +	struct llist_node llnode;
> >>>> +	struct req_page_list *pl;
> >>>> +	struct vhost_blk *blk;
> >>>> +
> >>>> +	struct iovec *iov;
> >>>> +	int iov_nr;
> >>>> +
> >>>> +	struct bio **bio;
> >>>> +	atomic_t bio_nr;
> >>>> +
> >>>> +	struct iovec status[1];
> >>>> +
> >>>> +	sector_t sector;
> >>>> +	int write;
> >>>> +	u16 head;
> >>>> +	long len;
> >>>> +};
> >>>> +
> >>>> +struct vhost_blk {
> >>>> +	struct task_struct *host_kick;
> >>>
> >>> Could you please add a comment here explaining why
> >>> is this better than using the builtin vhost thread?
> >>
> >> With the separate thread model, request submission and request
> >> completion can be handled in parallel. I have measured significant
> >> performance improvement with this model. In the long term, it would
> >> be nice if the vhost core could provide multiple-thread support.
> > 
> > Strange, all the completion does is write out the used ring.
> > I am guessing you didn't complete requests
> > in a timely manner, which was what caused the issue.
> 
> Previously, there were two 'struct vhost_work' works which polled on the
> guest kick eventfd and the host aio eventfd. The vhost thread handles
> the works when there is data in an eventfd, either a guest kick or an
> aio completion.

See VHOST_NET_WEIGHT for how in -net we prevent starvation
between tx and rx.

> > OTOH I'm not sure how increasing the number of threads
> > scales with # of VMs. Increased scheduler pressure is
> > also a concern.
> 
> Agree, this is also my concern. However, this is a trade-off: we have a
> price to pay for the parallelism.

If the work offloaded is trivial in size then the gain is likely
to be marginal.

> Also, one thread per vhost device
> might not scale with the # of VMs either, compared to e.g. a per-CPU thread.

I've no idea how to do QoS with a per-CPU thread otherwise.

> > Did you try checking completion after each
> > submitted request to address this?
> 
> Isn't this a sync operation? We cannot rely only on checking the
> completion in the submit path. Another poll on the completion is
> needed anyway. Mixing the completion into the submit path 1) makes the
> submit delay longer and 2) does not help with the case where a single
> request takes a relatively long time to finish. IMHO, I do not think
> mixing the submit and completion is a good idea.

To clarify, you would
1. do llist_del_all in handle_kick -
   the delay this adds should be trivial
2. for completion, queue work on the vhost thread -
   this can be as simple as waking up the regular
   kick handler

You can further reduce overhead by batching
the wakeups in 2. See my patch
"vhost-net: reduce vq polling on tx zerocopy",
which does this for -net.

> -- 
> Asias
