netdev - Re: [PATCH 1/2] vhost: allow multiple workers threads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <OFC071CE44.FE2F733D-ON862579AA.0055FCA0-862579AA.0057082B@us.ibm.com>
Date:	Mon, 20 Feb 2012 09:50:37 -0600
From:	Tom Lendacky <toml@...ibm.com>
To:	"Michael S. Tsirkin" <mst@...hat.com>
Cc:	Anthony Liguori <aliguori@...ibm.com>, netdev@...r.kernel.org,
	Cristian Viana <vianac@...ibm.com>
Subject: Re: [PATCH 1/2] vhost: allow multiple workers threads

"Michael S. Tsirkin" <mst@...hat.com> wrote on 02/19/2012 08:41:45 AM:

> From: "Michael S. Tsirkin" <mst@...hat.com>
> To: Anthony Liguori/Austin/IBM@...US
> Cc: netdev@...r.kernel.org, Tom Lendacky/Austin/IBM@...US, Cristian
> Viana <vianac@...ibm.com>
> Date: 02/19/2012 08:42 AM
> Subject: Re: [PATCH 1/2] vhost: allow multiple workers threads
>
> On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
> > This patch allows vhost to have multiple worker threads for devices
such as
> > virtio-net which may have multiple virtqueues.
> >
> > Since virtqueues are a lockless ring queue, in an ideal world data is
being
> > produced by the producer as fast as data is being consumed by the
consumer.
> > These loops will continue to consume data until none is left.
> >
> > vhost currently multiplexes the consumer side of the queue on a
> single thread
> > by attempting to read from the queue until everything is read or it
cannot
> > process anymore.  This means that activity on one queue may stall
> another queue.
>
> There's actually an attempt to address this: look up
> VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
>
> > This is exacerbated when using any form of polling to read from
> the queues (as
> > we'll introduce in the next patch).  By spawning a thread per-
> virtqueue, this
> > is addressed.
> >
> > The only problem with this patch right now is how the wake up of
> the threads is
> > done.  It's essentially a broadcast and we have seen lock contention as
a
> > result.
>
> On which lock?

The mutex lock in the vhost_virtqueue struct.  This really shows up when
running with patch 2/2 and increasing the spin_threshold. Both threads wake
up and try to acquire the mutex.  As the spin_threshold increases you end
up
with one of the threads getting blocked for a longer and longer time and
unable to do any RX processing that might be needed.

Tom

>
> > We've tried some approaches to signal a single thread but I'm not
> > confident that that code is correct yet so I'm only sending the
broadcast
> > version.
>
> Yes, that looks like an obvious question.
>
> > Here are some performance results from this change.  There's a modest
> > improvement with stream although a fair bit of variability too.
> >
> > With RR, there's pretty significant improvements as the instance
> rate drives up.
>
> Interesting. This was actually tested at one time and we saw
> a significant performance improvement from using
> a single thread especially with a single
> stream in the guest. Profiling indicated that
> with a single thread we get too many context
> switches between TX and RX, since guest networking
> tends to run TX and RX processing on the same
> guest VCPU.
>
> Maybe we were wrong or maybe this went away
> for some reason. I'll see if this can be reproduced.
>
> >
> > Test, Size, Instance, Baseline, Patch, Relative
> >
> > TCP_RR
> >    Tx:256 Rx:256
> >       1   9,639.66   10,164.06   105.44%
> >       10   62,819.55   54,059.78   86.06%
> >       30   84,715.60   131,241.86   154.92%
> >       60   124,614.71   148,720.66   119.34%
> >
> > UDP_RR
> >    Tx:256 Rx:256
> >       1   9,652.50   10,343.72   107.16%
> >       10   53,830.26   58,235.90   108.18%
> >       30   89,471.01   97,634.53   109.12%
> >       60   103,640.59   164,035.01   158.27%
> >
> > TCP_STREAM
> >    Tx: 256
> >       1   2,622.63   2,610.71   99.55%
> >       4   4,928.02   4,812.05   97.65%
> >
> >    Tx: 1024
> >       1   5,639.89   5,751.28   101.97%
> >       4   5,874.72   6,575.55   111.93%
> >
> >    Tx: 4096
> >       1   6,257.42   7,655.22   122.34%
> >       4   5,370.78   6,044.83   112.55%
> >
> >    Tx: 16384
> >       1   6,346.63   7,267.44   114.51%
> >       4   5,198.02   5,657.12   108.83%
> >
> > TCP_MAERTS
> >    Rx: 256
> >       1   2,091.38   1,765.62   84.42%
> >       4   5,319.52   5,619.49   105.64%
> >
> >    Rx: 1024
> >       1   7,030.66   7,593.61   108.01%
> >       4   9,040.53   7,275.84   80.48%
> >
> >    Rx: 4096
> >       1   9,160.93   9,318.15   101.72%
> >       4   9,372.49   8,875.63   94.70%
> >
> >    Rx: 16384
> >       1   9,183.28   9,134.02   99.46%
> >       4   9,377.17   8,877.52   94.67%
> >
> >                      106.46%
>
> One point missing here is the ratio of
> throughput divided by host CPU utilization.
>
> The concern being to avoid hurting
> heavily loaded systems.
>
> These numbers also tend to be less variable as
> they depend less on host scheduler decisions.
>
>
> > Cc: Tom Lendacky <toml@...ibm.com>
> > Cc: Cristian Viana <vianac@...ibm.com>
> > Signed-off-by: Anthony Liguori <aliguori@...ibm.com>
> > ---
> >  drivers/vhost/net.c   |    6 ++++-
> >  drivers/vhost/vhost.c |   51 +++++++++++++++++++++++++++++
> +------------------
> >  drivers/vhost/vhost.h |    8 +++++-
> >  3 files changed, 43 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 9dab1f5..47175cd 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -33,6 +33,10 @@ static int experimental_zcopytx;
> >  module_param(experimental_zcopytx, int, 0444);
> >  MODULE_PARM_DESC(experimental_zcopytx, "Enable Experimental Zero Copy
TX");
> >
> > +static int workers = 2;
> > +module_param(workers, int, 0444);
> > +MODULE_PARM_DESC(workers, "Set the number of worker threads");
> > +
> >  /* Max number of bytes transferred before requeueing the job.
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> > @@ -504,7 +508,7 @@ static int vhost_net_open(struct inode *inode,
> struct file *f)
> >     dev = &n->dev;
> >     n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> >     n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
> > -   r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
> > +   r = vhost_dev_init(dev, n->vqs, workers, VHOST_NET_VQ_MAX);
> >     if (r < 0) {
> >        kfree(n);
> >        return r;
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index c14c42b..676cead 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -144,9 +144,12 @@ static inline void vhost_work_queue(struct
> vhost_dev *dev,
> >
> >     spin_lock_irqsave(&dev->work_lock, flags);
> >     if (list_empty(&work->node)) {
> > +      int i;
> > +
> >        list_add_tail(&work->node, &dev->work_list);
> >        work->queue_seq++;
> > -      wake_up_process(dev->worker);
> > +      for (i = 0; i < dev->nworkers; i++)
> > +         wake_up_process(dev->workers[i]);
> >     }
> >     spin_unlock_irqrestore(&dev->work_lock, flags);
> >  }
> > @@ -287,7 +290,7 @@ static void vhost_dev_free_iovecs(struct vhost_dev
*dev)
> >  }
> >
> >  long vhost_dev_init(struct vhost_dev *dev,
> > -          struct vhost_virtqueue *vqs, int nvqs)
> > +          struct vhost_virtqueue *vqs, int nworkers, int nvqs)
> >  {
> >     int i;
> >
> > @@ -300,7 +303,9 @@ long vhost_dev_init(struct vhost_dev *dev,
> >     dev->mm = NULL;
> >     spin_lock_init(&dev->work_lock);
> >     INIT_LIST_HEAD(&dev->work_list);
> > -   dev->worker = NULL;
> > +   dev->nworkers = min(nworkers, VHOST_MAX_WORKERS);
> > +   for (i = 0; i < dev->nworkers; i++)
> > +      dev->workers[i] = NULL;
> >
> >     for (i = 0; i < dev->nvqs; ++i) {
> >        dev->vqs[i].log = NULL;
> > @@ -354,7 +359,7 @@ static int vhost_attach_cgroups(struct vhost_dev
*dev)
> >  static long vhost_dev_set_owner(struct vhost_dev *dev)
> >  {
> >     struct task_struct *worker;
> > -   int err;
> > +   int err, i;
> >
> >     /* Is there an owner already? */
> >     if (dev->mm) {
> > @@ -364,28 +369,34 @@ static long vhost_dev_set_owner(struct vhost_dev
*dev)
> >
> >     /* No owner, become one */
> >     dev->mm = get_task_mm(current);
> > -   worker = kthread_create(vhost_worker, dev, "vhost-%d", current->
pid);
> > -   if (IS_ERR(worker)) {
> > -      err = PTR_ERR(worker);
> > -      goto err_worker;
> > -   }
> > +   for (i = 0; i < dev->nworkers; i++) {
> > +      worker = kthread_create(vhost_worker, dev, "vhost-%d.%d",
> current->pid, i);
> > +      if (IS_ERR(worker)) {
> > +         err = PTR_ERR(worker);
> > +         goto err_worker;
> > +      }
> >
> > -   dev->worker = worker;
> > -   wake_up_process(worker);   /* avoid contributing to loadavg */
> > +      dev->workers[i] = worker;
> > +      wake_up_process(worker);   /* avoid contributing to loadavg */
> > +   }
> >
> >     err = vhost_attach_cgroups(dev);
> >     if (err)
> > -      goto err_cgroup;
> > +      goto err_worker;
> >
> >     err = vhost_dev_alloc_iovecs(dev);
> >     if (err)
> > -      goto err_cgroup;
> > +      goto err_worker;
> >
> >     return 0;
> > -err_cgroup:
> > -   kthread_stop(worker);
> > -   dev->worker = NULL;
> > +
> >  err_worker:
> > +   for (i = 0; i < dev->nworkers; i++) {
> > +      if (dev->workers[i]) {
> > +         kthread_stop(dev->workers[i]);
> > +         dev->workers[i] = NULL;
> > +      }
> > +   }
> >     if (dev->mm)
> >        mmput(dev->mm);
> >     dev->mm = NULL;
> > @@ -475,9 +486,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> >                 lockdep_is_held(&dev->mutex)));
> >     RCU_INIT_POINTER(dev->memory, NULL);
> >     WARN_ON(!list_empty(&dev->work_list));
> > -   if (dev->worker) {
> > -      kthread_stop(dev->worker);
> > -      dev->worker = NULL;
> > +   for (i = 0; i < dev->nworkers; i++) {
> > +      if (dev->workers[i]) {
> > +         kthread_stop(dev->workers[i]);
> > +         dev->workers[i] = NULL;
> > +      }
> >     }
> >     if (dev->mm)
> >        mmput(dev->mm);
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index a801e28..fa5e75e 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -18,6 +18,9 @@
> >  #define VHOST_DMA_DONE_LEN   1
> >  #define VHOST_DMA_CLEAR_LEN   0
> >
> > +/* Maximum number of worker threads per-vhost device */
> > +#define VHOST_MAX_WORKERS   8
> > +
> >  struct vhost_device;
> >
> >  struct vhost_work;
> > @@ -157,10 +160,11 @@ struct vhost_dev {
> >     struct eventfd_ctx *log_ctx;
> >     spinlock_t work_lock;
> >     struct list_head work_list;
> > -   struct task_struct *worker;
> > +   int nworkers;
> > +   struct task_struct *workers[VHOST_MAX_WORKERS];
> >  };
> >
> > -long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue
> *vqs, int nvqs);
> > +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue
> *vqs, int nworkers, int nvqs);
> >  long vhost_dev_check_owner(struct vhost_dev *);
> >  long vhost_dev_reset_owner(struct vhost_dev *);
> >  void vhost_dev_cleanup(struct vhost_dev *);
> > --
> > 1.7.4.1
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html