linux-kernel - Re: [PATCH 03/18] xen/pvcalls: initialize the module and register the xenbus backend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.10.1705181250410.6833@sstabellini-ThinkPad-X260>
Date:   Thu, 18 May 2017 14:18:18 -0700 (PDT)
From:   Stefano Stabellini <sstabellini@...nel.org>
To:     Juergen Gross <jgross@...e.com>
cc:     Stefano Stabellini <sstabellini@...nel.org>,
        xen-devel@...ts.xen.org, linux-kernel@...r.kernel.org,
        boris.ostrovsky@...cle.com,
        Stefano Stabellini <stefano@...reto.com>
Subject: Re: [PATCH 03/18] xen/pvcalls: initialize the module and register
 the xenbus backend

On Wed, 17 May 2017, Juergen Gross wrote:
> On 16/05/17 21:58, Stefano Stabellini wrote:
> > On Tue, 16 May 2017, Juergen Gross wrote:
> >> On 15/05/17 22:35, Stefano Stabellini wrote:
> >>> The pvcalls backend has one ioworker per cpu: the ioworkers are
> >>> implemented as a cpu bound workqueue, and will deal with the actual
> >>> socket and data ring reads/writes.
> >>>
> >>> ioworkers are global: we only have one set for all the frontends. They
> >>> process requests on their wqs list in order, once they are done with a
> >>> request, they'll remove it from the list. A spinlock is used for
> >>> protecting the list. Each ioworker is bound to a different cpu to
> >>> maximize throughput.
> >>>
> >>> Signed-off-by: Stefano Stabellini <stefano@...reto.com>
> >>> CC: boris.ostrovsky@...cle.com
> >>> CC: jgross@...e.com
> >>> ---
> >>>  drivers/xen/pvcalls-back.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>  1 file changed, 64 insertions(+)
> >>>
> >>> diff --git a/drivers/xen/pvcalls-back.c b/drivers/xen/pvcalls-back.c
> >>> index 2dbf7d8..46a889a 100644
> >>> --- a/drivers/xen/pvcalls-back.c
> >>> +++ b/drivers/xen/pvcalls-back.c
> >>> @@ -25,6 +25,26 @@
> >>>  #include <xen/xenbus.h>
> >>>  #include <xen/interface/io/pvcalls.h>
> >>>  
> >>> +struct pvcalls_ioworker {
> >>> +	struct work_struct register_work;
> >>> +	atomic_t io;
> >>> +	struct list_head wqs;
> >>> +	spinlock_t lock;
> >>> +	int num;
> >>> +};
> >>> +
> >>> +struct pvcalls_back_global {
> >>> +	struct pvcalls_ioworker *ioworkers;
> >>> +	int nr_ioworkers;
> >>> +	struct workqueue_struct *wq;
> >>> +	struct list_head privs;
> >>> +	struct rw_semaphore privs_lock;
> >>> +} pvcalls_back_global;
> >>> +
> >>> +static void pvcalls_back_ioworker(struct work_struct *work)
> >>> +{
> >>> +}
> >>> +
> >>>  static int pvcalls_back_probe(struct xenbus_device *dev,
> >>>  			      const struct xenbus_device_id *id)
> >>>  {
> >>> @@ -59,3 +79,47 @@ static int pvcalls_back_uevent(struct xenbus_device *xdev,
> >>>  	.uevent = pvcalls_back_uevent,
> >>>  	.otherend_changed = pvcalls_back_changed,
> >>>  };
> >>> +
> >>> +static int __init pvcalls_back_init(void)
> >>> +{
> >>> +	int ret, i, cpu;
> >>> +
> >>> +	if (!xen_domain())
> >>> +		return -ENODEV;
> >>> +
> >>> +	ret = xenbus_register_backend(&pvcalls_back_driver);
> >>> +	if (ret < 0)
> >>> +		return ret;
> >>> +
> >>> +	init_rwsem(&pvcalls_back_global.privs_lock);
> >>> +	INIT_LIST_HEAD(&pvcalls_back_global.privs);
> >>> +	pvcalls_back_global.wq = alloc_workqueue("pvcalls_io", 0, 0);
> >>> +	if (!pvcalls_back_global.wq)
> >>> +		goto error;
> >>> +	pvcalls_back_global.nr_ioworkers = num_online_cpus();
> >>
> >> Really? Recently I cam across a system with 640 dom0 cpus. I don't think
> >> we want 640 workers initialized when loading the backend module. I'd
> >> prefer one or a few workers per connected frontend.
> > 
> > I think we want to keep the ioworker allocation to be based on the
> > number of vcpus: we do not want more ioworkers than vcpus because it is
> > a waste of resources and leads to worse performance.  Also, given that
> > they do memcpy's, I also think it is a good idea to bind them to vcpus
> > (and pin vcpus to pcpus) to get best performance.
> 
> This will cause a lot of pain for the cpu offline case. Please don't try
> to work against the hypervisor scheduler by designing a backend based on
> a vcpu pin policy. This might result in best performance for your
> special workload, but generally it is a bad idea!

You are right. Of course, vcpu pinning is not a fundamental requirement
for this backend. I wrote about vcpu pinning only to help with the
explanation.

However, pvcalls is a memcpy based protocol and to perform memcpys
efficiently is very important to keep caches hot. The target is to hit
the same cacheline when reading and writing, which makes a huge
difference; it depends on processor and architecture but it is easily
20%. To get caching benefits, we need to do memcpys for the same socket
on the same vcpu (and on the same pcpu as well, that's why I mentioned
vcpu pinning, but we'll trust the Xen scheduler to do the right thing
when there is no contention).

This is why in this backend, regardless of the workqueue
design/allocation we use, I think we have to stick to two basic
principles:

- each socket is bound to one vcpu
- sockets are distributed evenly across vcpus


> > However, you have a point there: we need to handle systems with an
> > extremely large number of Dom0 vcpus. I suggest we introduce an
> > upper limit for the number of ioworkers. Something like:
> > 
> > #define MAX_IOWORKERS 64
> > nr_ioworkers = min(MAX_IOWORKERS, num_online_cpus())
> > 
> > MAX_IOWORKERS could be configurable via a command line option.
> 
> Later you are assigning each active socket to exactly one ioworker.
> Wouldn't it make more sense to allocate the ioworker when doing
> the connect? This would avoid the problem of having only a statistical
> distribution, possibly with all sockets on the same ioworker.
>
> Basically you are re-inventing the wheel by using an own workqueue
> implementation in each ioworker looping through all assigned sockets.

It might be possible to create an ioworker for each socket (instead of
an ioworker for each vcpu) if we wanted to, as long as we bind it to a
vcpu and distribute them evenly across vcpus.