Date:	Sat, 30 Oct 2010 22:06:03 +0200
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Shirley Ma <mashirle@...ibm.com>
Cc:	David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
	kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

On Fri, Oct 29, 2010 at 08:43:08AM -0700, Shirley Ma wrote:
> On Fri, 2010-10-29 at 10:10 +0200, Michael S. Tsirkin wrote:
> > Hmm. I don't yet understand. We are still doing copies into the per-vq
> > buffer, and the data copied is really small.  Is it about cache line
> > bounces?  Could you try figuring it out?
> 
> The per-vq buffer is much less expensive than the 3 put_copy() calls. I will
> collect the profiling data to show that.

What about __put_user? Maybe the access checks are the ones
that add the cost here? I attach patches to strip the access checks:
they are not needed, since we already do them at setup time anyway.
Could you try them out and see whether performance improves for you?
On top of this we will still need some scheme to accumulate signals,
but that is a separate issue.
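
Roughly, the kind of change I mean looks like this (just an illustrative
sketch of the direction, not the attached patches themselves; error reporting
via vq_err() is dropped, and it relies on struct vhost_virtqueue from
drivers/vhost/vhost.h):

/* Sketch: vhost_add_used()-style used-ring update.  put_user() repeats
 * the access_ok() range check on every call; since the used ring is
 * already validated with access_ok() when the ring address is set up,
 * the unchecked __put_user() variant is enough here. */
static int vhost_add_used_sketch(struct vhost_virtqueue *vq,
				 unsigned int head, int len)
{
	struct vring_used_elem __user *used;

	/* Next free entry in the used ring. */
	used = &vq->used->ring[vq->last_used_idx % vq->num];
	if (__put_user(head, &used->id) ||
	    __put_user(len, &used->len))
		return -EFAULT;
	/* Make sure the buffer is written before we update the index. */
	smp_wmb();
	if (__put_user(vq->last_used_idx + 1, &vq->used->idx))
		return -EFAULT;
	vq->last_used_idx++;
	return 0;
}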

> > > > 2. How about flushing out queued stuff before we exit
> > > >    the handle_tx loop? That would address most of
> > > >    the spec issue. 
> > > 
> > > The performance is almost the same as with the previous patch. I will
> > > resubmit the modified one, adding vhost_add_used_and_signal_n after the
> > > handle_tx loop to process the pending queue.
> > > 
> > > This patch was part of the modified macvtap zero-copy patch, which I
> > > haven't submitted yet. I found it helps vhost TX in general. This pending
> > > queue will later be used for DMA-done handling, so I put it in the vq
> > > instead of in a local variable in handle_tx.
> > > 
> > > Thanks
> > > Shirley
> > 
> > BTW why do we need another array? Isn't the heads field exactly what we
> > need here?
> 
> The heads field only holds up to 32 entries. The test results show that the
> more used buffers we accumulate before adding and signalling, the better
> the performance.

I think we should separate the used update and the signalling.  Interrupts
are expensive, so I can believe that accumulating even up to 100 of them
helps. But used head copies are already pretty cheap. If we cut their
overhead by 32x, that should make them almost free?
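
Concretely, something along these lines is what I have in mind (again only a
sketch; VHOST_TX_SIGNAL_BATCH and the pending counter are invented names for
illustration, while vhost_add_used() and vhost_signal() are the existing
helpers): do the used update per buffer, signal only once a batch has
accumulated, and flush any pending signal before handle_tx exits:

#define VHOST_TX_SIGNAL_BATCH	64

/* Per-buffer used-ring update, batched guest notification.
 * (vhost_add_used() error return is ignored in this sketch.) */
static void tx_buffer_done(struct vhost_dev *dev, struct vhost_virtqueue *vq,
			   unsigned int head, int len, unsigned int *pending)
{
	vhost_add_used(vq, head, len);		/* cheap per-buffer copy */
	if (++(*pending) >= VHOST_TX_SIGNAL_BATCH) {
		vhost_signal(dev, vq);		/* expensive interrupt, batched */
		*pending = 0;
	}
}

/* Call before leaving handle_tx so the guest is not left waiting. */
static void tx_flush(struct vhost_dev *dev, struct vhost_virtqueue *vq,
		     unsigned int *pending)
{
	if (*pending) {
		vhost_signal(dev, vq);
		*pending = 0;
	}
}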

> That was one of the reasons I didn't use heads. The other reason was that I
> use this buffer for pending DMA-done entries in the macvtap zero-copy patch.
> It could be up to vq->num in the worst case.

We can always increase that, not an issue.

> Thanks
> Shirley

View attachment "1" of type "text/plain" (809 bytes)

View attachment "2" of type "text/plain" (2890 bytes)
