Date:   Sun, 30 Dec 2018 13:40:42 -0500
From:   "Michael S. Tsirkin" <mst@...hat.com>
To:     Jason Wang <jasowang@...hat.com>
Cc:     netdev@...r.kernel.org
Subject: Re: thoughts stac/clac and get user for vhost

On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
> 
> > On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
> > > On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
> > > > Hi!
> > > > I was just wondering: packed ring batches things naturally.
> > > > E.g.
> > > > 
> > > > user_access_begin
> > > > check descriptor valid
> > > > smp_rmb
> > > > copy descriptor
> > > > user_access_end
> > > 
> > > But that works without speculating on the descriptor contents (which
> > > may only be possible for in-order devices, or may even violate the
> > > spec). Only the one or two accesses to a single descriptor can be
> > > batched. For the split ring, we can batch more, since we know how many
> > > descriptors are pending (avail_idx - last_avail_idx).
> > > 
> > > Anything I miss?
> > > 
> > > Thanks
> > > 
> > just check more descriptors in a loop:
> > 
> >   user_access_begin
> >   for (i = 0; i < 16; ++i) {
> > 	 if (!descriptor valid)
> > 		break;
> > 	 smp_rmb
> > 	 copy descriptor
> >   }
> >   user_access_end
> > 
> > you don't really need to know how many there are
> > ahead of time, as you still copy them one by one.
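
Michael's batched check-and-copy loop can be simulated in plain userspace C. This is only a sketch of the control flow, not vhost code: the descriptor layout and flag bits follow the virtio 1.1 packed-ring format, but user_access_begin()/smp_rmb() are reduced to comments, and the wrap-counter flip at the ring end is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Packed-ring descriptor layout per virtio 1.1. */
struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

#define VRING_DESC_F_AVAIL (1 << 7)
#define VRING_DESC_F_USED  (1 << 15)
#define BATCH 16

/* A descriptor is available to the device when its AVAIL bit matches
 * the driver's wrap counter and its USED bit does not. */
static int desc_is_avail(const struct vring_packed_desc *d, int wrap)
{
	int avail = !!(d->flags & VRING_DESC_F_AVAIL);
	int used  = !!(d->flags & VRING_DESC_F_USED);

	return avail == wrap && used != wrap;
}

/* One user_access_begin..user_access_end window: copy up to BATCH
 * available descriptors starting at *head, return how many were copied.
 * (For simplicity the demo does not flip the wrap counter at the end
 * of the ring.) */
static int fetch_batch(const struct vring_packed_desc *ring, int size,
		       int *head, int wrap,
		       struct vring_packed_desc *out)
{
	int i;

	/* user_access_begin() would go here in the kernel ... */
	for (i = 0; i < BATCH; i++) {
		const struct vring_packed_desc *d = &ring[*head];

		if (!desc_is_avail(d, wrap))
			break;
		/* ... smp_rmb() would go here, before the copy ... */
		out[i] = *d;
		if (++*head == size)
			*head = 0;
	}
	/* ... and user_access_end() here. */
	return i;
}
```

The point of the batching is that the begin/end pair (and the stac/clac instructions it expands to on x86, per the subject line) is paid once per batch of up to 16 descriptors rather than once per descriptor.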
> 
> 
> So let's see the case of split ring
> 
> 
> user_access_begin
> 
> n = avail_idx - last_avail_idx      (1)
> 
> n = MIN(n, 16)
> 
> smp_rmb
> 
> read n entries from avail_ring      (2)
> 
> for (i = 0; i < n; i++)
>     copy descriptor                 (3)
> 
> user_access_end
> 
> 
> Consider the case of a heavy workload. For the packed ring, we have 32
> userspace accesses and 16 smp_rmb() barriers (one validity check plus
> one copy, and one barrier, per descriptor).
> 
> For the split ring we have:
> 
> (1) 1 access
> 
> (2) 2 accesses at most (two copies only if the read wraps the ring)
> 
> (3) 16 accesses
> 
> That is 19 userspace accesses and 1 smp_rmb(). In fact (2) could be
> eliminated with in-order, and (3) could be batched completely with
> in-order, and partially when out of order.
> 
> I don't see how the packed ring helps here, especially considering
> that lfence on x86 is more than a memory fence: it actually prevents
> speculation.
> 
> Thanks
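
Jason's step (1) relies on the split ring's free-running 16-bit indices: unsigned arithmetic makes avail_idx - last_avail_idx correct even after the index wraps past 0xffff. A minimal illustration (the cap of 16 matches the batch size used in the pseudocode above; function names are mine):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_BATCH 16	/* batch cap from the pseudocode above */

/* Number of pending descriptors between two free-running u16 indices.
 * Unsigned 16-bit subtraction gives the right count even after
 * avail_idx has wrapped past 0xffff. */
static uint16_t pending(uint16_t avail_idx, uint16_t last_avail_idx)
{
	return (uint16_t)(avail_idx - last_avail_idx);
}

/* Step (1) plus the MIN(n, 16) cap from step after it. */
static uint16_t batch_size(uint16_t avail_idx, uint16_t last_avail_idx)
{
	uint16_t n = pending(avail_idx, last_avail_idx);

	return n < VHOST_BATCH ? n : VHOST_BATCH;
}
```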

So on x86 at least smp_rmb() is free; this is why I never bothered
optimizing it out. Is it still worth optimizing out on ARM? Does it
cost more than the extra indirection in the split ring?

But my point was really fundamental - if ring accesses are expensive
then we should batch them. Right now we have an API that gets
an iovec directly. That limits the optimizations you can do.

The translation works like this:

ring -> valid descriptors -> iovecs

We should have APIs for each step that work in batches.
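
As a hedged sketch of what such layered, batched APIs might look like: all names below are hypothetical (the real vhost code goes through vhost_get_vq_desc(), which produces one request's iovecs per call), and guest-address translation is stubbed out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor and iovec types for the sketch. */
struct desc { uint64_t addr; uint32_t len; int valid; };
struct iovec_s { void *base; size_t len; };

/* Step 1: ring -> valid descriptors. Copy up to 'max' valid
 * descriptors out of the ring in one pass; stop at the first
 * descriptor the guest has not yet made available. */
static int fetch_valid(const struct desc *ring, int size, int head,
		       int max, struct desc *out)
{
	int n = 0;

	while (n < max && ring[(head + n) % size].valid) {
		out[n] = ring[(head + n) % size];
		n++;
	}
	return n;
}

/* Step 2: valid descriptors -> iovecs, again as one batch. A real
 * implementation would translate guest addresses here; the demo just
 * reinterprets them. */
static int descs_to_iov(const struct desc *descs, int n,
			struct iovec_s *iov)
{
	for (int i = 0; i < n; i++) {
		iov[i].base = (void *)(uintptr_t)descs[i].addr; /* stub */
		iov[i].len  = descs[i].len;
	}
	return n;
}
```

With each step exposed separately and operating on arrays, a backend can amortize the user-access window over step 1 and defer or batch the translation in step 2, instead of paying both costs per request.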



> 
> > 
> > 
> > > > So packed layout should show the gain with this approach.
> > > > That could be motivation enough to finally enable vhost packed ring
> > > > support.
> > > > 
> > > > Thoughts?
> > > > 
