Date:	Wed, 18 May 2016 12:55:17 +0300
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Jesper Dangaard Brouer <brouer@...hat.com>
Cc:	Jason Wang <jasowang@...hat.com>, davem@...emloft.net,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next] tuntap: introduce tx skb ring

On Wed, May 18, 2016 at 11:21:29AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 18 May 2016 11:21:59 +0300
> "Michael S. Tsirkin" <mst@...hat.com> wrote:
> 
> > On Wed, May 18, 2016 at 10:16:31AM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > On Tue, 17 May 2016 09:38:37 +0800 Jason Wang <jasowang@...hat.com> wrote:
> > >   
> > > > >> And if tx_queue_length is not a power of 2,
> > > > >> we probably need a modulus to calculate the capacity.
> > > > > Is that really that important for speed?    
> > > > 
> > > > Not sure, I can test.  
> > > 
> > > In my experience, yes, adding a modulus does affect performance.  
> > 
> > How about a simple
> > 	if (unlikely(++idx >= size))
> > 		idx = 0;
> 
> So, you are exchanging an AND-operation with a mask for a
> branch operation.  If the branch predictor is good enough in the CPU
> and code-"size" use-case, then it could be just as fast.
> 
> I've actually played with a lot of different approaches:
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue_helpers.h
> 
> I cannot remember the exact results.  I do remember that micro-benchmarking
> showed good results with the advanced "unroll" approach, but IPv4
> forwarding, where I know the I-cache is getting evicted, showed the best
> results with the simpler implementations.
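
For illustration (this is not code from the patch), a minimal sketch of the
three index-advance variants being compared above; "size" stands for the ring
capacity, and the mask form assumes it is a power of two:

	/* 1. Modulus: correct for any size, but the division is relatively
	 *    expensive on the fast path.
	 */
	idx = (idx + 1) % size;

	/* 2. AND with a mask: cheapest, but requires size to be a power
	 *    of two.
	 */
	idx = (idx + 1) & (size - 1);

	/* 3. Branch: correct for any size; cost depends on the branch
	 *    predictor.
	 */
	if (unlikely(++idx >= size))
		idx = 0;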

This is all assuming you can somehow batch operations.
We can do this for transmit sometimes (when linux
is the source of the packets) but not always.
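
A rough sketch of what a batched consumer could look like; the function name
and layout here are hypothetical, not the patch's API, and the ring is
assumed to be an array of pointers with NULL marking an empty slot:

	/* Hypothetical batched dequeue: pull up to n entries per call so
	 * the index update and wrap check are amortized over the batch.
	 * Returns the number of entries copied into batch[].
	 */
	static int ring_consume_batch(void **ring, unsigned int size,
				      unsigned int *consumer,
				      void **batch, unsigned int n)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			void *entry = ring[*consumer];

			if (!entry)
				break;		/* ring is empty */
			batch[i] = entry;
			ring[*consumer] = NULL;
			if (unlikely(++*consumer >= size))
				*consumer = 0;
		}

		return i;
	}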

> 
> > > > 
> > > > Right, this sounds a good solution.  
> > > 
> > > Good idea.  
> > 
> > I'm not that sure - it's clearly wasting memory.
> 
> Rounding up to a power of two.  In this case I don't think the memory
> waste is too high, as we are talking about at most 16-byte elements.

It almost doubles it.
E.g. a queue size of 10000 (rather common) will become 16K entries, wasting over 6K entries.
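
Spelled out, assuming 16-byte slots and rounding via roundup_pow_of_two()
from <linux/log2.h> (tx_queue_len and ring_size are stand-in names, only
meant to show the scale of the overhead):

	/* Rough arithmetic, assuming 16-byte slots:
	 * requested:   10000 entries -> 10000 * 16 = 160000 bytes
	 * rounded up:  16384 entries -> 16384 * 16 = 262144 bytes
	 * overhead:     6384 entries ->  6384 * 16 = 102144 bytes (~100 KB)
	 */
	ring_size = roundup_pow_of_two(tx_queue_len);	/* 10000 -> 16384 */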

> I am concerned about memory in another way. We need to keep these
> arrays/rings small, due to data cache usage.  A 4096-entry ring is bad
> because e.g. 16*4096 = 65536 bytes, while a typical L1 cache is 32K-64K.  As
> this is a circular buffer, we walk over this memory all the time, thus
> evicting the L1 cache.

Depends on the usage I guess.
The entries pointed to are much bigger, and you are
going to access them - is this really an issue?
If yes, this shouldn't be that hard to fix ...

> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
