netdev - Re: Initial thoughts on TXDP


Date:   Fri, 2 Dec 2016 14:01:02 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Hannes Frederic Sowa <hannes@...essinduktion.org>
Cc:     brouer@...hat.com, Tom Herbert <tom@...bertland.com>,
        Florian Westphal <fw@...len.de>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Alexander Duyck <alexander.duyck@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        linux-mm <linux-mm@...ck.org>
Subject: Re: Initial thoughts on TXDP

On Thu, 1 Dec 2016 23:47:44 +0100
Hannes Frederic Sowa <hannes@...essinduktion.org> wrote:

> Side note:
> 
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> >  320 byte buffer gets allocated).

Yes, the size of the allocation from the SLUB allocator does not change
the base performance/cost much (at least for small objects, i.e. < 1024
bytes).

Do notice that the base SLUB alloc+free cost is fairly high (compared to
a budget of 201 cycles), especially for networking, as the free side is
very likely to hit a slow path.  The SLUB fast-path costs 53 cycles and
the slow-path around 100 cycles (data from [1]).  I've tried to address
this with the kmem_cache bulk APIs, which reduce the cost to approx 30
cycles.  (Something we have not fully reaped the benefit from yet!)

[1] https://git.kernel.org/torvalds/c/ca257195511
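
To put these numbers in proportion, here is a small sketch in plain
userspace C that expresses each alloc+free cost as a share of the
201-cycle budget.  The cycle counts are the ones quoted above; the macro
and function names are made up for illustration and do not exist in the
kernel.

```c
#include <stdio.h>

/* Per-packet cycle budget quoted in the text above. */
#define BUDGET_CYCLES		201.0

/* Approximate SLUB alloc+free costs quoted above (cycles per object). */
#define SLUB_FASTPATH_CYCLES	53.0
#define SLUB_SLOWPATH_CYCLES	100.0
#define SLUB_BULK_CYCLES	30.0

/* Hypothetical helper: share of the per-packet budget, in percent. */
double budget_share_pct(double cycles)
{
	return 100.0 * cycles / BUDGET_CYCLES;
}

/* Print one line per case; purely illustrative. */
void print_budget_shares(void)
{
	printf("fast-path: %.1f%% of budget\n",
	       budget_share_pct(SLUB_FASTPATH_CYCLES));
	printf("slow-path: %.1f%% of budget\n",
	       budget_share_pct(SLUB_SLOWPATH_CYCLES));
	printf("bulk API:  %.1f%% of budget\n",
	       budget_share_pct(SLUB_BULK_CYCLES));
}
```

Calling print_budget_shares() shows that the slow path alone eats about
half of the budget, which is why the free side matters so much here.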

> >> >  
> > It's the zero'ing of three cache lines. I believe we talked about that
> > at netdev.

Actually 4 cache-lines, but with some cleanup I believe we can get down
to clearing 192 bytes, i.e. 3 cache-lines.
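
With 64-byte cache lines, 256 bytes is 4 lines and 192 bytes is 3.  A
minimal userspace sketch of the idea, assuming 64-byte cache lines and a
made-up struct layout (this is NOT the real struct sk_buff): keep the
fields that must be zeroed at the front and clear only up to a marker
field.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINE_BYTES 64	/* assumption: 64-byte cache lines */

/* Hypothetical stand-in for an skb-like struct; not the real layout. */
struct demo_skb {
	unsigned char cleared_part[192];	/* fields that start out zeroed */
	unsigned int  tail;			/* first field the clear skips */
	unsigned int  end;
};

/* Cache lines touched when clearing 'bytes' from a line-aligned start. */
size_t cache_lines_for(size_t bytes)
{
	return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Zero only the fields in front of 'tail'. */
void demo_skb_clear(struct demo_skb *skb)
{
	memset(skb, 0, offsetof(struct demo_skb, tail));
}
```

Here cache_lines_for(256) gives 4 and cache_lines_for(192) gives 3, so
the cleanup saves one cache line of stores per allocation.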

> 
> Jesper and me played with that again very recently:
> 
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
> 
> In micro-benchmarks we saw a pretty good speed up not using the rep
> stosb generated by gcc builtin but plain movq's. Probably the cost model
> for __builtin_memset in gcc is wrong?

Yes, I believe so.
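
For readers who want to see the two approaches side by side, here is a
rough userspace sketch.  It is not Hannes's code (that lives in the
time_bench_memset.c file linked above); the function names are invented
for this example, the inline asm assumes GCC or clang on x86-64, and
other targets simply fall back to memset().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Clear 'len' bytes with plain 64-bit stores.  'len' must be a multiple
 * of 8 and 'dst' must be 8-byte aligned.  The volatile qualifier keeps
 * the compiler from turning the loop back into a memset call.
 */
void clear_u64_stores(void *dst, size_t len)
{
	volatile uint64_t *p = dst;
	size_t i;

	for (i = 0; i < len / sizeof(uint64_t); i++)
		p[i] = 0;
}

/* Clear 'len' bytes with "rep stosb" on x86-64, memset() elsewhere. */
void clear_rep_stosb(void *dst, size_t len)
{
#if defined(__x86_64__) && defined(__GNUC__)
	/* RDI = destination, RCX = byte count, AL = fill value (0). */
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(dst, 0, len);
#endif
}
```

Both functions produce the same result; only the instruction sequence
differs, which is what the micro-benchmark compares.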

> When Jesper is free we wanted to benchmark this and maybe come up with an
> arch specific way of cleaning if it turns out to really improve throughput.
> 
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.

One strange thing was that on my Skylake CPU (i7-6700K @4.00GHz),
Hannes's hand-optimized MOVQ ASM code didn't go past 8 bytes per cycle,
i.e. 32 cycles for 256 bytes.

Talking to Alex and John during netdev, and reading up on the Intel arch,
I thought that this CPU should be able to perform 16 bytes per cycle.
The CPU can do it, as rep stos shows once the size gets large
enough.

On this CPU the memset rep stos starts to win around 512 bytes
(size in bytes / cycles measured):

 192/35 =  5.5 bytes/cycle
 256/36 =  7.1 bytes/cycle
 512/40 = 12.8 bytes/cycle
 768/46 = 16.7 bytes/cycle
1024/52 = 19.7 bytes/cycle
2048/84 = 24.4 bytes/cycle
4096/148= 27.7 bytes/cycle
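
The crossover can be recomputed from the table.  The sketch below is
plain userspace C using only the numbers quoted above; treating the MOVQ
code as a flat 8 bytes/cycle at every size is a simplifying assumption,
and all names are made up for this example.

```c
#include <stddef.h>

/* Measurements quoted above: memset size and cycles for rep stos. */
struct memset_sample {
	unsigned int bytes;
	unsigned int cycles;
};

const struct memset_sample rep_stos_samples[] = {
	{  192,  35 }, {  256,  36 }, {  512,  40 }, {  768,  46 },
	{ 1024,  52 }, { 2048,  84 }, { 4096, 148 },
};

#define N_SAMPLES (sizeof(rep_stos_samples) / sizeof(rep_stos_samples[0]))

/* Assumed flat throughput ceiling of the MOVQ code (bytes per cycle). */
#define MOVQ_BYTES_PER_CYCLE 8.0

double bytes_per_cycle(const struct memset_sample *s)
{
	return (double)s->bytes / s->cycles;
}

/* Smallest measured size where rep stos beats the MOVQ ceiling, 0 if none. */
unsigned int rep_stos_crossover_bytes(void)
{
	size_t i;

	for (i = 0; i < N_SAMPLES; i++)
		if (bytes_per_cycle(&rep_stos_samples[i]) > MOVQ_BYTES_PER_CYCLE)
			return rep_stos_samples[i].bytes;
	return 0;
}
```

With these samples rep_stos_crossover_bytes() returns 512, which matches
the observation above: at 256 bytes rep stos is still just below
8 bytes/cycle, at 512 bytes it is well past it.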

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer