lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230322193309.GA32681@debian>
Date:   Wed, 22 Mar 2023 20:33:11 +0100
From:   Richard Gobert <richardbgobert@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>, pabeni@...hat.com
Cc:     Paolo Abeni <pabeni@...hat.com>, davem@...emloft.net,
        kuba@...nel.org, dsahern@...nel.org, alexanderduyck@...com,
        lucien.xin@...il.com, lixiaoyan@...gle.com, iwienand@...hat.com,
        leon@...nel.org, ye.xingchen@....com.cn, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

> On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > Currently the IPv6 extension headers are parsed twice: first in
> > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > >
> > > By using the new ->transport_proto field, and also storing the size of the
> > > network header, we can avoid parsing extension headers a second time in
> > > ipv6_gro_complete (which saves multiple memory dereferences and conditional
> > > checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> > > IPv6 packets).
> > >
> > > The implementation had to handle both inner and outer layers in case of
> > > encapsulation (as they can't use the same field). I've applied a similar
> > > optimisation to Ethernet.
> > >
> > > Performance tests for TCP stream over IPv6 with a varying amount of
> > > extension headers demonstrate throughput improvement of ~0.7%.
> >
> > I'm surprised that the improvement is measurable: for large aggregate
> > packets a single ipv6_exthdrs_len() call is avoided out of tens calls
> > for the individual pkts. Additionally such figure is comparable to
> > noise level in my tests.

It's not simple but I made an effort to make a quiet environment.
Correct configuration allows for this kind of measurements to be made
as the test is CPU bound and noise is a variance that can be reduced with 
enough samples.

Environment example: (100Gbit NIC (mlx5), physical machine, i9 12th
gen)

    # power-management and hyperthreading disabled in BIOS
    # sysctl preallocate net mem
    echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable turboboost
    ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE frames

    # Single core performance
    for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 >"$x"; done

    ./network-testing-master/bin/netfilter_unload_modules.sh 2>/dev/null # unload netfilter
    tuned-adm profile latency-performance
    cpupower frequency-set -f 2200MHz # Set core to specific frequency
    systemctl isolate rescue-ssh.target
    # and kill all processes besides init

> > This adds a couple of additional branches for the common (no extensions
> > header) case.

The additional branch in ipv6_gro_receive would be negligible or even
non-existent for a branch predictor in the common case
(non-encapsulated packets).
I could wrap it with a likely macro if you wish.
Inside ipv6_gro_complete a couple of branches are saved for the common
case as demonstrated below.

original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):

    // if (skb->encapsulation)

    ffffffff81c4962b:	f6 87 81 00 00 00 20 	testb  $0x20,0x81(%rdi)
    ffffffff81c49632:	74 2a                	je     ffffffff81c4965e <ipv6_gro_complete+0x3e>

    ...

    // nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);

    ffffffff81c4969c:	eb 1b                	jmp    ffffffff81c496b9 <ipv6_gro_complete+0x99>    <-- jump to beginning of for loop
    ffffffff81c4968e:   b8 28 00 00 00          mov    $0x28,%eax
    ffffffff81c49693:   31 f6                   xor    %esi,%esi
    ffffffff81c49695:   48 c7 c7 c0 28 aa 82    mov    $0xffffffff82aa28c0,%rdi
    ffffffff81c4969c:   eb 1b                   jmp    ffffffff81c496b9 <ipv6_gro_complete+0x99>
    ffffffff81c4969e:   f6 41 18 01             testb  $0x1,0x18(%rcx)
    ffffffff81c496a2:   74 34                   je     ffffffff81c496d8 <ipv6_gro_complete+0xb8>    <--- 3rd conditional check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
    ffffffff81c496a4:   48 98                   cltq  
    ffffffff81c496a6:   48 01 c2                add    %rax,%rdx
    ffffffff81c496a9:   0f b6 42 01             movzbl 0x1(%rdx),%eax
    ffffffff81c496ad:   0f b6 0a                movzbl (%rdx),%ecx
    ffffffff81c496b0:   8d 04 c5 08 00 00 00    lea    0x8(,%rax,8),%eax
    ffffffff81c496b7:   01 c6                   add    %eax,%esi
    ffffffff81c496b9:   85 c9                   test   %ecx,%ecx                                    <--- for loop starts here
    ffffffff81c496bb:   74 e7                   je     ffffffff81c496a4 <ipv6_gro_complete+0x84>    <--- 1st conditional check: proto != NEXTHDR_HOP
    ffffffff81c496bd:   48 8b 0c cf             mov    (%rdi,%rcx,8),%rcx
    ffffffff81c496c1:   48 85 c9                test   %rcx,%rcx
    ffffffff81c496c4:   75 d8                   jne    ffffffff81c4969e <ipv6_gro_complete+0x7e>    <--- 2nd conditional check: unlikely(!(*opps))
    
    ... (indirect call ops->callbacks.gro_complete)

ipv6_exthdrs_len contains a loop which has 3 conditional checks.
For the common (no extensions header) case, in the new code, *all 3
branches are completely avoided*

patched code ipv6_gro_complete:

    // if (skb->encapsulation)
    ffffffff81befe58:   f6 83 81 00 00 00 20    testb  $0x20,0x81(%rbx)
    ffffffff81befe5f:   74 78                   je     ffffffff81befed9 <ipv6_gro_complete+0xb9>
    
    ...
    
    // else
    ffffffff81befed9:	0f b6 43 50          	movzbl 0x50(%rbx),%eax
    ffffffff81befedd:	0f b7 73 4c          	movzwl 0x4c(%rbx),%esi
    ffffffff81befee1:	48 8b 0c c5 c0 3f a9 	mov    -0x7d56c040(,%rax,8),%rcx
    
    ... (indirect call ops->callbacks.gro_complete)

Thus, the patch is beneficial for both the common case and the ext hdr
case. I would appreciate a second consideration :)

> > while patch 1/2 could be useful, patch 2/2 overall looks not worthy to
> > me.
> >
> > I suggest to re-post for inclusion only patch 1, unless others have
> > strong different opinions.
> >
> 
> +2
> 
> I have the same feeling/opinion.
>
> > Cheers,
> >
> > Paolo
> >

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ