Date:   Sun, 29 Mar 2020 21:51:33 -0400
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Yi Yang (杨燚) - Cloud Service Group
        <yangyi01@...pur.com>
Cc:     "willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>,
        "yang_y_yi@....com" <yang_y_yi@....com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "u9012063@...il.com" <u9012063@...il.com>
Subject: Re: [Sent on behalf by vger.kernel.org]Re: [Sent on behalf by vger.kernel.org]Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO

On Sat, Mar 28, 2020 at 10:43 PM Yi Yang (杨燚) - Cloud Service Group <yangyi01@...pur.com> wrote:
>
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@...il.com]
> Sent: March 29, 2020 2:36
> To: Yi Yang (杨燚) - Cloud Service Group <yangyi01@...pur.com>
> Cc: willemdebruijn.kernel@...il.com; yang_y_yi@....com;
> netdev@...r.kernel.org; u9012063@...il.com
> Subject: Re: [Sent on behalf by vger.kernel.org]Re: [Sent on behalf by vger.kernel.org]Re: [PATCH net-next]
> net/packet: fix TPACKET_V3 performance issue in case of TSO
>
> On Sat, Mar 28, 2020 at 4:37 AM Yi Yang (杨燚) - Cloud Service Group
> <yangyi01@...pur.com> wrote:
> >
> > > >
> > > > By the way, even if we used an hrtimer, it couldn't guarantee such a
> > > > big performance improvement. Every frame has a different size, so you
> > > > can't know after how many microseconds a frame will be available;
> > > > firing the timer early is wasted work, and firing it late reduces
> > > > performance. So I still think the approach this patch uses is the best
> > > > so far.
> > > >
> > >
> > > The key differentiating feature of TPACKET_V3 is the use of blocks to
> > > efficiently pack packets and amortize wake ups.
> > >
> > > If you want immediate notification for every packet, why not just use
> > > TPACKET_V2?
> > >
> > > For non-TSO packets, TPACKET_V3 is much better than TPACKET_V2, but for
> > > TSO packets it is bad; we prefer to use TPACKET_V3 for better performance.
> >
> > At high rate, blocks are retired and userspace is notified as soon as a
> > packet arrives that does not fit and requires dispatching a new block. As
> > such, max throughput is not timer dependent. The timer exists to bound
> > notification latency when packet arrival rate is slow.
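
(For anyone following along, the timer in question is the tp_retire_blk_tov
field of struct tpacket_req3, passed with PACKET_RX_RING. An untested sketch
of the setup, with error handling omitted and arbitrary example sizes rather
than the values used in the OVS DPDK patch:)

/* Minimal TPACKET_V3 receive ring setup, to show where the block retire
 * timeout (tp_retire_blk_tov) fits in. Sizes are arbitrary examples. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

static void *setup_v3_rx_ring(int *fdp, struct tpacket_req3 *req)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int ver = TPACKET_V3;

	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

	memset(req, 0, sizeof(*req));
	req->tp_block_size = 1 << 16;			/* multiple of page size */
	req->tp_frame_size = 1 << 11;
	req->tp_block_nr   = 128;
	req->tp_frame_nr   = (req->tp_block_size / req->tp_frame_size) *
			     req->tp_block_nr;
	req->tp_retire_blk_tov = 4;			/* retire timeout in msec;
							   0 lets the kernel pick */

	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));

	*fdp = fd;
	return mmap(NULL, (size_t)req->tp_block_size * req->tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
}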
> >
> > [Yi Yang] In our iperf3 TCP test with TSO enabled, even with a packet size
> > of about 64K and a block size of 64K + 4K (to accommodate the tpacket_vX
> > header), we can't get high performance without this patch. I think the
> > small packets that precede the 64K packets decide what performance it can
> > reach: according to my trace, the TCP packet size grows gradually from less
> > than 100 bytes to 64K, so how long that ramp-up period takes seems to
> > decide the achievable performance. So yes, I don't think an hrtimer can fix
> > this issue very efficiently. In addition, I also noticed the packet size
> > pattern is 1514, 64K, 64K, 64K, 64K, ..., 1514, 64K even after it reaches
> > 64K; maybe that 1514-byte packet has a big impact on performance, I just
> > guess.
>
> Again, the main issue is that the timer does not matter at high rate.
> The 3 Gbps you report corresponds to ~6000 TSO packets per second, or a
> 167 usec inter-arrival time. The timer, whether 1 or 4 ms, should never
> be needed.
>
> There are too many unknown variables here. Besides block size, what is
> tp_block_nr? What is the drop rate? Are you certain that you are not causing
> drops by not reading fast enough? What happens when you increase
> tp_block_size or tp_block_nr? It may be worthwhile to pin iperf to one (set
> of) core(s) and the packet socket reader to another.
> Let it busy spin and do minimal processing, just return blocks back to the
> kernel.
>
> If unsure about that, it may be interesting to instrument the kernel and
> count how many block retire operations are from
> prb_retire_rx_blk_timer_expired and how many from tpacket_rcv.
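
(A possible shortcut that avoids patching the kernel, if I read af_packet.c
correctly: when the retire timer closes a block it also sets TP_STATUS_BLK_TMO
in the block status, so the userspace reader loop can count timer-retired
blocks versus blocks that filled up. Untested sketch, reusing the ring and
struct tpacket_req3 from the setup above:)

/* Count blocks closed by the retire timer vs. blocks that filled up.
 * `ring` is the mmap()ed area, `req` the tpacket_req3 passed to
 * PACKET_RX_RING; error handling and packet parsing are omitted. */
#include <linux/if_packet.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>

static void count_block_retires(int fd, char *ring,
				const struct tpacket_req3 *req)
{
	uint64_t by_timer = 0, by_full = 0;
	unsigned int i = 0;

	for (;;) {
		struct tpacket_block_desc *bd =
			(void *)(ring + (size_t)i * req->tp_block_size);

		if (!(bd->hdr.bh1.block_status & TP_STATUS_USER)) {
			struct pollfd pfd = { .fd = fd, .events = POLLIN };

			poll(&pfd, 1, -1);	/* wait for the next retired block */
			continue;
		}

		if (bd->hdr.bh1.block_status & TP_STATUS_BLK_TMO)
			by_timer++;	/* closed by prb_retire_rx_blk_timer_expired */
		else
			by_full++;	/* closed because the next packet did not fit */

		/* ...walk bd->hdr.bh1.num_pkts packets here if desired... */

		__sync_synchronize();
		bd->hdr.bh1.block_status = TP_STATUS_KERNEL;	/* return block */
		i = (i + 1) % req->tp_block_nr;

		fprintf(stderr, "timer: %llu  full: %llu\n",
			(unsigned long long)by_timer, (unsigned long long)by_full);
	}
}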
>
> Note that do_vnet only changes whether a virtio_net_header is prefixed to
> the data. Having that disabled (the common case) does not stop GSO packets
> from arriving.
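
(For reference, the vnet header is requested per socket with PACKET_VNET_HDR,
and with TPACKET_V3 the struct virtio_net_hdr then sits immediately before the
link-layer header in each frame. The offset arithmetic below is my reading of
tpacket_rcv(), untested:)

/* Request a virtio_net_hdr in front of each packet so the GSO metadata
 * (gso_type, gso_size, hdr_len) is visible to the ring consumer. */
#include <linux/if_packet.h>
#include <linux/virtio_net.h>
#include <sys/socket.h>

static int enable_vnet_hdr(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &on, sizeof(on));
}

/* The header sits just before the byte that tp_mac points at. */
static const struct virtio_net_hdr *frame_vnet_hdr(const struct tpacket3_hdr *tp)
{
	return (const void *)((const char *)tp + tp->tp_mac
			      - sizeof(struct virtio_net_hdr));
}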
>
> [Yi Yang] You can refer to the patch
> (https://patchwork.ozlabs.org/patch/1257288/) for OVS DPDK for more details.
> tp_block_nr is 128 for the TSO case and the frame size is equal to the block
> size; I tried increasing the block size to hold multiple frames and also
> tried a bigger tp_block_nr, neither of which helped. For TSO we have to have
> the vnet header in the frame, otherwise TSO won't work. Our user scenario is
> OpenStack, but with OVS DPDK rather than OVS. Whether it is a tap interface
> or a veth interface, performance is very bad, because OVS DPDK uses a RAW
> socket to receive packets from and transmit packets to a veth. Our iperf3
> TCP test case attaches two veth interfaces to an OVS DPDK bridge, moves the
> two veth peers into two network namespaces, and runs the iperf3 client in
> one namespace and the iperf3 server in the other, so the traffic goes back
> and forth across the two veth interfaces; OVS DPDK uses TPACKET_V3 to
> forward packets between them.
>
> Here is an illustration of the traffic path (TPACKET_V3 is used on vethbr1
> and vethbr2):
>
>   ns01 <-> veth1 <-> vethbr1 <-> OVS DPDK Bridge <-> vethbr2 <-> veth2 <-> ns02
>
> I have used two pmd threads to handle vethbr1 and vethbr2 traffic,
> respectively, pinned to cores 2 and 3; the iperf3 server and client are
> pinned to cores 4 and 5. So the producer won't hit a buffer overflow issue;
> on the contrary, the consumer is starved. I tried outputting the tpacket
> stats information: no queue freeze, no packet drop, so I'm sure the buffer
> is big enough. I can see the consumer (pmd thread) being starved because it
> can't receive packets for many loops; the pmd threads are very fast and have
> nothing else to do except receive and transmit packets.
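
(For anyone reproducing this: those counters come from getsockopt
PACKET_STATISTICS, which with TPACKET_V3 returns a struct tpacket_stats_v3 and
resets the counts on read. Roughly:)

/* Dump (and implicitly reset) the TPACKET_V3 socket counters behind the
 * "no queue freeze, no packet drop" observation above. */
#include <linux/if_packet.h>
#include <stdio.h>
#include <sys/socket.h>

static void dump_ring_stats(int fd)
{
	struct tpacket_stats_v3 st;
	socklen_t len = sizeof(st);

	if (getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &st, &len) == 0)
		printf("packets=%u drops=%u queue_freezes=%u\n",
		       st.tp_packets, st.tp_drops, st.tp_freeze_q_cnt);
}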
>
> My test script for reference:
>
> #!/bin/bash
>
> DB_SOCK=unix:/var/run/openvswitch/db.sock
> OVS_VSCTL="/home/yangyi/workspace/ovs-master/utilities/ovs-vsctl --db=${DB_SOCK}"
>
> ${OVS_VSCTL} add-br br-int1 -- set bridge br-int1 datapath_type=netdev protocols=OpenFlow10,OpenFlow12,OpenFlow13
> ip link add veth1 type veth peer name vethbr1
> ip link add veth2 type veth peer name vethbr2
> ip netns add ns01
> ip netns add ns02
>
> ip link set veth1 netns ns01
> ip link set veth2 netns ns02
>
> ip netns exec ns01 ifconfig lo 127.0.0.1 up
> ip netns exec ns01 ifconfig veth1 10.15.1.2/24 up
> #ip netns exec ns01 ethtool -K veth1 tx off
>
> ip netns exec ns02 ifconfig lo 127.0.0.1 up
> ip netns exec ns02 ifconfig veth2 10.15.1.3/24 up
> #ip netns exec ns02 ethtool -K veth2 tx off
>
> ifconfig vethbr1 0 up
> ifconfig vethbr2 0 up
>
>
> ${OVS_VSCTL} add-port br-int1 vethbr1
> ${OVS_VSCTL} add-port br-int1 vethbr2
>
> ip netns exec ns01 ping 10.15.1.3 -c 3
> ip netns exec ns02 ping 10.15.1.2 -c 3
>
> killall iperf3
> ip netns exec ns02 iperf3 -s -i 10 -D -A 4
> ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 -A 5 --get-server-output
>
> ------------------------
> iperf3 test result
> -----------------------
> [yangyi@...alhost ovs-master]$ sudo ../run-iperf3.sh
> iperf3: no process found
> Connecting to host 10.15.1.3, port 5201
> [  4] local 10.15.1.2 port 44976 connected to 10.15.1.3 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes

Thanks for the detailed info.

So there is more going on here than a simple network tap: veth, which
calls netif_rx and thus schedules delivery via NAPI after a softirq
(twice), plus tpacket for recv + send, plus OVS processing. And this is
a single flow, so it is more sensitive to batching, drops and interrupt
moderation than a workload of many flows.

If anything, I would expect the ACKs on the return path to be the more
likely cause for concern, as they are even less likely to fill a block
before the timer. The return path is a separate packet socket?

With the initial small window size, I guess it might be possible for the
entire window to be in transit. And as no follow-up data will arrive,
this waits for the timeout. But at 3 Gbps that is no longer the case.
Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable,
then TPACKET_V2 is a more logical choice, here also in relation to
timely ACK responses.

Other users of TPACKET_V3 may be using fewer blocks of larger size. A
change to retire blocks after one GSO packet would negatively affect
their workloads. At the very least this should be an optional feature,
similar to how I suggested converting the timeout to microseconds.
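
To make the opt-in point concrete: struct tpacket_req3 already carries a
tp_feature_req_word (as far as I can see, only TP_FT_REQ_FILL_RXHASH is
defined today), so such a behavior change could plausibly hide behind a new
bit there rather than changing defaults for everyone. Purely illustrative;
the flag name below is made up:

#include <linux/if_packet.h>

/* Hypothetical opt-in: retire the current block as soon as a GSO packet
 * has been copied into it. TP_FT_REQ_RETIRE_ON_GSO does not exist; it is
 * only meant to show where such a knob could live. */
#define TP_FT_REQ_RETIRE_ON_GSO (1 << 1)

static const struct tpacket_req3 example_req = {
	.tp_block_size       = 1 << 16,
	.tp_block_nr         = 128,
	.tp_frame_size       = 1 << 16,			/* one frame per block */
	.tp_frame_nr         = 128,
	.tp_retire_blk_tov   = 1,			/* msec */
	.tp_feature_req_word = TP_FT_REQ_RETIRE_ON_GSO,	/* explicit opt-in */
};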
