Date:   Tue, 14 Apr 2020 10:03:34 -0400
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Yi Yang (杨燚)-云服务集团 
        <yangyi01@...pur.com>
Cc:     "yang_y_yi@....com" <yang_y_yi@....com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "u9012063@...il.com" <u9012063@...il.com>
Subject: Re: [PATCH net-next] net/packet: fix TPACKET_V3 performance issue in case of TSO

> > > iperf3 test result
> > > -----------------------
> > > [yangyi@...alhost ovs-master]$ sudo ../run-iperf3.sh
> > > iperf3: no process found
> > > Connecting to host 10.15.1.3, port 5201 [  4] local 10.15.1.2 port
> > > 44976 connected to 10.15.1.3 port 5201
> > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > [  4]   0.00-10.00  sec  19.6 GBytes  16.8 Gbits/sec  106586    307 KBytes
> > > [  4]  10.00-20.00  sec  19.5 GBytes  16.7 Gbits/sec  104625    215 KBytes
> > > [  4]  20.00-30.00  sec  20.0 GBytes  17.2 Gbits/sec  106962    301 KBytes
> >
> > Thanks for the detailed info.
> >
> > So there is more going on there than a simple network tap. veth, which calls netif_rx and thus schedules delivery with a napi after a softirq (twice), tpacket for recv + send + ovs processing. And this is a single flow, so more sensitive to batching, drops and interrupt moderation than a workload of many flows.
> >
> > If anything, I would expect the ACKs on the return path to be the more likely cause for concern, as they are even less likely to fill a block before the timer. The return path is a separate packet socket?
> >
> > With initial small window size, I guess it might be possible for the entire window to be in transit. And as no follow-up data will arrive, this waits for the timeout. But at 3Gbps that is no longer the case.
> > Again, the timeout is intrinsic to TPACKET_V3. If that is unacceptable, then TPACKET_V2 is a more logical choice. Here also in relation to timely ACK responses.
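For reference, the block-retire timeout discussed above is requested per ring when the socket is set up. The following is a minimal sketch, assuming the UAPI definitions in <linux/if_packet.h>; the helper names fill_v3_ring_req and apply_v3_ring are hypothetical, and the sizes are illustrative rather than taken from the setup in this thread. As far as I know, tp_retire_blk_tov is expressed in milliseconds, and a value of 0 lets the kernel pick a default.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper: fill a TPACKET_V3 ring request.
 * The kernel expects tp_frame_nr to equal frames-per-block times
 * tp_block_nr, so it is derived here instead of passed in. */
void fill_v3_ring_req(struct tpacket_req3 *req,
                      unsigned int block_size, unsigned int block_nr,
                      unsigned int frame_size, unsigned int retire_tov_ms)
{
    assert(frame_size != 0 && block_size % frame_size == 0);

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = (block_size / frame_size) * block_nr;
    /* Block retire timeout, in milliseconds; 0 asks for the kernel default. */
    req->tp_retire_blk_tov = retire_tov_ms;
}

/* Hypothetical helper: select TPACKET_V3 and install the RX ring on an
 * already opened AF_PACKET socket. Returns 0 on success, -1 on error. */
int apply_v3_ring(int fd, const struct tpacket_req3 *req)
{
    int version = TPACKET_V3;

    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION,
                   &version, sizeof(version)) < 0)
        return -1;
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```

Calling fill_v3_ring_req(&req, 1 << 20, 8, 2048, 1) describes eight 1 MiB blocks with a 1 ms retire timeout. apply_v3_ring needs a socket created with socket(AF_PACKET, SOCK_RAW, ...), which requires CAP_NET_RAW.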
> >
> > Other users of TPACKET_V3 may be using fewer blocks of larger size. A change to retire blocks after one GSO packet will negatively affect their workloads. At the very least this should be an optional feature, similar to how I suggested converting to microseconds.
> >
> > [Yi Yang] My iperf3 test uses a TCP socket, so the return path is the same socket as the forward path. BTW, this patch retires the current block only if the packets carry a vnet header, and I don't know of any use case other than ours that uses the vnet header. I also have more conditions to limit this, but they hurt performance. I'll check whether V2 fixes our issue; if it does not, this patch will be the only way to fix it.
> >
>
> Thanks. Also interesting might be a short packet trace of packet arrival on the bond device ports, taken at the steady state of 3 Gbps.
> To observe when inter-arrival time exceeds the 167 usec mean. Also informative would be to learn whether when retiring a block using your patch, that block also holds one or more ACK packets along with the GSO packet. As their delay might be the true source of throttling the sender.
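As a rough cross-check of the mean quoted above, the gap between back-to-back packets follows from packet size and line rate. This is a minimal sketch; the function name mean_interarrival_ns is hypothetical, and the 64 KiB packet size in the usage note is my assumption about the GSO packets, not a number stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mean gap, in nanoseconds, between consecutive
 * packets of pkt_bytes each when they arrive at rate_bps bits per second.
 * Integer division truncates toward zero. */
uint64_t mean_interarrival_ns(uint64_t pkt_bytes, uint64_t rate_bps)
{
    assert(rate_bps != 0);
    return pkt_bytes * 8u * 1000000000ull / rate_bps;
}
```

With 65536-byte packets at 3 Gbps, mean_interarrival_ns(65536, 3000000000ull) gives 174762 ns, about 175 usec, which is in the same range as the 167 usec mean.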
>
> I think we need to understand the underlying problem better to implement a robust fix that works for a variety of configurations and does not cause accidental regressions. The current patch works for your setup, but I'm afraid that it might paper over the real issue.
>
> It is a peculiar aspect of TPACKET_V3 that blocks are retired not when a packet is written that fills them, but when the next packet arrives and cannot find room. Again, at sustained rate that delay should be immaterial. But it might be okay to measure remaining space after write and decide to retire if below some watermark. I would prefer that watermark to be a ratio of block size rather than whether the packet is gso or not.
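The watermark idea above could look roughly like the following. This is only a sketch of the decision in plain C, not kernel code and not the proposed patch; the function name should_retire_after_write and the percentage parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check run after a packet has been written into a block:
 * retire the block early when the space left is below watermark_pct
 * percent of the block size, regardless of whether the packet was GSO. */
bool should_retire_after_write(size_t block_size, size_t bytes_used,
                               unsigned int watermark_pct)
{
    assert(bytes_used <= block_size && watermark_pct <= 100);

    size_t remaining = block_size - bytes_used;

    /* Compare remaining / block_size < watermark_pct / 100 by
     * cross-multiplying, which avoids integer-division rounding. */
    return remaining * 100 < block_size * watermark_pct;
}
```

With a 1 MiB block and a 10 percent watermark, a block with 1000000 bytes used would be retired right after the write, while a block holding a single 64 KiB packet would stay open for more packets.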
>
> [Yi Yang] Sorry for the late reply, I missed this email. I did time every received frame, but the intervals are highly dynamic and I could not find any useful clues. I did find that TCP ACK frames have a big impact on performance. These are small frames (no more than 100 bytes), and in the TPACKET_V3 case a block will hold a bunch of them, so these ACK frames aren't received and sent back to the receiver in time.
>
> I tried TPACKET_V2, and its performance exceeded my expectations. I tried it on kernel 5.5.9, its performance is better than this patch, about 11Gbps. I also tried kernel 4.15.0 (from Ubuntu, which cherry-picked many fixes from upstream, so it is not the official 4.15.0); its performance is about 14Gbps, worse than this patch (which is 17Gbps). So the performance is clearly kernel-related and platform-related. In the non-pmd case (i.e. sender and receiver are one thread and use the same CPU), TPACKET_V2 is much better than recvmmsg & sendmmsg. We decided to use TPACKET_V2 for TSO. But we don't know how to reach higher performance than 14Gbps; it looks like the cache flush operation in tpacket_v2/v3 has a side effect on performance (especially the once-per-frame flush for TPACKET_V2).

Kernel 5.5.9 with TPACKET_V2 is better than this patch at 11 Gbps, but
Ubuntu 4.15.0 is worse than this patch at 14 Gbps (this patch is 17)?

How did you arrive at the conclusion that the cache flush operation is
the main bottleneck?

Good to hear that you verified that a main issue is the ACK delay.

Instead of packet sockets, you could also take a look at AF_XDP. There
seems to be documentation on how to deploy it with OVS.
