netdev - IP (rtl8169) forwarding bug (performance)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a7ba722c-023c-35dc-ab97-c778c7781413@gmail.com>
Date:   Mon, 10 Dec 2018 01:28:09 +0200
From:   Risto Pajula <or.pajula@...il.com>
To:     "David S. Miller" <davem@...emloft.net>,
        Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        Realtek linux nic maintainers <nic_swsd@...ltek.com>,
        hkallweit1@...il.com
Cc:     netdev@...r.kernel.org
Subject: IP (rtl8169) forwarding bug (performance)


Hello.

Old subject: "Re: IP fragmentation performance and don't fragment bug 
when forwarding


I have now been tracing the kernel and finding the bug seems difficult. 
I think the bug is combination of several things, likely cause is that 
it only occurs with rtl8169 and how it is using the netdevapi/DQL api.

 From my investigations seems that following happens:

*Burst of frames is received from internet. (eth0)
*These are accepted and forwarded to the (eth1)
*DQL hits the limit, this causes the scheduling for the device to be 
stopped (__QUEUE_STATE_STACK_XOFF)
*This and combination of some timing causes the  performance 
degradation, messages are only scheduled for transmission in soft Irq 
context.

I still do now know, why the DQL hits the limits.. I think it should 
not,  maybe something undesirable first happens with rtl8169.

I inserted following trace printk functions to the code..

rtl8169_start_xmit:
trace_printk("ORP33 %s rtl8169_start_xmit len: %d entry: %u cur_tx: %u 
frags: %d dirty_tx: %u tx_left: %u\n" , tp->dev->name, skb->len, entry, 
tp->cur_tx, frags, tp->dirty_tx, (tp->cur_tx - tp->dirty_tx) );


rtl_tx:
trace_printk("ORP33 %s rtl_tx len: %d entry: %u cur_tx: %u frags: %d 
dirty_tx: %u tx_left: %u\n" , tp->dev->name, tx_skb->skb->len, 
tp->cur_tx % NUM_TX_DESC, tp->cur_tx, 0, dirty_tx, tx_left );


Here is some grepped output only filtering the ping packet in two 
different situations:

//trace when downloading ubuntu (~20MB/s), only 3 secs fit to buffer. 
ping is all the time normal....
//orig tracing.out3754
117493:          <idle>-0     [000] ..s.  5235.407116: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 57 cur_tx: 3747641 frags: 0 dirty_tx: 
3747640 tx_left: 1
118704:          <idle>-0     [000] ..s.  5235.416057: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 20 cur_tx: 3747668 frags: 0 dirty_tx: 
3747667 tx_left: 1
346319:          hping3-2871  [000] ....  5236.002456: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 8 
cur_tx: 3754312 frags: 0 dirty_tx: 3754312 tx_left: 0
346337:          hping3-2871  [000] ..s.  5236.002475: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 9 cur_tx: 3754313 frags: 0 dirty_tx: 
3754312 tx_left: 1
1277155:          hping3-2871  [001] ....  5237.002627: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 38 
cur_tx: 3766502 frags: 0 dirty_tx: 3766502 tx_left: 0
1277173:          <idle>-0     [000] ..s.  5237.002646: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 39 cur_tx: 3766503 frags: 0 dirty_tx: 
3766502 tx_left: 1


//imcp messages from route to intenal network when TCP stream is on. 
tracing.out3236/eth1_ring.txt is the original
//no idea why the first frame is tripled...
200792:          hping3-2871  [001] ....  3239.593384: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 34 
cur_tx: 2132770 frags: 0 dirty_tx: 2132770 tx_left: 0
201578:          hping3-2871  [001] ....  3240.593625: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 40 
cur_tx: 2132776 frags: 0 dirty_tx: 2132776 tx_left: 0
341061:          <idle>-0     [000] ..s.  3241.738011: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 52 
cur_tx: 2133684 frags: 0 dirty_tx: 2133680 tx_left: 4
342112:          <idle>-0     [000] ..s.  3241.745297: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 59 cur_tx: 2133691 frags: 0 dirty_tx: 
2133684 tx_left: 7
500126:          <idle>-0     [000] ..s.  3242.739500: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 49 
cur_tx: 2134513 frags: 0 dirty_tx: 2134505 tx_left: 8
501739:          <idle>-0     [000] ..s.  3242.751654: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 58 cur_tx: 2134522 frags: 0 dirty_tx: 
2134513 tx_left: 9
646220:          <idle>-0     [000] ..s.  3243.745761: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 41 
cur_tx: 2135273 frags: 0 dirty_tx: 2135267 tx_left: 6
647633:          <idle>-0     [000] ..s.  3243.755486: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 49 cur_tx: 2135281 frags: 0 dirty_tx: 
2135273 tx_left: 8
802878:          <idle>-0     [000] ..s.  3244.739947: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 23 
cur_tx: 2136087 frags: 0 dirty_tx: 2136081 tx_left: 6
804298:          <idle>-0     [000] ..s.  3244.749677: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 31 cur_tx: 2136095 frags: 0 dirty_tx: 
2136087 tx_left: 8
961190:          <idle>-0     [000] ..s.  3245.746217: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 16 
cur_tx: 2136912 frags: 0 dirty_tx: 2136906 tx_left: 6
962610:          <idle>-0     [000] ..s.  3245.755946: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 24 cur_tx: 2136920 frags: 0 dirty_tx: 
2136912 tx_left: 8
1118044:          <idle>-0     [000] ..s.  3246.740336: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 62 
cur_tx: 2137726 frags: 0 dirty_tx: 2137720 tx_left: 6
1119443:          <idle>-0     [000] ..s.  3246.750047: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 5 cur_tx: 2137733 frags: 0 dirty_tx: 
2137726 tx_left: 7
1264047:          <idle>-0     [000] ..s.  3247.709202: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 29 
cur_tx: 2138525 frags: 0 dirty_tx: 2138518 tx_left: 7
1264740:          <idle>-0     [000] ..s.  3247.718918: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 33 cur_tx: 2138529 frags: 0 dirty_tx: 
2138525 tx_left: 4
1419958:          <idle>-0     [000] ..s.  3248.736436: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 48 
cur_tx: 2139312 frags: 0 dirty_tx: 2139305 tx_left: 7
1421551:          <idle>-0     [000] ..s.  3248.746189: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 56 cur_tx: 2139320 frags: 0 dirty_tx: 
2139312 tx_left: 8
1578746:          <idle>-0     [000] ..s.  3249.742702: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 46 
cur_tx: 2140142 frags: 0 dirty_tx: 2140135 tx_left: 7
1580194:          <idle>-0     [000] ..s.  3249.752457: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 54 cur_tx: 2140150 frags: 0 dirty_tx: 
2140142 tx_left: 8
1729597:          <idle>-0     [000] ..s.  3250.746839: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 36 
cur_tx: 2140964 frags: 0 dirty_tx: 2140957 tx_left: 7
1731015:          <idle>-0     [000] ..s.  3250.756594: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 43 cur_tx: 2140971 frags: 0 dirty_tx: 
2140964 tx_left: 7
1881028:          <idle>-0     [000] ..s.  3251.740991: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 19 
cur_tx: 2141779 frags: 0 dirty_tx: 2141772 tx_left: 7
1881381:          <idle>-0     [000] ..s.  3251.750718: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 26 cur_tx: 2141786 frags: 0 dirty_tx: 
2141779 tx_left: 7
//something happens, the output queue empties.. now it is fast, and 
rtl8169_start_xmit is not called from soft_irq
1896178:          hping3-2871  [001] ....  3252.595994: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 23 
cur_tx: 2141911 frags: 0 dirty_tx: 2141911 tx_left: 0
1896196:          <idle>-0     [000] ..s.  3252.596055: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 24 cur_tx: 2141912 frags: 0 dirty_tx: 
2141911 tx_left: 1
1923136:          hping3-2871  [001] ....  3253.596227: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 50 
cur_tx: 2142066 frags: 0 dirty_tx: 2142066 tx_left: 0
1923154:          <idle>-0     [000] ..s.  3253.596289: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 51 cur_tx: 2142067 frags: 0 dirty_tx: 
2142066 tx_left: 1
1951633:          hping3-2871  [001] ....  3254.596459: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 5 
cur_tx: 2142213 frags: 0 dirty_tx: 2142213 tx_left: 0
1951651:          <idle>-0     [000] ..s.  3254.596521: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 6 cur_tx: 2142214 frags: 0 dirty_tx: 
2142213 tx_left: 1
1954472:          hping3-2871  [001] ....  3255.596700: 
rtl8169_start_xmit: ORP33 eth1 rtl8169_start_xmit len: 54 entry: 15 
cur_tx: 2142223 frags: 0 dirty_tx: 2142223 tx_left: 0
1954490:          <idle>-0     [000] ..s.  3255.596763: rtl8169_poll: 
ORP33 eth1 rtl_tx len: 54 entry: 16 cur_tx: 2142224 frags: 0 dirty_tx: 
2142223 tx_left: 1
//until it soon, again is slow..


BR.
Risto



On 7.12.2018 16:46, Risto Pajula wrote:
> Hello.
> 
> I have been to track the poor forwarding latency to the TCP Window scale 
> options. The Netgem device uses rather large windows scale options 
> (x256) and I have been able to reproduce the routers poor forwarding 
> latency also with linux box running in the internal network and changing 
> the net.ipv4.tcp_rmem to a large value and thus changing the TCP window 
> scaling options to larger ones. I still do not have clue why this causes 
> the forwarfing in the linux kernel to block? Maybe something in the 
> connection tracking....?
> 
> 
> With the ICMP timestamp messages I have been able to also pinpoint that 
> the latency is caused in the eth1 sending side (the following hping3 
> example is run in the router toward the internal network...
> 
> 
> xxx:/usr/src/linux-4.20-rc2 # hping3 192.168.0.112 --icmp --icmp-ts -V
> using eth1, addr: 192.168.0.1, MTU: 1500
> HPING 192.168.0.112 (eth1 192.168.0.112): icmp mode set, 28 headers + 0 
> data bytes
> len=46 ip=192.168.0.112 ttl=64 id=49464 tos=0 iplen=40
> icmp_seq=0 rtt=7.9 ms
> ICMP timestamp: Originate=52294891 Receive=52294895 Transmit=52294895
> ICMP timestamp RTT tsrtt=7
> 
> len=46 ip=192.168.0.112 ttl=64 id=49795 tos=0 iplen=40
> icmp_seq=1 rtt=235.9 ms
> ICMP timestamp: Originate=52295891 Receive=52296128 Transmit=52296128
> ICMP timestamp RTT tsrtt=235
> 
> len=46 ip=192.168.0.112 ttl=64 id=49941 tos=0 iplen=40
> icmp_seq=2 rtt=3.8 ms
> ICMP timestamp: Originate=52296891 Receive=52296895 Transmit=52296895
> ICMP timestamp RTT tsrtt=3
> 
> len=46 ip=192.168.0.112 ttl=64 id=50685 tos=0 iplen=40
> icmp_seq=3 rtt=47.8 ms
> ICMP timestamp: Originate=52297891 Receive=52297940 Transmit=52297940
> ICMP timestamp RTT tsrtt=47
> 
> len=46 ip=192.168.0.112 ttl=64 id=51266 tos=0 iplen=40
> icmp_seq=4 rtt=7.7 ms
> ICMP timestamp: Originate=52298891 Receive=52298895 Transmit=52298895
> ICMP timestamp RTT tsrtt=7
> 
> len=46 ip=192.168.0.112 ttl=64 id=52245 tos=0 iplen=40
> icmp_seq=5 rtt=3.7 ms
> ICMP timestamp: Originate=52299891 Receive=52299895 Transmit=52299895
> ICMP timestamp RTT tsrtt=3
> 
> ^C
> --- 192.168.0.112 hping statistic ---
> 6 packets tramitted, 6 packets received, 0% packet loss
> round-trip min/avg/max = 3.7/51.1/235.9 ms
> 
> 
> 
> BR.
> Risto
> 
> On 2.12.2018 23:32, Risto Pajula wrote:
>> Hello.
>>
>> You can most likely ignore the "DF Bit, mtu bug when forwarding" case. 
>> There isn't actually big IP packets on the wire, instead there is 
>> burst of packets on the wire, which are combined by the GRO... And 
>> thus dropping them should not happen. Sorry about the invalid bug report.
>>
>> However the poor latency from intenal network to the internet still 
>> remain, both GRO enabled and disabled. I will try to study further...
>>
>>
>> BR.
>> Risto
>>
>>
>> On 2.12.2018 14:01, Risto Pajula wrote:
>>> Hello.
>>>
>>> I have encountered a weird performance problem in Linux IP 
>>> fragmentation when using video streaming services behind the NAT. 
>>> Also I have studied a possible bug in the DF bit (don't fragment) 
>>> handling when forwarding the IP packets.
>>>
>>> First the system setup description:
>>>
>>> [host1]-int lan-(eth1)[linux router](eth0)-extlan-[fibre 
>>> router]-internet
>>>
>>> where:
>>> host1: is a Netgem N7800 "cable box" for online video streaming 
>>> services provided by local telco (Can access Netflix, HBO nordic, 
>>> "live TV", etc.)
>>> linux router: Linux computer with Dualcore Intel Celeron G1840, 
>>> running currently Linux kernel 4.20.0-rc2, and openSUSE Leap 15.0
>>> eth1: Linux Routers internal (NAT) interface, 192.168.0.1/24 network, 
>>> mtu set to 1500, RTL8169sb/8110sb
>>> eth0: Linux Routers internet facing interface, public ip address, mtu 
>>> set to 1500,  RTL8168evl/8111evl
>>> fibre router: Alcatel Lucent fibre router (I-241G-Q), directly 
>>> connected to the eth0 of the Linux router.
>>>
>>> And now when using the Netgem N7800 with online video services 
>>> (Netflix, HBO nordic, etc) the Linux router will receive very BIG IP 
>>> packets in the eth0 upto ~20kB, this seems to lead to the following 
>>> problems in the Linux IP stack.
>>>
>>> IP fragmentation performance:
>>> When the Linux router receives these large IP packets in the eth0 
>>> everything works, but it seems that them cause very large performance 
>>> degradation from internal network to the internet regarding the 
>>> latency when the IP fragmentation is performed. The ping latency from 
>>> internal network to the internel network increases from stable 
>>> 15ms-20ms up to 700-800ms AND also the ping from the internal network 
>>> to the linux router eth1 (192.168.0.). However up link works 
>>> perfectly, the ping is still stable when streaming the online 
>>> services (From linux router to the internet). It seems that the IP 
>>> fragmentation is somehow blocking the eth1 reception or transmission 
>>> for very long time (which it shouldn't). I'm able to test and debug 
>>> the issue further, but advice regarding where to look would be 
>>> appreciated.
>>>
>>>
>>> DF Bit, mtu bug when forwarding:
>>> I have started to study the above mentioned problem and have found a 
>>> possible bug in the DF bit and mtu handling in IP forwarding. The BIG 
>>> packets received from streaming services all have the "DF bit" set 
>>> and the question is that should we be forwarding them at all as that 
>>> would result them being fragmented? Apparently we currently are... I 
>>> have traced this down to the ip_forward.c function ip_exceeds_mtu(), 
>>> and the following patch seems to fix that.
>>>
>>> --- net/ipv4/ip_forward.c.orig  2018-12-02 11:09:32.764320780 +0200
>>> +++ net/ipv4/ip_forward.c       2018-12-02 12:53:25.031232347 +0200
>>> @@ -49,7 +49,7 @@ static bool ip_exceeds_mtu(const struct
>>>                 return false;
>>>
>>>         /* original fragment exceeds mtu and DF is set */
>>> -       if (unlikely(IPCB(skb)->frag_max_size > mtu))
>>> +        if (unlikely(skb->len > mtu))
>>>                 return true;
>>>
>>>         if (skb->ignore_df)
>>>
>>>
>>> This seems to work (in some ways) - after the change IP packets that 
>>> are too large to the internal network get dropped and we are sending 
>>> "ICMP Destination unreachable, The datagram is too big" messages to 
>>> the originator (as we should?). However it seems that not all 
>>> services really like this... Netflix behaves as expected and ping is 
>>> stable from internal network to the internet, but for example HBO 
>>> nordic will not work anymore (too little buffering? Retransimissions 
>>> not working?). So it seems the original issue should be also fixed 
>>> (And the fragmention should be allowed?).
>>>
>>>
>>>
>>> Any advice would be appreciated. Thanks!
>>>
>>> PS. Watching TV was not this intensive 20 years ago :)
>>>