Date:   Wed, 23 Aug 2017 13:02:28 -0700
From:   Florian Fainelli <f.fainelli@...il.com>
To:     netdev@...r.kernel.org, edumazet@...il.com, pabeni@...hat.com,
        willemb@...gle.com
Cc:     davem@...emloft.net
Subject: UDP sockets oddities

Hi,

On Broadcom STB chips using bcmsysport.c and bcm_sf2.c we have an out of
band HW mechanism (not using per-flow pause frames) where the integrated
network switch can backpressure the CPU Ethernet controller, which
translates into completing TX packet interrupts at the appropriate pace,
and therefore flow control gets applied end-to-end from the host CPU
port towards any downstream port. At least that is the premise, and it
works reasonably well.

This has a few drawbacks in that each of the bcmsysport TX queues needs
to be semi-statically mapped to its switch port output queue so that the
switch can calculate buffer occupancy and report congestion status,
which prompted this email [1], but that is tangential and is a policy
issue, not a mechanism issue.

[1]: https://www.spinics.net/lists/netdev/msg448153.html

This is useful when your CPU / integrated switch links up at 1Gbits/sec
internally and tries to push 1Gbits/sec worth of UDP traffic to e.g. a
downstream port linking at 100Mbits/sec, which could happen depending on
what you have connected to this device.

Now the problem that I am facing is the following:

- net.core.wmem_default = 160KB (default value)
- using iperf -b 800M -u towards an iperf UDP server with the physical
link to that server established at 100Mbits/sec
- iperf does synchronous write(2) AFAICT, so this gives it flow control
(see the sketch right after this list)
- using the default duration of 10s, you can barely see any packet loss
from one run to another
- the longer the run, the more packet loss you are going to see, usually
~0.15% at most
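
In case it helps clarify what I mean by flow control above, here is a
minimal sketch of the blocking UDP sender behaviour I am describing
(this is not iperf itself; the server address, port and datagram size
are just placeholders):

/* Minimal sketch of a blocking UDP sender (not iperf): each sendto()
 * only returns once the datagram has been queued into the socket send
 * buffer, whose size comes from net.core.wmem_default unless
 * overridden. Server address, port and datagram size are placeholders.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst;
        char payload[1470];
        int i;

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(5001);            /* default iperf UDP port */
        inet_pton(AF_INET, "192.168.1.100", &dst.sin_addr); /* placeholder */
        memset(payload, 0, sizeof(payload));

        /* Once the send buffer fills up, sendto() blocks until the stack
         * frees already-queued datagrams, which is the flow control the
         * application ends up relying on. */
        for (i = 0; i < 1000000; i++)
                sendto(fd, payload, sizeof(payload), 0,
                       (struct sockaddr *)&dst, sizeof(dst));

        close(fd);
        return 0;
}

The point being that the sender can only run ahead of the driver by
roughly one socket send buffer's worth of packets, which is why
wmem_default enters the picture below.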

The transmit flow looks like this:

gphy (net/dsa/slave.c::dsa_slave_xmit, IFF_NO_QUEUE device)
 -> eth0 (drivers/net/ethernet/broadcom/bcmsysport.c, "regular" network
device)

I can clearly see that the network stack pushed N UDP packets (the Udp
and Ip counters in /proc/net/snmp concur), however what the driver
transmitted and what the switch transmitted is N - M, and M matches the
packet loss reported by the UDP server. I don't measure any
SndbufErrors, which does not make sense to me yet.
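
For reference, this is roughly how I compare the counters, dumping the
Udp lines from /proc/net/snmp before and after a run (just a throwaway
helper sketch, nothing specific to these drivers):

/* Dump the Udp header and value lines (OutDatagrams, SndbufErrors, ...)
 * from /proc/net/snmp; run before and after a test and diff by hand. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/net/snmp", "r");
        char line[512];

        if (!f)
                return 1;

        /* /proc/net/snmp has one "Udp:" header line and one value line */
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "Udp:", 4))
                        fputs(line, stdout);

        fclose(f);
        return 0;
}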

If I reduce the default socket write buffer size to, say, 10x less than
160KB, i.e. 16KB, then I either don't see any packet loss at
100Mbits/sec for 5 minutes or more, or just very, very little, down to
0.001%. Now if I repeat the experiment with the physical link at
10Mbits/sec, same thing: the 16KB wmem_default setting is no longer
enough and we need to lower the socket write buffer size again.
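
For what it's worth, I assume the same effect can be had per socket with
SO_SNDBUF instead of the global sysctl, something along these lines
(shrink_sndbuf() is just an illustrative helper, not something in my
tree):

/* Assumption: shrinking the per-socket send buffer with SO_SNDBUF
 * should be equivalent to lowering net.core.wmem_default for that one
 * socket (the kernel doubles the requested value), e.g. to get roughly
 * 16KB: */
#include <sys/socket.h>

static int shrink_sndbuf(int fd)
{
        int sndbuf = 8 * 1024;  /* kernel stores 2 * this, i.e. ~16KB */

        return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
}

If that assumption holds, it would at least confirm that the per-socket
send buffer size is the variable that matters here, not the global
default per se.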

So what I am wondering is:

- do I have an obvious flow control problem in my network driver that
usually does not lead to packet loss, but may sometimes happen?

- why would lowering the socket write buffer size appear to mask or
solve this problem?

I can consistently reproduce this across several kernel versions, 4.1,
4.9 and latest net-next and therefore can also test patches.

Thanks for reading thus far!
-- 
Florian
